General Discussions

Atabarezz · ‎11-22-2015

I'm opening this topic for everyone to list some Big data* sets available over the net.

Feel free to list competition/datathon data sets
Results of web scraping
Social media data
Anything bigger than 1 million records (beyond Excel and Access)

Best

Altan

* Big data is data whose size is usually beyond the ability of commonly used software tools to manage and process within a tolerable elapsed time. A year-long credit card transaction history, the CDRs (call detail records) of a telecoms company for the last 9 months, or the behavioral credit data of a large financial institution are some examples...

--------------

[Edited by a Moderator]

We've compiled the responses to this thread into the following Knowledge Base article: Available "Big Data Sets" on the Web

--------------

dataMack · ‎11-22-2015

Great suggestion. It would be great if we could collectively find a few free, public big data sets that can also be used as examples of different techniques in Alteryx.

I'll add a link for the GDELT set, which was used for the 2015 Tableau IronViz competition at their conference. Info on that data set can be found here.

Amazon (AWS) has a Large Data Sets Repository.

Data.gov has close to 190k public data sets. Of course not all of the sets there qualify as 'big' data, but it's a great source of free data.

SteveA · ‎11-23-2015

One of the standard datasets for Hadoop is the Enron email dataset comprising emails between Enron employees during the scandal. It's a great practice dataset for dealing with semi-structured data (file scraping, regexes, parsing, joining, etc.).

It's ~400MB (compressed) and available for download at http://www.cs.cmu.edu/~enron/.
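
As a rough illustration of the parsing step mentioned above, here is a minimal Python sketch that pulls the headers and body out of one raw message using the standard library's email module. The sample message is made up for the example (it is not taken from the corpus), and the function name parse_message is hypothetical.

```python
from email import message_from_string
from email.utils import getaddresses

# Made-up message in the usual header/blank-line/body layout; not from the corpus.
RAW = """\
Message-ID: <1234.example@example.com>
Date: Mon, 14 May 2001 16:39:00 -0700
From: alice@example.com
To: bob@example.com, carol@example.com
Subject: Quarterly numbers

Please see the attached figures.
"""

def parse_message(raw):
    """Return sender, recipient list, subject and body of one raw email."""
    msg = message_from_string(raw)
    # A "To" header may hold several comma-separated addresses.
    recipients = [addr for _, addr in getaddresses(msg.get_all("To", []))]
    return {
        "from": msg["From"],
        "to": recipients,
        "subject": msg["Subject"],
        "body": msg.get_payload().strip(),
    }

record = parse_message(RAW)
print(record["from"], "->", record["to"])
```

Running the same function over every file in the extracted archive would give one flat record per message, which is a convenient shape for the joining step.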

Atabarezz · ‎11-25-2015

http://labrosa.ee.columbia.edu/millionsong/ It is a collection of audio features and metadata for a million contemporary popular music tracks. Its purposes are:

To encourage research on algorithms that scale to commercial sizes
To provide a reference dataset for evaluating research
As a shortcut alternative to creating a large dataset with APIs (e.g. The Echo Nest's)

The Million Song Dataset is a cluster of complementary datasets contributed by the community:

SecondHandSongs dataset -> cover songs
musiXmatch dataset -> lyrics
Last.fm dataset -> song-level tags and similarity
Taste Profile subset -> user data
thisismyjam-to-MSD mapping -> more user data
tagtraum genre annotations -> genre labels
Top MAGD dataset -> more genre labels

You can either download the entire dataset (280 GB) or a subset of 10,000 songs (1.8 GB) for a quick taste.

Cristian · ‎12-18-2015

NYC taxi data sets, 1.1 billion records

http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance
An open-source exploration of the city's neighborhoods, nightlife, airport traffic, and more, through the lens of publicly available taxi and Uber data

Airline data set 1987-2008

https://github.com/h2oai/h2o-2/wiki/Hacking-Airline-DataSet-with-H2O

Cristian.

Cristian · ‎12-22-2015

Google BigQuery sample tables - hosted by Google

https://cloud.google.com/bigquery/sample-tables

Name	Description
gsod	Contains weather information collected by NOAA, such as precipitation amounts and wind speeds from late 1929 to early 2010.
github_nested	Contains a timeline of actions such as pull requests and comments on GitHub repositories with a nested schema. Created in September 2012.
github_timeline	Contains a timeline of actions such as pull requests and comments on GitHub repositories with a flat schema. Created in May 2012.
natality (137,826,763 rows)	Describes all United States births registered in the 50 States, the District of Columbia, and New York City from 1969 to 2008.
shakespeare	Contains a word index of the works of Shakespeare, giving the number of times each word appears in each corpus.
trigrams	Contains English language trigrams from a sample of works published between 1520 and 2008.
wikipedia (313,797,035 rows)	Contains the complete revision history for all Wikipedia articles up to April 2010.

The natality dataset was used in a blog post on how the frequency of births varies by month of the year.

http://knowmore.washingtonpost.com/2015/03/31/chart-winter-really-is-baby-making-time/
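
For anyone who wants to try a similar breakdown, here is a minimal Python sketch of counting births per month. It assumes the rows have already been exported locally as dicts with a month field; the field names and the sample rows are assumptions made for illustration, not the real table contents.

```python
from collections import Counter

# Illustrative rows only; the real table has roughly 137 million of them.
rows = [
    {"year": 2008, "month": 1},
    {"year": 2008, "month": 9},
    {"year": 2008, "month": 9},
    {"year": 2007, "month": 8},
]

def births_by_month(records):
    """Count records per calendar month (1-12), ignoring the year."""
    counts = Counter(r["month"] for r in records)
    # Fill in months with no records so the result always has 12 entries.
    return {m: counts.get(m, 0) for m in range(1, 13)}

monthly = births_by_month(rows)
busiest = max(monthly, key=monthly.get)
print(monthly, busiest)
```

On the full table this kind of grouping is better pushed down to the query engine; the sketch only shows the shape of the aggregation.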

_hans1 · ‎12-23-2015

I always love the Kaggle Datasets (https://www.kaggle.com/).

Datasets on all kinds of subjects (online, stocks, retail, health, etc.)

Cristian · ‎12-31-2015

Public data on financial loans:

BONDORA

https://www.bondora.ee/en/invest/statistics/data_export

LENDING CLUB

https://www.lendingclub.com/info/download-data.action

Regards,

Cristian

Atabarezz · ‎01-03-2016

Hi all,

The link I'll provide is not an actual data set; it is a data set generator that creates simulated call detail records (CDRs).

If you happen to be building telecom behavioural segmentation models, propensity-to-churn models, mobility models, etc., you may start playing with that, I suppose...

http://www.gedis-studio.com/online-call-detail-records-cdr-generator.html
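
A few simulated records can also be generated locally for quick experiments. This is only a minimal Python sketch: the function name simulate_cdrs, the field names (caller, callee, start, duration_s) and the value ranges are assumptions chosen for illustration, not the format the generator above produces.

```python
import random
from datetime import datetime, timedelta

def simulate_cdrs(n, n_subscribers=100, seed=42):
    """Yield n made-up call records; fields and ranges are illustrative."""
    rng = random.Random(seed)  # seeded so repeated runs give the same data
    window_start = datetime(2016, 1, 1)
    for _ in range(n):
        # Pick two distinct subscriber ids so nobody calls themselves.
        caller, callee = rng.sample(range(n_subscribers), 2)
        yield {
            "caller": f"SUB{caller:05d}",
            "callee": f"SUB{callee:05d}",
            # Random start time within a 30-day window.
            "start": window_start + timedelta(seconds=rng.randrange(86400 * 30)),
            "duration_s": rng.randint(1, 3600),
        }

cdrs = list(simulate_cdrs(1000))
print(cdrs[0])
```

Because it is a generator, the same function can stream millions of rows to a file without holding them all in memory.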

Best

Altan

Atabarezz · ‎01-03-2016

Here is a Telecom Italia dataset, the result of a computation over the Call Detail Records (CDRs) generated by the Telecom Italia cellular network over the city of Milano.

You may have to sign in and activate your account, but it's totally free...

https://dandelion.eu/datagems/SpazioDati/telecom-sms-call-internet-mi/description/

Available "Big data sets" over the internet...