I'm opening this topic so that everyone can list some Big Data* sets available on the web.
Best
Altan
* Big data is data whose size is usually beyond the ability of commonly used software tools to manage and process within a tolerable elapsed time. Some examples: a year-long credit card transaction history, the CDRs (call detail records) of a telecom company for the last 9 months, or the behavioral credit data of a large financial institution...
--------------
[Edited by a Moderator]
We've compiled the responses to this thread into the following Knowledge Base article: Available "Big Data Sets" on the Web
--------------
Great suggestion. It would be great if collectively we can find a few free, public big data sets that can be used for examples of different techniques in Alteryx as well.
I'll add a link for the GDELT set, which was used for the 2015 Tableau IronViz competition at their conference. Info on that data set can be found here.
Amazon (AWS) has a Large Data Sets Repository.
Data.gov has close to 190k public data sets. Of course, not all of the sets there qualify as 'big' data, but it's a great source of free data.
One of the standard datasets for Hadoop is the Enron email dataset comprising emails between Enron employees during the scandal. It's a great practice dataset for dealing with semi-structured data (file scraping, regexes, parsing, joining, etc.).
It's ~400MB (compressed) and available for download at http://www.cs.cmu.edu/~enron/.
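For anyone who wants a feel for the parsing and regex side before downloading the archive, here is a minimal sketch using only the Python standard library. The sample message is made up for illustration (it is not taken from the corpus), and the helper name `parse_message` and the phone-number pattern are my own assumptions, so adjust them to whatever you find in the real files.

```python
import re
from email import message_from_string
from email.utils import getaddresses

# A made-up message in the header/body style of plain-text email files.
# It is NOT taken from the Enron corpus.
SAMPLE = """\
Message-ID: <123.example@example.com>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: alice@example.com
To: bob@example.com, carol@example.com
Subject: Q2 forecast

Please call me at 555-0100 about the forecast.
"""

def parse_message(raw):
    """Split one raw email into structured fields (hypothetical helper)."""
    msg = message_from_string(raw)
    # A "To" header can hold several comma-separated addresses.
    recipients = [addr for _, addr in getaddresses(msg.get_all("To", []))]
    body = msg.get_payload()
    # Illustrative regex: 7-digit phone numbers written as 555-0100.
    phones = re.findall(r"\b\d{3}-\d{4}\b", body)
    return {
        "sender": msg["From"],
        "recipients": recipients,
        "subject": msg["Subject"],
        "phones": phones,
    }

record = parse_message(SAMPLE)
print(record)
```

Each message becomes one flat record, so the sender and recipient fields can then be joined against other tables in the tool of your choice.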
http://labrosa.ee.columbia.edu/millionsong/ It is a collection of audio features and metadata for a million contemporary popular music tracks.
The Million Song Dataset is also a cluster of complementary datasets contributed by the community.
You can either download the entire dataset (280 GB) or a subset of 10,000 songs (1.8 GB) for a quick taste.
NYC taxi data sets, 1.1 billion records
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance
An open-source exploration of the city's neighborhoods, nightlife, airport traffic, and more, through the lens of publicly available taxi and Uber data
Airline data set 1987-2008
https://github.com/h2oai/h2o-2/wiki/Hacking-Airline-DataSet-with-H2O
Cristian.
Google BigQuery sample tables - hosted by Google
https://cloud.google.com/bigquery/sample-tables
Name | Description |
gsod | Contains weather information collected by NOAA, such as precipitation amounts and wind speeds from late 1929 to early 2010. |
github_nested | Contains a timeline of actions such as pull requests and comments on GitHub repositories with a nested schema. Created in September 2012. |
github_timeline | Contains a timeline of actions such as pull requests and comments on GitHub repositories with a flat schema. Created in May 2012. |
natality | Describes all United States births registered in the 50 States, the District of Columbia, and New York City from 1969 to 2008. 137,826,763 rows. |
shakespeare | Contains a word index of the works of Shakespeare, giving the number of times each word appears in each corpus. |
trigrams | Contains English language trigrams from a sample of works published between 1520 and 2008. |
wikipedia | Contains the complete revision history for all Wikipedia articles up to April 2010. 313,797,035 rows. |
The natality dataset was used to illustrate a blog post about how the frequency of births is spread across the months of the year.
http://knowmore.washingtonpost.com/2015/03/31/chart-winter-really-is-baby-making-time/
I always love the Kaggle Datasets (https://www.kaggle.com/).
It has datasets on all kinds of different subjects (online, stocks, retail, health, etc.).
Public data on financial loans:
BONDORA
https://www.bondora.ee/en/invest/statistics/data_export
LENDING CLUB
https://www.lendingclub.com/info/download-data.action
Regards,
Cristian
Hi all,
The link I'll provide is not an actual data set; it is a data set generator that creates simulated call detail records (CDRs).
If you happen to build telecom behavioural segmentation, propensity-to-churn, or mobility models, you could start playing with it, I suppose...
http://www.gedis-studio.com/online-call-detail-records-cdr-generator.html
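If you just need a quick local sample, the idea of simulating CDRs can be sketched in a few lines of Python. This is only a rough illustration and not how the generator above works: the function name `generate_cdrs`, the field names, the 30-day window, and the value ranges are all assumptions I made up for the example.

```python
import random
from datetime import datetime, timedelta

def generate_cdrs(n, subscribers=50, seed=42):
    """Return n simulated call records (hypothetical schema)."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    # Made-up subscriber numbers such as 5550000, 5550001, ...
    numbers = [f"555{idx:04d}" for idx in range(subscribers)]
    window_start = datetime(2016, 1, 1)
    window_seconds = 30 * 24 * 3600  # assumed 30-day observation window
    records = []
    for _ in range(n):
        # sample() picks two distinct numbers, so nobody calls themselves.
        caller, callee = rng.sample(numbers, 2)
        offset = timedelta(seconds=rng.randrange(window_seconds))
        records.append({
            "caller": caller,
            "callee": callee,
            "start_time": window_start + offset,
            "duration_sec": rng.randint(1, 3600),
            "call_type": rng.choice(["voice", "sms"]),
        })
    return records

cdrs = generate_cdrs(1000)
for row in cdrs[:3]:
    print(row)
```

Uniform random values like these will not show realistic calling patterns, so treat the output as plumbing test data rather than something to fit a churn model on.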
Best
Altan
Here is a Telecom Italia dataset, the result of a computation over the Call Detail Records (CDRs)
generated by the Telecom Italia cellular network over the city of Milano.
You may have to sign in and activate your account, but it's totally free...
https://dandelion.eu/datagems/SpazioDati/telecom-sms-call-internet-mi/description/