
SOLVED

Available "Big data sets" over the internet...

Atabarezz
13 - Pulsar

I'm opening this topic for everyone to list some Big data* sets available over the net.

 

  • Feel free to list competition/datathon data sets
  • Results of web scraping
  • Social media data
  • Anything bigger than 1 million records (beyond Excel and Access)

 

Best

 

Altan

 

* Big data is data whose size is usually beyond the ability of commonly used software tools to manage and process within a tolerable elapsed time. Some examples: a year-long credit card transaction history, the call data records (CDRs) of a telecoms company for the last 9 months, or the behavioral credit data of a large financial institution.
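
For anyone wondering what "beyond Excel and Access" means in practice, here is a minimal sketch of one common workaround: read the file one row at a time and keep only a running aggregate in memory. The column names `card_id` and `amount` and the tiny in-memory sample are made up for illustration; a real file would be opened with `open(path, newline="")` instead.

```python
import csv
import io
from collections import defaultdict


def total_by_key(lines, key_field, value_field):
    """Sum value_field per key_field while reading one row at a time."""
    totals = defaultdict(float)
    # csv.DictReader pulls rows lazily, so the whole file is never held in memory.
    for row in csv.DictReader(lines):
        totals[row[key_field]] += float(row[value_field])
    return dict(totals)


# Tiny in-memory stand-in for a multi-gigabyte transaction file (hypothetical columns).
sample = io.StringIO(
    "card_id,amount\n"
    "A,10.50\n"
    "B,3.00\n"
    "A,4.50\n"
)
totals = total_by_key(sample, "card_id", "amount")
print(totals)
```

Memory use grows with the number of distinct keys rather than the number of rows, which is usually what makes this approach workable on files too large for a spreadsheet.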

 

--------------

[Edited by a Moderator]

We've compiled the responses to this thread into the following Knowledge Base article: Available "Big Data Sets" on the Web

--------------

62 REPLIES
dataMack
12 - Quasar

Great suggestion. It would be great if collectively we can find a few free, public big data sets that can be used for examples of different techniques in Alteryx as well.

 

I'll add a link for the GDELT set, which was used for the 2015 Tableau IronViz competition at their conference.  Info on that data set can be found here.

 

Amazon (AWS) has a Large Data Sets Repository.

 

Data.gov has close to 190k public data sets.  Of course, not all of the sets there qualify as 'big' data, but it's a great source of free data.

SteveA
Alteryx

One of the standard datasets for Hadoop is the Enron email dataset comprising emails between Enron employees during the scandal.  It's a great practice dataset for dealing with semi-structured data (file scraping, regexes, parsing, joining, etc.).

 

It's ~400MB (compressed) and available for download at http://www.cs.cmu.edu/~enron/.
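
As a rough idea of the parsing practice this kind of data invites, here is a minimal sketch that splits one raw message into headers and body with Python's standard `email` module and then pulls addresses out with a regex. The sample message is invented (it is not taken from the corpus), the assumption is that messages are plain RFC 822-style text, and the address pattern is deliberately simple rather than fully standards-compliant.

```python
import re
from email import message_from_string

# Invented sample in RFC 822 style; NOT a real message from the Enron corpus.
RAW = """\
Message-ID: <123.example@thyme>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: alice@example.com
To: bob@example.com, carol@example.com
Subject: Meeting notes

Please forward to dave@example.com before Friday.
"""

# Simplified address pattern, good enough for a first pass over messy text.
ADDRESS = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def parse_message(raw):
    """Turn one raw message into a flat record suitable for joining later."""
    msg = message_from_string(raw)
    body = msg.get_payload()
    return {
        "sender": msg["From"],
        "recipients": ADDRESS.findall(msg["To"] or ""),
        "subject": msg["Subject"],
        "mentioned": ADDRESS.findall(body),
    }


record = parse_message(RAW)
print(record)
```

Running something like this over every file in the archive would give one row per message, which can then be joined on sender or recipient.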

 

Atabarezz
13 - Pulsar

http://labrosa.ee.columbia.edu/millionsong/ It is a collection of audio features and metadata for a million contemporary popular music tracks. Its purposes are:

  • To encourage research on algorithms that scale to commercial sizes
  • To provide a reference dataset for evaluating research
  • As a shortcut alternative to creating a large dataset with APIs (e.g. The Echo Nest's)

The Million Song Dataset is a cluster of complementary datasets contributed by the community:

You can either download the entire dataset (280 GB) or a subset of 10,000 songs (1.8 GB) for a quick taste.

 

Cristian
9 - Comet

NYC taxi data sets: 1.1 billion records

http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

 

Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance
An open-source exploration of the city's neighborhoods, nightlife, airport traffic, and more, through the lens of publicly available taxi and Uber data

 

Airline data set 1987-2008

https://github.com/h2oai/h2o-2/wiki/Hacking-Airline-DataSet-with-H2O

 

Cristian.

 

 

Cristian
9 - Comet

Google BigQuery sample tables - hosted by Google

https://cloud.google.com/bigquery/sample-tables

 

Name: Description

  • gsod: Contains weather information collected by NOAA, such as precipitation amounts and wind speeds from late 1929 to early 2010.
  • github_nested: Contains a timeline of actions such as pull requests and comments on GitHub repositories with a nested schema. Created in September 2012.
  • github_timeline: Contains a timeline of actions such as pull requests and comments on GitHub repositories with a flat schema. Created in May 2012.
  • natality (137,826,763 rows): Describes all United States births registered in the 50 States, the District of Columbia, and New York City from 1969 to 2008.
  • shakespeare: Contains a word index of the works of Shakespeare, giving the number of times each word appears in each corpus.
  • trigrams: Contains English language trigrams from a sample of works published between 1520 and 2008.
  • wikipedia (313,797,035 rows): Contains the complete revision history for all Wikipedia articles up to April 2010.

 

 

The natality dataset was used to illustrate a blog post on how births are distributed across the months of the year.

http://knowmore.washingtonpost.com/2015/03/31/chart-winter-really-is-baby-making-time/

_hans1
7 - Meteor

I always love the Kaggle Datasets (https://www.kaggle.com/).

 

It has datasets on all kinds of subjects (online, stocks, retail, health, etc.).

Cristian
9 - Comet

Public data on financial loans:

 

BONDORA

https://www.bondora.ee/en/invest/statistics/data_export

 

LENDING CLUB

https://www.lendingclub.com/info/download-data.action

 

Regards,

Cristian

Atabarezz
13 - Pulsar

Hi all,

 

The link I'll provide is not an actual data set; it is a data set generator that creates simulated call detail records (CDRs).

If you happen to build telecom behavioural segmentation models, propensity-to-churn models, mobility models, etc., you may start playing with it, I suppose...

 

http://www.gedis-studio.com/online-call-detail-records-cdr-generator.html
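
If you only need a quick local stand-in, the same idea can be sketched in a few lines of Python. This is not how the generator linked above works; the field names, the subscriber numbering, and the uniform random distributions are all assumptions for illustration, and real call behaviour is far less uniform.

```python
import random
from datetime import datetime, timedelta


def generate_cdrs(n_records, n_subscribers=100, seed=0):
    """Create simulated call records with made-up fields (needs n_subscribers >= 2)."""
    rng = random.Random(seed)  # seeded so the same input gives the same records
    subscribers = [f"555{i:07d}" for i in range(n_subscribers)]
    start = datetime(2016, 1, 1)
    records = []
    for _ in range(n_records):
        # sample() picks two distinct subscribers, so nobody calls themselves
        caller, callee = rng.sample(subscribers, 2)
        records.append({
            "caller": caller,
            "callee": callee,
            # random start time within a 30-day window
            "start_time": start + timedelta(seconds=rng.randrange(30 * 24 * 3600)),
            "duration_s": rng.randint(1, 3600),
            "call_type": rng.choice(["voice", "sms"]),
        })
    return records


cdrs = generate_cdrs(1000)
print(cdrs[0])
```

Raising `n_records` into the millions gives a file large enough to practise segmentation or churn-style aggregations on, although any patterns found in it are artifacts of the random generator.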

 

Best

 

Altan

Atabarezz
13 - Pulsar

Here is a Telecom Italia dataset, the result of a computation over the Call Detail Records (CDRs) generated by the Telecom Italia cellular network over the city of Milano.

 

You may have to sign in and activate your account, but it's totally free...

 

https://dandelion.eu/datagems/SpazioDati/telecom-sms-call-internet-mi/description/
