
Available "Big data sets" over the internet...

13 - Pulsar

I'm opening this topic for everyone to list some Big data* sets available over the net.


  • Feel free to list competition/datathon data sets
  • Results of web scraping
  • Social media data
  • Anything bigger than 1 million records (beyond Excel and Access)






* Big data is data whose size is usually beyond the ability of commonly used software tools to manage and process within a tolerable elapsed time. Some examples: a year-long credit card transaction history, the CDRs (call detail records) of a telecoms company for the last 9 months, or the behavioral credit data of a large financial institution.



[Edited by a Moderator]

We've compiled the responses to this thread into the following Knowledge Base article: Available "Big Data Sets" on the Web


12 - Quasar

Great suggestion. It would be great if collectively we can find a few free, public big data sets that can be used for examples of different techniques in Alteryx as well.


I'll add a link for the GDELT set, which was used for the 2015 Tableau IronViz competition at their conference.  Info on that data set can be found here.


Amazon (AWS) has a Large Data Sets Repository. has close to 190k public data sets. Of course not all of the sets there qualify as 'big' data, but it's a great source of free data.


One of the standard datasets for Hadoop is the Enron email dataset comprising emails between Enron employees during the scandal.  It's a great practice dataset for dealing with semi-structured data (file scraping, regexes, parsing, joining, etc.).


It's ~400MB (compressed) and available for download at
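As a rough illustration of the parsing and regex practice mentioned above, here is a minimal Python sketch that pulls a few fields out of one raw email message using only the standard library. The sample message, the addresses, and the `parse_message` helper are made up for illustration; real files in the Enron corpus may need extra handling (encodings, forwarded chains, malformed headers).

```python
import re
from email import message_from_string

# Hypothetical sample message, loosely shaped like a plain-text email file.
SAMPLE = """Message-ID: <123.example@hypothetical>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: alice@example.com
To: bob@example.com, carol@example.com
Subject: Meeting notes

Please call me at 555-0100 before Friday.
"""

# Simple pattern for email addresses; good enough for a sketch, not RFC-complete.
ADDRESS_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def parse_message(raw):
    """Split one raw message into sender, recipients, subject and body."""
    msg = message_from_string(raw)
    return {
        "sender": msg["From"],
        "recipients": ADDRESS_RE.findall(msg.get("To", "")),
        "subject": msg["Subject"],
        "body": msg.get_payload().strip(),
    }


if __name__ == "__main__":
    print(parse_message(SAMPLE))
```

Each parsed record is a flat dictionary, so a folder of messages could be turned into rows and joined on sender or recipient afterwards.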


13 - Pulsar

The Million Song Dataset is a collection of audio features and metadata for a million contemporary popular music tracks. Its purposes are:

  • To encourage research on algorithms that scale to commercial sizes
  • To provide a reference dataset for evaluating research
  • As a shortcut alternative to creating a large dataset with APIs (e.g. The Echo Nest's)

The Million Song Dataset is a cluster of complementary datasets contributed by the community:

You can either download the entire dataset (280 GB) or a subset of 10,000 songs (1.8 GB) for a quick taste.


9 - Comet

NYC taxi data sets: 1.1 billion records


Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance
An open-source exploration of the city's neighborhoods, nightlife, airport traffic, and more, through the lens of publicly available taxi and Uber data


Airline data set 1987-2008





9 - Comet

Google BigQuery sample tables - hosted by Google


Name | Description
gsod | Contains weather information collected by NOAA, such as precipitation amounts and wind speeds from late 1929 to early 2010.
github_nested | Contains a timeline of actions such as pull requests and comments on GitHub repositories with a nested schema. Created in September 2012.
github_timeline | Contains a timeline of actions such as pull requests and comments on GitHub repositories with a flat schema. Created in May 2012.
natality | Describes all United States births registered in the 50 States, the District of Columbia, and New York City from 1969 to 2008. 137,826,763 rows.
shakespeare | Contains a word index of the works of Shakespeare, giving the number of times each word appears in each corpus.
trigrams | Contains English language trigrams from a sample of works published between 1520 and 2008.
wikipedia | Contains the complete revision history for all Wikipedia articles up to April 2010. 313,797,035 rows.



The natality dataset was used to illustrate a blog post on how the frequency of births is spread across the months of the year.
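To give an idea of the kind of aggregation behind a births-by-month analysis, here is a small Python sketch. It is not the code from the blog post: the sample rows are invented, and the field names `year`, `month` and `births` are assumptions chosen for illustration, not a confirmed schema of the natality table.

```python
from collections import Counter

# Invented sample rows; field names are assumptions, not the real table schema.
SAMPLE_ROWS = [
    {"year": 2000, "month": 1, "births": 10},
    {"year": 2001, "month": 1, "births": 30},
    {"year": 2000, "month": 2, "births": 60},
]


def births_by_month(rows):
    """Sum births per calendar month across all years."""
    totals = Counter()
    for row in rows:
        totals[row["month"]] += row["births"]
    return dict(sorted(totals.items()))


def monthly_share(totals):
    """Convert monthly totals into each month's fraction of all births."""
    grand_total = sum(totals.values())
    return {month: count / grand_total for month, count in totals.items()}


if __name__ == "__main__":
    totals = births_by_month(SAMPLE_ROWS)
    print(totals)
    print(monthly_share(totals))
```

On the full table the same grouping would normally be pushed down to the database as a GROUP BY rather than done row by row in Python.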

7 - Meteor

I always love the Kaggle Datasets.


Datasets for all kinds of different subjects (online, stocks, retail, health, etc.)

9 - Comet

Financial loans' public data:








13 - Pulsar

Hi all,


The link I'll provide is not an actual data set; it is a data set generator that creates simulated call detail records (CDRs).

If you happen to build telecom behavioural segmentation models, propensity-to-churn models, mobility models, etc., you may start playing with that, I suppose...
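For anyone who wants to see roughly what simulated CDRs look like before trying a full generator, here is a minimal Python sketch. It is not the linked tool: the function name `generate_cdrs`, the field names, the 30-day window, and the exponential call-duration assumption (mean of about two minutes) are all hypothetical choices made for illustration.

```python
import random
from datetime import datetime, timedelta


def generate_cdrs(n, subscribers=100, seed=0):
    """Return n simulated call records as dictionaries (illustrative only)."""
    rng = random.Random(seed)          # seeded so runs are reproducible
    window_start = datetime(2016, 1, 1)
    window_seconds = 30 * 24 * 3600    # calls fall within a 30-day window
    records = []
    for _ in range(n):
        # Pick two distinct subscribers so nobody calls themselves.
        caller, callee = rng.sample(range(subscribers), 2)
        records.append({
            "caller_id": f"SUB{caller:05d}",
            "callee_id": f"SUB{callee:05d}",
            "start_time": window_start + timedelta(seconds=rng.randrange(window_seconds)),
            # Assumed exponential durations with a mean of 120 seconds, at least 1 second.
            "duration_s": max(1, round(rng.expovariate(1 / 120))),
        })
    return records


if __name__ == "__main__":
    for record in generate_cdrs(5):
        print(record)
```

Raising `n` into the millions gives a file large enough to test churn or segmentation workflows, although real CDRs have far richer structure (cell towers, call types, roaming flags).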





13 - Pulsar

Here is a Telecom Italia dataset, the result of a computation over the Call Detail Records (CDRs) generated by the Telecom Italia cellular network over the city of Milano.


You may have to sign in and activate your account, but it's totally free...