
Alteryx Designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.

Available "Big data sets" over the internet...

Alteryx Partner

I'm opening this topic for everyone to list some Big data* sets available over the net.


  • Feel free to list competition/datathon data sets
  • Results of web scraping
  • Social media data
  • Anything bigger than 1 million records (beyond Excel and Access)






* Big data is data whose size is usually beyond the ability of commonly used software tools to manage and process within a tolerable elapsed time. A year-long credit card transaction history, the call detail records (CDRs) of a telecom company for the last 9 months, or the behavioral credit data of a large financial institution are some examples...



[Edited by a Moderator]

We've compiled the responses to this thread into the following Knowledge Base article: Available "Big Data Sets" on the Web



Great suggestion. It would be great if collectively we can find a few free, public big data sets that can be used for examples of different techniques in Alteryx as well.


I'll add a link for the GDELT set, which was used for the 2015 Tableau IronViz competition at their conference.  Info on that data set can be found here.


Amazon (AWS) has a Large Data Sets Repository. has close to 190k public data sets.  Of course not all of the sets there qualify as 'big' data, but it's a great source of free data.


One of the standard datasets for Hadoop is the Enron email dataset comprising emails between Enron employees during the scandal.  It's a great practice dataset for dealing with semi-structured data (file scraping, regexes, parsing, joining, etc.).


It's ~400MB (compressed) and available for download at
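As a rough illustration of the parsing practice mentioned above, here is a minimal Python sketch that turns one raw email into a flat record using the standard library's email module, then applies a regex to the body. The sample message, the addresses, and the parse_message helper are all made up for this example and are not taken from the Enron files, whose exact layout should be checked after download.

```python
import re
from email import message_from_string

# Made-up message in the usual "headers, blank line, body" layout.
# It is NOT taken from the Enron dataset.
RAW = """\
Message-ID: <123.example@mail>
Date: Mon, 14 May 2001 16:39:00 -0700
From: alice@example.com
To: bob@example.com, carol@example.com
Subject: Meeting notes

Please call 555-0100 before Friday.
"""


def parse_message(raw):
    """Split a raw email into a flat record of header fields and body."""
    msg = message_from_string(raw)
    # The To header is a comma-separated list; split it into one entry per address.
    recipients = [a.strip() for a in (msg["To"] or "").split(",") if a.strip()]
    return {
        "sender": msg["From"],
        "recipients": recipients,
        "subject": msg["Subject"],
        "body": msg.get_payload().strip(),
    }


record = parse_message(RAW)

# Regex step: find anything in the body shaped like "ddd-dddd".
phones = re.findall(r"\b\d{3}-\d{4}\b", record["body"])

print(record["sender"], record["recipients"], phones)
```

Once each message is a flat record like this, the sender and recipient fields can be joined against other tables in the usual way.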


The Million Song Dataset is a collection of audio features and metadata for a million contemporary popular music tracks. Its purposes are:

  • To encourage research on algorithms that scale to commercial sizes
  • To provide a reference dataset for evaluating research
  • As a shortcut alternative to creating a large dataset with APIs (e.g. The Echo Nest's)

The Million Song Dataset is a cluster of complementary datasets contributed by the community:

You can either download the entire dataset (280 GB) or a subset of 10,000 songs (1.8 GB) for a quick taste.



NYC taxi data sets: 1.1 billion records


Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance
An open-source exploration of the city's neighborhoods, nightlife, airport traffic, and more, through the lens of publicly available taxi and Uber data


Airline data set 1987-2008






Google BigQuery sample tables - hosted by Google


Name              Description

gsod              Contains weather information collected by NOAA, such as precipitation amounts and wind speeds from late 1929 to early 2010.
github_nested     Contains a timeline of actions such as pull requests and comments on GitHub repositories with a nested schema. Created in September 2012.
github_timeline   Contains a timeline of actions such as pull requests and comments on GitHub repositories with a flat schema. Created in May 2012.
natality          Describes all United States births registered in the 50 States, the District of Columbia, and New York City from 1969 to 2008. 137,826,763 rows.
shakespeare       Contains a word index of the works of Shakespeare, giving the number of times each word appears in each corpus.
trigrams          Contains English language trigrams from a sample of works published between 1520 and 2008.
wikipedia         Contains the complete revision history for all Wikipedia articles up to April 2010. 313,797,035 rows.



The natality dataset was used to illustrate a blog post on how the frequency of births is spread across the months of the year.
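For anyone who wants to try the same kind of births-by-month summary outside the hosted tables, here is a minimal Python sketch of the aggregation. The (year, month, births) row layout and every number in it are invented for illustration; they are an assumption about how an extract might look, not the real schema or real counts.

```python
from collections import Counter

# Hypothetical toy rows: (year, month, births).
# All values are invented; they are not real natality figures.
rows = [
    (2007, 1, 10),
    (2007, 8, 14),
    (2008, 1, 9),
    (2008, 8, 15),
    (2008, 12, 11),
]


def births_by_month(rows):
    """Sum births per calendar month across all years."""
    totals = Counter()
    for _year, month, births in rows:
        totals[month] += births
    # Return the months in calendar order.
    return dict(sorted(totals.items()))


totals = births_by_month(rows)
print(totals)
```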


I always love the Kaggle Datasets (


Datasets for all kinds of different subjects (online, stocks, retail, health, etc.)


Financial loans' public data:









Hi all,


The link I'll provide is not an actual data set; it is a data set generator that creates simulated call detail records (CDRs).

If you happen to build telecom behavioural segmentation models, propensity-to-churn models, mobility models, etc., you may start playing with that, I suppose...
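To give an idea of what simulated CDR data can look like, here is a minimal Python sketch of a toy generator. It is not the generator linked in this post: the field names, the subscriber ID format, the 30-day window, and the duration distribution are all assumptions made up for this example.

```python
import random
from datetime import datetime, timedelta


def generate_cdrs(n, n_subscribers=50, seed=0):
    """Return n simulated call records as dicts (hypothetical schema)."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    window_start = datetime(2016, 1, 1)  # arbitrary start of a 30-day window
    subscribers = [f"SUB{i:04d}" for i in range(n_subscribers)]
    records = []
    for _ in range(n):
        # Pick two distinct subscribers so nobody calls themselves.
        caller, callee = rng.sample(subscribers, 2)
        records.append({
            "caller": caller,
            "callee": callee,
            "start": window_start + timedelta(seconds=rng.randrange(30 * 24 * 3600)),
            # Exponentially distributed duration with a mean of about 120 seconds.
            "duration_s": max(1, int(rng.expovariate(1 / 120))),
            "call_type": rng.choice(["voice", "sms"]),
        })
    return records


cdrs = generate_cdrs(1000)
print(len(cdrs), cdrs[0])
```

Raising n to a few million gives a file large enough to practise segmentation or churn workflows, although real CDRs carry many more fields.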






Here is a Telecom Italia dataset, the result of a computation over the Call Detail Records (CDRs) generated by the Telecom Italia cellular network over the city of Milano.


You may have to sign in and activate your account, but it's totally free...