Available "Big data sets" over the internet...

Question

I'm opening this topic for everyone to list some Big data* sets available over the net.

* Feel free to list competition/datathon data sets
* Results of web scraping
* Social media data
* Anything bigger than 1 million records (beyond Excel and Access)

Best

Altan

* Big data is data whose size is usually beyond the ability of commonly used software tools to manage and process within a tolerable elapsed time. Some examples are a year-long credit card transaction history, the CDRs (call detail records) of a telecom company for the last 9 months, or the behavioral credit data of a large financial institution...

--------------

[Edited by a Moderator]

We've compiled the responses to this thread into the following Knowledge Base article: Available "Big Data Sets" on the Web

--------------

dataMack · Accepted Answer

Great suggestion. It would be great if collectively we can find a few free, public big data sets that can be used for examples of different techniques in Alteryx as well.

I'll add a link for the GDELT set, which was used for the 2015 Tableau IronViz competition at their conference.  Info on that data set can be found here.

Amazon (AWS) has a Large Data Sets Repository.

Data.gov has close to 190k public data sets.  Of course not all of the sets there qualify as 'big' data, but it's a great source of free data.

stevea · Accepted Answer

One of the standard datasets for Hadoop is the Enron email dataset comprising emails between Enron employees during the scandal.  It's a great practice dataset for dealing with semi-structured data (file scraping, regexes, parsing, joining, etc.).

It's ~400MB (compressed) and available for download at http://www.cs.cmu.edu/~enron/.
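As a rough illustration of the parsing and regex practice mentioned above, here is a minimal sketch using only Python's standard library. The sample message, its addresses, and the `parse_message` helper are made up for illustration and are not taken from the Enron corpus itself; real messages in the download may need extra handling (encodings, forwarded chains, etc.).

```python
import re
from email import message_from_string

# Hypothetical sample in the usual "headers, blank line, body" mail layout.
RAW_MESSAGE = """\
Message-ID: <0001.example@hypothetical>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: alice@example.com
To: bob@example.com, carol@example.com
Subject: Quarterly forecast

Please send the numbers to dave@example.com by Friday.
"""

# Simple (not RFC-complete) pattern for e-mail addresses in free text.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def parse_message(raw):
    """Split one raw message into a flat record of header fields and body mentions."""
    msg = message_from_string(raw)
    body = msg.get_payload()
    recipients = [a.strip() for a in (msg["To"] or "").split(",") if a.strip()]
    return {
        "sender": msg["From"],
        "recipients": recipients,
        "subject": msg["Subject"],
        "mentioned": EMAIL_RE.findall(body),
    }


record = parse_message(RAW_MESSAGE)
print(record)
```

Records shaped like this (sender, recipients, subject) are easy to load into a table and join, which is the kind of exercise the dataset is handy for.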

Atabarezz · Accepted Answer

http://labrosa.ee.columbia.edu/millionsong/ It is a collection of audio features and metadata for a million contemporary popular music tracks. Its purposes are:

* To encourage research on algorithms that scale to commercial sizes
* To provide a reference dataset for evaluating research
* As a shortcut alternative to creating a large dataset with APIs (e.g. The Echo Nest's)

The Million Song Dataset is a cluster of complementary datasets contributed by the community:

* SecondHandSongs dataset -> cover songs
* musiXmatch dataset -> lyrics
* Last.fm dataset -> song-level tags and similarity
* Taste Profile subset -> user data
* thisismyjam-to-MSD mapping -> more user data
* tagtraum genre annotations -> genre labels
* Top MAGD dataset -> more genre labels

You can either download the entire dataset (280 GB) or a subset of 10,000 songs (1.8 GB) for a quick taste.

Cristian · Accepted Answer

NY City taxi data sets 1.1BN records

http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance
An open-source exploration of the city's neighborhoods, nightlife, airport traffic, and more, through the lens of publicly available taxi and Uber data

Airline data set 1987-2008

https://github.com/h2oai/h2o-2/wiki/Hacking-Airline-DataSet-with-H2O

Cristian.

Cristian · Accepted Answer

Google BigQuery sample tables - hosted by Google

https://cloud.google.com/bigquery/sample-tables

Name: Description

* gsod: Contains weather information collected by NOAA, such as precipitation amounts and wind speeds from late 1929 to early 2010.
* github_nested: Contains a timeline of actions such as pull requests and comments on GitHub repositories with a nested schema. Created in September 2012.
* github_timeline: Contains a timeline of actions such as pull requests and comments on GitHub repositories with a flat schema. Created in May 2012.
* natality (137,826,763 rows): Describes all United States births registered in the 50 States, the District of Columbia, and New York City from 1969 to 2008.
* shakespeare: Contains a word index of the works of Shakespeare, giving the number of times each word appears in each corpus.
* trigrams: Contains English language trigrams from a sample of works published between 1520 and 2008.
* wikipedia (313,797,035 rows): Contains the complete revision history for all Wikipedia articles up to April 2010.

The natality dataset was used to illustrate a blog post about how the frequency of births is spread across the months of the year.

http://knowmore.washingtonpost.com/2015/03/31/chart-winter-really-is-baby-making-time/

_hans1 · Accepted Answer

I always love the Kaggle Datasets (https://www.kaggle.com/).

There are datasets on all kinds of subjects (online, stocks, retail, health, etc.).

Cristian · Accepted Answer

Financial loans' public data:

BONDORA

https://www.bondora.ee/en/invest/statistics/data_export

LENDING CLUB

https://www.lendingclub.com/info/download-data.action

Regards,

Cristian

Atabarezz · Accepted Answer

Hi all,

The link I'll provide is not an actual data set; it is a data set generator that creates simulated call detail records (CDRs).

If you happen to build telecom behavioural segmentation models, propensity-to-churn models, mobility models, etc., you may start playing with it, I suppose...

http://www.gedis-studio.com/online-call-detail-records-cdr-generator.html

Best

Altan

Atabarezz · Accepted Answer

Here is a Telecom Italia dataset, the result of a computation over the Call Detail Records (CDRs) generated by the Telecom Italia cellular network over the city of Milano.

You may have to sign in and activate your account, but it's totally free...

https://dandelion.eu/datagems/SpazioDati/telecom-sms-call-internet-mi/description/

Cristian · Accepted Answer

@Atabarezz

This is for you!

Temporal networks with igraph and R (with 20 lines of code!)

Regards,

Cristian.

GarthM · Accepted Answer

I haven't fully vetted this list myself, but I expect there to be a few good resources in there:

http://www.datasciencecentral.com/profiles/blogs/big-data-sets-available-for-free

raphaelrosati · Accepted Answer

Here are some more....

KDnuggets is a well-respected analytics blog, and they have put together a very nice and deep list:

http://www.kdnuggets.com/datasets/index.html

UK Data

https://data.gov.uk/data

Google's Public Data Directory:

http://www.google.com/publicdata/directory

For the Spatial and GIS folks:

http://gisgeography.com/best-free-gis-data-sources-raster-vector/

Aamir · Accepted Answer

I give to you the mother of big datasets - Reddit. 1.7bn JSON objects; 250GB compressed.

https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment

ategg · Accepted Answer

Loads of really great links from here as well:

https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public

pvijendr · Accepted Answer

Here you go!! IMDb datasets:

http://www.imdb.com/interfaces

A subset of the IMDb plain text data files is available from their FTP sites as follows:

* ftp.fu-berlin.de (Germany)
* ftp.funet.fi (Finland)

Thanks,

PV

Cristian · Accepted Answer

Twitter 2010 data set http://an.kaist.ac.kr/traces/WWW2010.html

andrewdatakim · Accepted Answer

One of my favorites is Data.gov, where there are tons of public data from all sectors, in sets of different sizes and in different formats, including API connections. This URL, http://www.data.gov/open-gov/, shows each of the local governments in the US. They have varying degrees of completion at the local level.

wymanb · Accepted Answer

Here is a link to a list of data sources that I compiled a while back.  Hope it helps!

https://www.linkedin.com/pulse/need-data-bob-wyman?trk=mp-author-card

tom_montpool · Accepted Answer

The Government of Canada has an Open Data portal -- http://open.canada.ca/en/open-data -- it takes some digging to find the gems, but there are some.

There's also some open mapping data at -- http://open.canada.ca/en/open-maps.

I don't know how many of these qualify as "Big data sets"...but there are a few.

Cristian · Accepted Answer

This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

https://cloud.google.com/bigquery/public-data/github

Cristian · Accepted Answer

15.49TB of research data available.
http://academictorrents.com/

A scalable, secure, and fault-tolerant repository for data, with blazing fast download speeds.

Regards,

Cristian

asslam · Accepted Answer

Australia, New South Wales Open data

http://data.nsw.gov.au/

A mixture of data sets from different government departments. As Jason says, not all would qualify as big data.

BrianO · Accepted Answer

http://usafacts.org

This site just opened up and has tons of data. It looks like the ability to download each set is "coming soon" as the site is in beta at the time of this posting.

RithiS · Accepted Answer

https://public.enigma.com/ - Enigma Public describes itself as "the world’s broadest collection of public data."

gnans19 · Accepted Answer

https://data.world/

Interesting datasets to enrich our data.

Cristian · Accepted Answer

Vessel Traffic Data

https://marinecadastre.gov/ais/

11 billion rows of public ship AIS data to explore, spanning from 2009 to 2014

Atabarezz · Accepted Answer

Search New York Times articles from 1851 to today... Wow!

* retrieving headlines,
* abstracts and 
* links to associated multimedia.

You can also search;

* book reviews, 
* movie reviews,
* NYC event listings,  
* top stories with images and more.

Here is the link: https://developer.nytimes.com/

Merry Xmas!

Altan @Atabarezz

altan.atabarut@altdata.co

Cristian · Accepted Answer

http://www.butleranalytics.com/big-data-70-amazing-free-data-sources/

asabau · Accepted Answer

Here is a good source for Financial data. Some of it is free and nice to play around with.

https://www.quandl.com/

LordNeilLord · Accepted Answer

FiveThirtyEight have opened up all of their datasets:

https://data.fivethirtyeight.com/

Cristian · Accepted Answer

https://www.sajari.com/public-data

GarthM · Accepted Answer

I didn't see this in previous posts, so here:

SQuAD Dataset

(Stanford Question Answering Dataset)

apologies if this is a duplicate

Atabarezz · Accepted Answer

Hope you have heard of process mining. It's essentially the same as data mining: analyzing data from different perspectives and summarizing it into insights that can be used when making business decisions.

But this time the context is the business processes of an organization. In process mining, event logs (data that exists in the information systems of a company) are used to visualize and benchmark what is actually happening in the company’s processes and how they are executed in real life.

Almost all IT systems store data in databases and create logs that can be described in process mining terms as event data.

So below you'll be able to reach many different event log data sets and start doing your process mining tasks using @Alteryx

http://data.4tu.nl/repository/collection:event_logs_real

above pic represents an inflection point where there is huge processing time... potentially a "bottleneck"
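To make the bottleneck idea concrete, here is a minimal sketch that computes the average waiting time between consecutive activities in an event log and picks the slowest step. The log rows, activity names, and the `mean_transition_hours` helper are hypothetical; the event logs at the link above come in their own formats and would need to be loaded into (case, activity, timestamp) rows first.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical event log: (case_id, activity, timestamp).
EVENT_LOG = [
    ("case-1", "Order received", "2024-01-01 09:00"),
    ("case-1", "Credit check", "2024-01-01 10:00"),
    ("case-1", "Shipped", "2024-01-03 10:00"),
    ("case-2", "Order received", "2024-01-02 09:00"),
    ("case-2", "Credit check", "2024-01-02 09:30"),
    ("case-2", "Shipped", "2024-01-05 09:30"),
]


def mean_transition_hours(event_log):
    """Average elapsed hours for each (activity -> next activity) step."""
    # Group events by case so each process instance is handled on its own.
    cases = defaultdict(list)
    for case_id, activity, ts in event_log:
        cases[case_id].append((datetime.strptime(ts, "%Y-%m-%d %H:%M"), activity))

    durations = defaultdict(list)
    for events in cases.values():
        events.sort()  # chronological order within the case
        for (t0, a0), (t1, a1) in zip(events, events[1:]):
            durations[(a0, a1)].append((t1 - t0).total_seconds() / 3600)

    return {step: sum(hours) / len(hours) for step, hours in durations.items()}


means = mean_transition_hours(EVENT_LOG)
bottleneck = max(means, key=means.get)  # step with the longest average wait
print(means)
print("Slowest step:", bottleneck)
```

In this made-up log the step from "Credit check" to "Shipped" takes far longer on average than the step before it, which is the kind of inflection point one would flag as a potential bottleneck.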

GarthM · Accepted Answer

found another one!

http://stapi.co/

Atabarezz · Accepted Answer

https://research.google.com/youtube8m/

Cristian · Accepted Answer

FYI.

CData ODBC Driver for YouTube 2017

http://cdn.cdata.com/help/CGB/odbc/default.htm

Hakimipous · Accepted Answer

The kind of post I was looking for !

Here is https://developers.themoviedb.org/3/getting-started/introduction

The API key is free to use and can provide various data, which makes it a good exercise for working with APIs and data viz.

Cristian · Accepted Answer

SecRepo.com - Samples of Security Related Data

Cristian · Accepted Answer

INDEX of COMPLEX NETWORKS / GRAPHS

https://icon.colorado.edu/#!/networks

dataMack · Accepted Answer

Google released a 'Dataset Search' service today that will likely make it easier to find datasets across different sites:

https://toolbox.google.com/datasetsearch

https://www.blog.google/products/search/making-it-easier-discover-datasets/

GarthMiles · Accepted Answer

apologies if this is a repost:

Data Is Plural

Atabarezz · Accepted Answer

https://msropendata.com/

was announced recently, just check the big data sets out... Awesome

NeilR · Accepted Answer

Many of these have already been mentioned, but a decent roundup post: The 50 Best Public Datasets for Machine Learning

ZacharyM · Accepted Answer

Not all of the datasets here are 'big data', but this is a great tool that I use for coming up with fun/creative datasets for demos and such;

https://toolbox.google.com/datasetsearch

OldDogNewTricks · Accepted Answer

This may be listed in some of the other links that aggregate other data sets within this thread but I didn't see it mentioned independently.  Here is the UCI Machine Learning data repository:  link

Atabarezz · Accepted Answer

Patent analytics is a great dimension that utilizes

* Text mining
* Natural language processing (NLP)
* Social network analysis (SNA)
* Predictive analytics

You can analyze interesting things like;

What is hot in recent patent applications?

What are some keyword trends in historical patent grants?

Which person or company is most cited?

Which topics will the next few upcoming patents cover for a specific industry or company?

https://bulkdata.uspto.gov/

Some bulk data sets are;

Patent Official Gazettes (JUL 2, 2002 - PRESENT)
Contains bibliographic (front page) information, a representative claim, and a drawing (if applicable) of each patent grant issued that week.

Patent Grant Multi-Page PDF Images (JUL 31, 1790 - PRESENT)
Contains the images of each patent grant issued weekly (Tuesdays) from July 31, 1790 to present in Portable Document Format (PDF)

Patent Grant Single-Page TIFF Images (JUL 31, 1790 - PRESENT) (Grant Yellow Book 2 based on WIPO ST.33)
Contains the images of each patent grant issued weekly (Tuesdays) from July 31, 1790 to present in Tagged Image File Format (TIFF)

Patent Grant Full Text Data with Embedded TIFF Images (JAN 2001 - PRESENT) (Grant Red Book based on WIPO ST.36)
Contains the full text, images/drawings, and complex work units (tables, mathematical expressions, chemical structures, and genetic sequence data) of each patent grant issued weekly (Tuesdays) from January 1, 2001 to present.

Patent Grant Full Text Data (No Images) (JAN 1976 - PRESENT)
Contains the full text of each patent grant issued weekly (Tuesdays) from January 1, 1976 to present (excludes images/drawings). Subset of the Patent Grant Full Text Data with Embedded TIFF Images.

Here is a page that utilizes these public data sets;

http://www.patentsview.org/web/#viz/relationships

NeilR · Accepted Answer

"Elections integrity" data from Twitter: https://about.twitter.com/en_us/values/elections-integrity.html#data

You'll need to enter your email address to get access to the data.

From the webpage:

In line with our principles of transparency and to improve public understanding of alleged foreign influence campaigns, Twitter is making publicly available archives of Tweets and media that we believe resulted from potentially state-backed information operations on our service.

These datasets are of a size that a degree of capability for large dataset analysis is required, we hope to support broad analysis by making a public version of these datasets (with some account-specific information hashed) available. You can download the datasets below. No content has been redacted. Specialist researchers can request access to an unhashed version of these datasets, which will be governed by a data use agreement that will include provisions to ensure the data is used within appropriate legal and ethical parameters.

These datasets include all public, nondeleted Tweets and media (e.g., images and videos) from accounts we believe are connected to state-backed information operations. Tweets deleted by these users prior to their suspension (which are not included in these datasets) generally comprise less than 1% of their overall activity. Note that not all of the accounts we identified as connected to these campaigns actively Tweeted, so the number of accounts represented in the datasets may be less than the total number of accounts listed here.

Atabarezz · Accepted Answer

https://webrobots.io/projects/

This site has a scraper robot which crawls web sites and collects data about them. Here are a few useful projects they share freely;

"We have a scraper robot which crawls Indiegogo projects and collects data about them. This robot was launched in May 2016 and we run crawl once a month. First dataset contains data about 91.5k projects."

https://webrobots.io/indiegogo-dataset/

"We have a scraper robot which crawls all Kickstarter projects and collects data in CSV and JSON formats. From March 2016 we run this data crawl once a month."

Note: from April 2015 we noticed that Kickstarter started limiting how many projects user can view in a single category. This limits the amount of historic projects we can get in a single scrape run. But recent and active projects are always included.

Note: from December 2015 we modified the collection approach to go through all sub-categories instead of only top level categories. This yields more results in the datasets, but possible duplication where projects are listed in multiple categories. Also from December 2015 JSON file is in JSON streaming format. Read more about it here: https://en.wikipedia.org/wiki/JSON_Streaming

Warning: files are compressed, size in area of 100mb. Uncompressed size around 600mb.
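For anyone unfamiliar with JSON streaming, here is a minimal sketch of reading a compressed file one record at a time instead of loading it all at once. It assumes gzip compression and the line-delimited variant (one JSON object per line), which may not match the actual files; the sample records and field names are invented for illustration.

```python
import gzip
import io
import json


def iter_json_lines(stream):
    """Yield one parsed object per non-empty line of a text stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical two-record sample, compressed in memory to stand in for a download.
sample = (
    b'{"name": "Project A", "pledged": 1200}\n'
    b'{"name": "Project B", "pledged": 300}\n'
)
compressed = gzip.compress(sample)

# gzip.open accepts a path or a file object; "rt" decodes to text while decompressing.
with gzip.open(io.BytesIO(compressed), mode="rt", encoding="utf-8") as fh:
    projects = list(iter_json_lines(fh))

total_pledged = sum(p["pledged"] for p in projects)
print(len(projects), total_pledged)
```

With a real file you would pass its path to `gzip.open` and aggregate inside the loop rather than building a list, so memory use stays flat even at the ~600 MB uncompressed size mentioned above.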

nicolasdeldalle · Accepted Answer

Here's a good site for public data in Brazil !  dados.gov.br