General Discussions

Atabarezz · ‎11-22-2015

I'm opening this topic for everyone to list some big data* sets available over the net.

Feel free to list:

Competition/datathon data sets
Results of web scraping
Social media data
Anything bigger than 1 million records (beyond Excel and Access)

Best

Altan

* Big data is data whose size is usually beyond the ability of commonly used software tools to manage and process within a tolerable elapsed time. Some examples: a year-long credit card transaction history, the CDRs (call data records) of a telecom company for the last 9 months, or the behavioral credit data of a large financial institution.

--------------

[Edited by a Moderator]

We've compiled the responses to this thread into the following Knowledge Base article: Available "Big Data Sets" on the Web

--------------

GarthMiles · ‎09-13-2018

apologies if this is a repost:

Data Is Plural

Atabarezz · ‎10-06-2018

https://msropendata.com/

This was announced recently. Check out the big data sets... Awesome!

NeilR · ‎10-07-2018

Many of these have already been mentioned, but a decent roundup post: The 50 Best Public Datasets for Machine Learning

ZacharyM · ‎12-03-2018

Not all of the datasets here are 'big data', but this is a great tool that I use for coming up with fun/creative datasets for demos and such:

https://toolbox.google.com/datasetsearch

OldDogNewTricks · ‎12-05-2018

This may be listed in some of the other links that aggregate other data sets within this thread but I didn't see it mentioned independently. Here is the UCI Machine Learning data repository: link

Atabarezz · ‎04-03-2019

Patent analytics is a great area that draws on:

Text mining
Natural language processing (NLP)
Social network analysis (SNA)
Predictive analytics

You can analyze interesting things like:

What is hot in recent patent applications?

What are some keyword trends in historical patent grants?

Which person or company is most cited?

Which topics will the next few patents cover for a specific industry or company?

https://bulkdata.uspto.gov/

Some of the bulk data sets are:

Patent Official Gazettes (JUL 2, 2002 - PRESENT)
Contains bibliographic (front page) information, a representative claim, and a drawing (if applicable) of each patent grant issued that week.

Patent Grant Multi-Page PDF Images (JUL 31, 1790 - PRESENT)
Contains the images of each patent grant issued weekly (Tuesdays) from July 31, 1790 to present in Portable Document Format (PDF)

Patent Grant Single-Page TIFF Images (JUL 31, 1790 - PRESENT) (Grant Yellow Book 2 based on WIPO ST.33)
Contains the images of each patent grant issued weekly (Tuesdays) from July 31, 1790 to present in Tagged Image File Format (TIFF)

Patent Grant Full Text Data with Embedded TIFF Images (JAN 2001 - PRESENT) (Grant Red Book based on WIPO ST.36)
Contains the full text, images/drawings, and complex work units (tables, mathematical expressions, chemical structures, and genetic sequence data) of each patent grant issued weekly (Tuesdays) from January 1, 2001 to present.

Patent Grant Full Text Data (No Images) (JAN 1976 - PRESENT)
Contains the full text of each patent grant issued weekly (Tuesdays) from January 1, 1976 to present (excludes images/drawings). Subset of the Patent Grant Full Text Data with Embedded TIFF Images.

Here is a page that uses these public data sets:

http://www.patentsview.org/web/#viz/relationships
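
The keyword-trend question above could be sketched roughly as follows. This is only a minimal illustration: the records are made up, and the (year, title) layout is an assumption, not the actual USPTO bulk data format, which would need to be parsed first.

```python
from collections import Counter, defaultdict

# Made-up (grant year, title) pairs for illustration only;
# real USPTO bulk files would have to be parsed into this shape first.
grants = [
    (2016, "Battery management system for electric vehicles"),
    (2017, "Neural network accelerator"),
    (2017, "Battery cooling apparatus"),
    (2018, "Neural network training method"),
    (2018, "Neural network pruning"),
]

# Hypothetical keywords to track over time.
keywords = ("battery", "neural")

# trend[keyword][year] = number of grants whose title contains the keyword
trend = defaultdict(Counter)
for year, title in grants:
    words = set(title.lower().split())
    for kw in keywords:
        if kw in words:
            trend[kw][year] += 1

for kw in keywords:
    print(kw, sorted(trend[kw].items()))
```

Matching whole words in titles is the simplest possible approach; a real analysis would more likely use the full text with proper tokenization and stemming.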

NeilR · ‎07-07-2019

"Elections integrity" data from Twitter: https://about.twitter.com/en_us/values/elections-integrity.html#data

You'll need to enter your email address to get access to the data.

From the webpage:

In line with our principles of transparency and to improve public understanding of alleged foreign influence campaigns, Twitter is making publicly available archives of Tweets and media that we believe resulted from potentially state-backed information operations on our service.

These datasets are of a size that a degree of capability for large dataset analysis is required, we hope to support broad analysis by making a public version of these datasets (with some account-specific information hashed) available. You can download the datasets below. No content has been redacted. Specialist researchers can request access to an unhashed version of these datasets, which will be governed by a data use agreement that will include provisions to ensure the data is used within appropriate legal and ethical parameters.

These datasets include all public, nondeleted Tweets and media (e.g., images and videos) from accounts we believe are connected to state-backed information operations. Tweets deleted by these users prior to their suspension (which are not included in these datasets) generally comprise less than 1% of their overall activity. Note that not all of the accounts we identified as connected to these campaigns actively Tweeted, so the number of accounts represented in the datasets may be less than the total number of accounts listed here.

Atabarezz · ‎07-15-2019

https://webrobots.io/projects/

This site runs scraper robots that crawl websites and collect data about them. Here are a few useful projects they share freely:

"We have a scraper robot which crawls Indiegogo projects and collects data about them. This robot was launched in May 2016 and we run crawl once a month. First dataset contains data about 91.5k projects."

https://webrobots.io/indiegogo-dataset/

"We have a scraper robot which crawls all Kickstarter projects and collects data in CSV and JSON formats. From March 2016 we run this data crawl once a month."

Note: from April 2015 we noticed that Kickstarter started limiting how many projects user can view in a single category. This limits the amount of historic projects we can get in a single scrape run. But recent and active projects are always included.

Note: from December 2015 we modified the collection approach to go through all sub-categories instead of only top level categories. This yields more results in the datasets, but possible duplication where projects are listed in multiple categories. Also from December 2015 JSON file is in JSON streaming format. Read more about it here: https://en.wikipedia.org/wiki/JSON_Streaming

Warning: files are compressed, size in area of 100mb. Uncompressed size around 600mb.
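
For anyone unsure how to read a JSON streaming file, here is a minimal sketch. It assumes the line-delimited variant (one JSON object per line), which is only one of the formats described on the Wikipedia page, and the field names in the sample are hypothetical, not the real dataset schema. It also shows one way to drop the duplicates mentioned in the note above.

```python
import io
import json
from typing import Iterator, TextIO


def iter_json_lines(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed object per non-empty line of a line-delimited JSON stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Made-up sample standing in for a downloaded file; field names are hypothetical.
sample = io.StringIO(
    '{"id": 1, "name": "Project A", "category": "games"}\n'
    '{"id": 2, "name": "Project B", "category": "music"}\n'
    '\n'
    '{"id": 1, "name": "Project A", "category": "tabletop"}\n'
)

# Keep the first record seen per (hypothetical) "id", since a project
# listed in several categories may appear more than once.
seen = {}
for record in iter_json_lines(sample):
    seen.setdefault(record["id"], record)

unique_projects = list(seen.values())
print(len(unique_projects))
```

Because records are parsed one line at a time, the whole file never has to fit in memory. For a real download you would pass an open file object instead of the sample; if the archive turns out to be gzip-compressed (not confirmed here), `gzip.open(path, "rt")` would give a suitable text stream.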

nicolasdeldalle · ‎12-19-2019

Here's a good site for public data in Brazil: dados.gov.br

FláviaB · ‎12-19-2019

Thank you for sharing @nicolasdeldalle!

Also, haven't seen you in a while in our Portuguese Community 😉

Flávia Brancato

Available "Big data sets" over the internet...
