

Available "Big data sets" over the internet...

Atabarezz
13 - Pulsar

I'm opening this topic for everyone to list some Big data* sets available over the net.

 

  • Feel free to list competition/datathon data sets
  • Results of web scraping
  • Social media data
  • Anything bigger than 1 million records (beyond Excel and Access)

 

Best

 

Altan

 

* Big data is data whose size is usually beyond the ability of commonly used software tools to manage and process within a tolerable elapsed time. Some examples: a year-long credit card transaction history, the CDRs (call data records) of a telecoms company for the last 9 months, or the behavioral credit data of a large financial institution.

 

--------------

[Edited by a Moderator]

We've compiled the responses to this thread into the following Knowledge Base article: Available "Big Data Sets" on the Web

--------------

62 REPLIES
GarthMiles
7 - Meteor

Apologies if this is a repost:

 

Data Is Plural

Atabarezz
13 - Pulsar

 

https://msropendata.com/

was announced recently; just check out the big data sets... Awesome!

 

 


NeilR
Alteryx Alumni (Retired)

Many of these have already been mentioned, but a decent roundup post: The 50 Best Public Datasets for Machine Learning

ZacharyM
Alteryx Alumni (Retired)

Not all of the datasets here are 'big data', but this is a great tool that I use for coming up with fun/creative datasets for demos and such:

 

https://toolbox.google.com/datasetsearch

OldDogNewTricks
10 - Fireball

This may be listed in some of the other links that aggregate other data sets within this thread but I didn't see it mentioned independently.  Here is the UCI Machine Learning data repository:  link

 


Atabarezz
13 - Pulsar

Patent analytics is a great area that draws on:

  • Text mining
  • Natural language processing (NLP)
  • Social network analysis (SNA)
  • Predictive analytics

 

You can analyze interesting things like:

What is hot in recent patent applications?

What are some keyword trends in historical patent grants?

Which person or company is most cited?

What topics will the next few patents cover for a specific industry or company?

 

 


https://bulkdata.uspto.gov/

 

Some of the bulk data sets are:

Patent Official Gazettes (JUL 2, 2002 - PRESENT)
Contains bibliographic (front page) information, a representative claim, and a drawing (if applicable) of each patent grant issued that week.

 

Patent Grant Multi-Page PDF Images (JUL 31, 1790 - PRESENT)
Contains the images of each patent grant issued weekly (Tuesdays) from July 31, 1790 to present in Portable Document Format (PDF)

 

Patent Grant Single-Page TIFF Images (JUL 31, 1790 - PRESENT) (Grant Yellow Book 2 based on WIPO ST.33)
Contains the images of each patent grant issued weekly (Tuesdays) from July 31, 1790 to present in Tagged Image File Format (TIFF)

 

Patent Grant Full Text Data with Embedded TIFF Images (JAN 2001 - PRESENT) (Grant Red Book based on WIPO ST.36)
Contains the full text, images/drawings, and complex work units (tables, mathematical expressions, chemical structures, and genetic sequence data) of each patent grant issued weekly (Tuesdays) from January 1, 2001 to present.

 

Patent Grant Full Text Data (No Images) (JAN 1976 - PRESENT)
Contains the full text of each patent grant issued weekly (Tuesdays) from January 1, 1976 to present (excludes images/drawings). Subset of the Patent Grant Full Text Data with Embedded TIFF Images.

 

Here is a page that uses these public data sets:

http://www.patentsview.org/web/#viz/relationships

 

 


 

NeilR
Alteryx Alumni (Retired)

"Elections integrity" data from Twitter: https://about.twitter.com/en_us/values/elections-integrity.html#data

 

You'll need to enter your email address to get access to the data.

 

From the webpage:

 

In line with our principles of transparency and to improve public understanding of alleged foreign influence campaigns, Twitter is making publicly available archives of Tweets and media that we believe resulted from potentially state-backed information operations on our service.

 

These datasets are of a size that a degree of capability for large dataset analysis is required, we hope to support broad analysis by making a public version of these datasets (with some account-specific information hashed) available. You can download the datasets below. No content has been redacted. Specialist researchers can request access to an unhashed version of these datasets, which will be governed by a data use agreement that will include provisions to ensure the data is used within appropriate legal and ethical parameters.

 

These datasets include all public, nondeleted Tweets and media (e.g., images and videos) from accounts we believe are connected to state-backed information operations. Tweets deleted by these users prior to their suspension (which are not included in these datasets) generally comprise less than 1% of their overall activity. Note that not all of the accounts we identified as connected to these campaigns actively Tweeted, so the number of accounts represented in the datasets may be less than the total number of accounts listed here.

Atabarezz
13 - Pulsar

https://webrobots.io/projects/ 

 


 

This site runs scraper robots that crawl web sites and collect data about them. Here are a few useful projects they share freely:

 


 "We have a scraper robot which crawls Indiegogo projects and collects data about them. This robot was launched in May 2016 and we run crawl once a month. First dataset contains data about 91.5k projects."

https://webrobots.io/indiegogo-dataset/

 

 


 

"We have a scraper robot which crawls all Kickstarter projects and collects data in CSV and JSON formats. From March 2016 we run this data crawl once a month."

 

Note: from April 2015 we noticed that Kickstarter started limiting how many projects user can view in a single category. This limits the amount of historic projects we can get in a single scrape run. But recent and active projects are always included.

Note: from December 2015 we modified the collection approach to go through all sub-categories instead of only top level categories. This yields more results in the datasets, but possible duplication where projects are listed in multiple categories. Also from December 2015 JSON file is in JSON streaming format. Read more about it here: https://en.wikipedia.org/wiki/JSON_Streaming

Warning: files are compressed, size in area of 100mb. Uncompressed size around 600mb.
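
For anyone who wants to load one of these JSON streaming files outside a workflow, here is a minimal Python sketch. It rests on assumptions I have not checked against the actual download: that the file is gzip-compressed, that "JSON streaming" here means one JSON object per line, and that each record carries a unique "id" field. The function names and the "id" key are hypothetical, so adjust them to the real file. It reads the file line by line (so the 600mb never has to sit in memory at once) and drops the duplicates that can appear when a project is listed in multiple categories.

```python
import gzip
import json
from typing import Callable, Hashable, Iterable, Iterator, TextIO


def iter_json_lines(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed object per non-empty line (assumes line-delimited JSON)."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def dedupe(records: Iterable[dict], key: Callable[[dict], Hashable]) -> Iterator[dict]:
    """Yield only the first record seen for each key value."""
    seen = set()
    for record in records:
        k = key(record)
        if k not in seen:
            seen.add(k)
            yield record


def read_unique_projects(
    path: str,
    key: Callable[[dict], Hashable] = lambda r: r["id"],  # "id" is an assumed field name
) -> Iterator[dict]:
    """Stream a gzip-compressed (assumed) JSON-lines file, skipping duplicate projects."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        yield from dedupe(iter_json_lines(fh), key)
```

Usage would look like `for project in read_unique_projects("kickstarter.json.gz"): ...`, where the file name is only a placeholder. If the identifier is nested inside each record, pass a different `key` function.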

 

nicolasdeldalle
7 - Meteor

Here's a good site for public data in Brazil: dados.gov.br

FláviaB
Alteryx Community Team

Thank you for sharing @nicolasdeldalle

 

Also, haven't seen you in a while in our Portuguese Community 😉 

Flávia Brancato