Alteryx Designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.
SOLVED

Available "Big data sets" over the internet...

Meteor

Apologies if this is a repost:

 

Data Is Plural

Alteryx Partner

 

https://msropendata.com/

was announced recently. Check out the big data sets there... Awesome!

 

 

Picture1.png

 

Picture2.png

Sr. Community Content Manager

Many of these have already been mentioned, but here is a decent roundup post: The 50 Best Public Datasets for Machine Learning

Alteryx

Not all of the datasets here are 'big data', but this is a great tool that I use for coming up with fun/creative datasets for demos and such:

 

https://toolbox.google.com/datasetsearch

This may already be covered by one of the aggregator links elsewhere in this thread, but I didn't see it mentioned on its own. Here is the UCI Machine Learning data repository: link

 

UCI_MachineLearningRepositorySS.jpg

Alteryx Partner

Patent analytics is a great area that draws on:

  • Text mining
  • Natural language processing (NLP)
  • Social network analysis (SNA)
  • Predictive analytics

 

You can analyze interesting things like:

What is hot in recent patent applications?

What are some keyword trends in historical patent grants?

Which person or company is most cited?

What topics will the next few patents cover for a specific industry or company?
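
As a rough illustration of the keyword-trend question, here is a minimal Python sketch that counts title keywords per year. The sample records, the stopword list, and the function name `keyword_counts_by_year` are all made up for the example; real USPTO bulk files would first need to be parsed into (year, title) pairs, which is not shown here.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stopword list; tune it for real patent text.
STOPWORDS = {"a", "an", "and", "for", "of", "the", "with", "method", "system"}


def keyword_counts_by_year(records):
    """Count non-stopword title words per year from (year, title) pairs."""
    counts = defaultdict(Counter)
    for year, title in records:
        words = re.findall(r"[a-z]+", title.lower())
        counts[year].update(w for w in words if w not in STOPWORDS)
    return counts


# Made-up sample records, not real patent data.
sample_patents = [
    (2017, "Method for training a neural network"),
    (2018, "Neural network accelerator with battery management"),
    (2018, "Battery charging system for a neural implant"),
]

trends = keyword_counts_by_year(sample_patents)
for year in sorted(trends):
    print(year, trends[year].most_common(3))
```

Comparing the per-year counters gives a first look at which terms are rising or falling before moving on to proper text mining.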

 

 

OE3004.png

https://bulkdata.uspto.gov/

 

Some of the bulk data sets are:

Patent Official Gazettes (JUL 2, 2002 - PRESENT)
Contains bibliographic (front page) information, a representative claim, and a drawing (if applicable) of each patent grant issued that week.

 

Patent Grant Multi-Page PDF Images (JUL 31, 1790 - PRESENT)
Contains the images of each patent grant issued weekly (Tuesdays) from July 31, 1790 to present in Portable Document Format (PDF)

 

Patent Grant Single-Page TIFF Images (JUL 31, 1790 - PRESENT) (Grant Yellow Book 2 based on WIPO ST.33)
Contains the images of each patent grant issued weekly (Tuesdays) from July 31, 1790 to present in Tagged Image File Format (TIFF)

 

Patent Grant Full Text Data with Embedded TIFF Images (JAN 2001 - PRESENT) (Grant Red Book based on WIPO ST.36)
Contains the full text, images/drawings, and complex work units (tables, mathematical expressions, chemical structures, and genetic sequence data) of each patent grant issued weekly (Tuesdays) from January 1, 2001 to present.

 

Patent Grant Full Text Data (No Images) (JAN 1976 - PRESENT)
Contains the full text of each patent grant issued weekly (Tuesdays) from January 1, 1976 to present (excludes images/drawings). Subset of the Patent Grant Full Text Data with Embedded TIFF Images.

 

Here is a page that uses these public data sets:

http://www.patentsview.org/web/#viz/relationships

 

 

Adsız.jpg

 

Sr. Community Content Manager

"Elections integrity" data from Twitter: https://about.twitter.com/en_us/values/elections-integrity.html#data

 

You'll need to enter your email address to get access to the data.

 

From the webpage:

 

In line with our principles of transparency and to improve public understanding of alleged foreign influence campaigns, Twitter is making publicly available archives of Tweets and media that we believe resulted from potentially state-backed information operations on our service.

 

These datasets are of a size that a degree of capability for large dataset analysis is required, we hope to support broad analysis by making a public version of these datasets (with some account-specific information hashed) available. You can download the datasets below. No content has been redacted. Specialist researchers can request access to an unhashed version of these datasets, which will be governed by a data use agreement that will include provisions to ensure the data is used within appropriate legal and ethical parameters.

 

These datasets include all public, nondeleted Tweets and media (e.g., images and videos) from accounts we believe are connected to state-backed information operations. Tweets deleted by these users prior to their suspension (which are not included in these datasets) generally comprise less than 1% of their overall activity. Note that not all of the accounts we identified as connected to these campaigns actively Tweeted, so the number of accounts represented in the datasets may be less than the total number of accounts listed here.

Alteryx Partner

https://webrobots.io/projects/ 

 

WebRobotsLogo300x82.png

 

This site runs scraper robots that crawl web sites and collect data about them. Here are a few useful projects they share freely:

 

Indiegogo_logo.png

 "We have a scraper robot which crawls Indiegogo projects and collects data about them. This robot was launched in May 2016 and we run crawl once a month. First dataset contains data about 91.5k projects."

https://webrobots.io/indiegogo-dataset/

 

 

1280px-Kickstarter_logo.svg.png

 

"We have a scraper robot which crawls all Kickstarter projects and collects data in CSV and JSON formats. From March 2016 we run this data crawl once a month."

 

Note: from April 2015 we noticed that Kickstarter started limiting how many projects user can view in a single category. This limits the amount of historic projects we can get in a single scrape run. But recent and active projects are always included.

Note: from December 2015 we modified the collection approach to go through all sub-categories instead of only top level categories. This yields more results in the datasets, but possible duplication where projects are listed in multiple categories. Also from December 2015 JSON file is in JSON streaming format. Read more about it here: https://en.wikipedia.org/wiki/JSON_Streaming

Warning: files are compressed, size in area of 100mb. Uncompressed size around 600mb.
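
Because the notes above mention JSON streaming and possible duplicates across categories, here is a minimal Python sketch for loading such a file and dropping repeats. It assumes the line-delimited variant of JSON streaming (one object per line), gzip compression, and a top-level `id` field; none of these are confirmed by the post, so check the actual download before relying on it. The sample records are made up.

```python
import gzip
import io
import json


def read_json_lines(stream):
    """Yield one parsed object per non-empty line of a text stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def dedupe_by_key(records, key="id"):
    """Keep the first record seen for each value of `key` (assumed field name)."""
    seen = set()
    unique = []
    for record in records:
        value = record[key]
        if value not in seen:
            seen.add(value)
            unique.append(record)
    return unique


def load_unique_projects(path, key="id"):
    """Read a gzip-compressed, line-delimited JSON file and drop duplicates."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return dedupe_by_key(read_json_lines(handle), key=key)


# Tiny in-memory sample standing in for a real download (made-up records).
sample = io.StringIO(
    '{"id": 1, "name": "A", "category": "Games"}\n'
    '{"id": 2, "name": "B", "category": "Art"}\n'
    '{"id": 1, "name": "A", "category": "Tabletop Games"}\n'
)
unique_projects = dedupe_by_key(read_json_lines(sample))
print(len(unique_projects))
```

Reading line by line keeps memory use low for the roughly 600 MB uncompressed files, since only the deduplicated records are held in memory.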

 
