I'm opening this topic for everyone to list some Big data* sets available over the net.
Feel free to list competition/datathon data sets
Results of web scraping
Social media data
Anything bigger than 1 million records (beyond Excel and Access)
* Big data is data whose size is usually beyond the ability of commonly used software tools to manage and process within a tolerable elapsed time. Some examples: a year-long credit card transaction history, the CDRs (call data records) of a telecoms company for the last 9 months, or the behavioral credit data of a large financial institution.
Patent Official Gazettes (JUL 2, 2002 - PRESENT) Contains bibliographic (front page) information, a representative claim, and a drawing (if applicable) of each patent grant issued that week.
Patent Grant Multi-Page PDF Images (JUL 31, 1790 - PRESENT) Contains the images of each patent grant issued weekly (Tuesdays) from July 31, 1790 to present in Portable Document Format (PDF)
Patent Grant Single-Page TIFF Images (JUL 31, 1790 - PRESENT) (Grant Yellow Book 2 based on WIPO ST.33) Contains the images of each patent grant issued weekly (Tuesdays) from July 31, 1790 to present in Tagged Image File Format (TIFF)
Patent Grant Full Text Data with Embedded TIFF Images (JAN 2001 - PRESENT) (Grant Red Book based on WIPO ST.36) Contains the full text, images/drawings, and complex work units (tables, mathematical expressions, chemical structures, and genetic sequence data) of each patent grant issued weekly (Tuesdays) from January 1, 2001 to present.
Patent Grant Full Text Data (No Images) (JAN 1976 - PRESENT) Contains the full text of each patent grant issued weekly (Tuesdays) from January 1, 1976 to present (excludes images/drawings). Subset of the Patent Grant Full Text Data with Embedded TIFF Images.
Here is a page that utilizes these public data sets;
You'll need to enter your email address to get access to the data.
From the webpage:
In line with our principles of transparency and to improve public understanding of alleged foreign influence campaigns, Twitter is making publicly available archives of Tweets and media that we believe resulted from potentially state-backed information operations on our service.
These datasets are of a size that a degree of capability for large dataset analysis is required, we hope to support broad analysis by making a public version of these datasets (with some account-specific information hashed) available. You can download the datasets below. No content has been redacted. Specialist researchers can request access to an unhashed version of these datasets, which will be governed by a data use agreement that will include provisions to ensure the data is used within appropriate legal and ethical parameters.
These datasets include all public, nondeleted Tweets and media (e.g., images and videos) from accounts we believe are connected to state-backed information operations. Tweets deleted by these users prior to their suspension (which are not included in these datasets) generally comprise less than 1% of their overall activity. Note that not all of the accounts we identified as connected to these campaigns actively Tweeted, so the number of accounts represented in the datasets may be less than the total number of accounts listed here.
This site has a scraper robot which crawls web sites and collects data about them. Here are a few useful projects they share freely:
"We have a scraper robot which crawls Indiegogo projects and collects data about them. This robot was launched in May 2016 and we run crawl once a month. First dataset contains data about 91.5k projects."
"We have a scraper robot which crawls all Kickstarter projects and collects data in CSV and JSON formats. From March 2016 we run this data crawl once a month."
Note: from April 2015 we noticed that Kickstarter started limiting how many projects a user can view in a single category. This limits the number of historic projects we can get in a single scrape run, but recent and active projects are always included.
Note: from December 2015 we modified the collection approach to go through all sub-categories instead of only top-level categories. This yields more results in the datasets, but with possible duplication where projects are listed in multiple categories. Also from December 2015 the JSON file is in JSON streaming format. Read more about it here: https://en.wikipedia.org/wiki/JSON_Streaming
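For anyone who wants to load these files outside a workflow tool, here is a minimal Python sketch of reading a JSON streaming file and dropping the duplicates mentioned in the note. It assumes the line-delimited variant of JSON streaming (one JSON object per line), which I have not verified against the actual files. The key name `id` and the function names are hypothetical, so adjust them to whatever the records really contain.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_json_lines(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed object per non-empty line.

    Assumption: the file is line-delimited JSON (one object per line).
    """
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def drop_duplicates(
    records: Iterable[Dict[str, Any]], key: str = "id"
) -> Iterator[Dict[str, Any]]:
    """Keep the first record seen for each value of `key`.

    `key` is a hypothetical field name; records without it are passed through.
    """
    seen = set()
    for record in records:
        value = record.get(key)
        if value is None:
            yield record
            continue
        if value in seen:
            continue
        seen.add(value)
        yield record
```

Both functions are generators, so a project that appears under several sub-categories is filtered out while the file is read, without holding every record in memory (only the set of seen keys grows).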
Warning: files are compressed, with a size of around 100 MB. Uncompressed size is around 600 MB.
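Given those sizes, it can be handy to read a file without unpacking it to disk first. The sketch below assumes the download is gzip-compressed and contains UTF-8 text with one record per line; if the archive is actually a zip, Python's `zipfile` module would be the one to use instead. The function name is hypothetical.

```python
import gzip


def count_records(path: str) -> int:
    """Count non-empty lines in a gzip-compressed text file.

    Assumption: gzip compression and one record per line. The file is
    decompressed on the fly, so only one line is held in memory at a time.
    """
    count = 0
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                count += 1
    return count
```

A quick count like this is an easy way to check whether a data set really is past the 1 million record mark before pulling it into another tool.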