on 01-05-2017 03:57 PM - edited on 03-08-2019 01:06 PM by SydneyF
Where can I find available "Big Data Sets" over the internet?
Big data is data whose size is usually beyond the ability of commonly used software tools to manage and process within a tolerable elapsed time. Some examples: a year-long credit card transaction history, the CDRs (call data records) of a telecom company for the last 9 months, or the behavioral credit data of a large financial institution.
One of the standard datasets for Hadoop is the Enron email dataset comprising emails between Enron employees during the scandal. It's a great practice dataset for dealing with semi-structured data (file scraping, regexes, parsing, joining, etc.). It's ~400MB (compressed) and available for download at http://www.cs.cmu.edu/~enron/
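As a rough illustration of the parsing and regex practice mentioned above, here is a minimal Python sketch using only the standard library. It assumes the messages are plain-text files with RFC 822-style headers; the sample message, the `parse_message` helper, and the simplified `EMAIL_RE` pattern are all hypothetical and not taken from the dataset itself.

```python
import re
from email import message_from_string
from email.utils import getaddresses

# Hypothetical sample message (assumed header layout, invented content).
SAMPLE = """\
Message-ID: <12345.example@thyme>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: alice@enron.com
To: bob@enron.com, carol@enron.com
Subject: Q2 forecast

Please send the forecast to dave@enron.com by Friday.
"""

# Deliberately simple address pattern; real-world addresses can be messier.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def parse_message(raw):
    """Split one raw message into a flat record suitable for joining later."""
    msg = message_from_string(raw)
    # getaddresses handles comma-separated recipient lists in the To header.
    recipients = [addr for _, addr in getaddresses(msg.get_all("To", []))]
    body = msg.get_payload()
    return {
        "sender": msg["From"],
        "recipients": recipients,
        "subject": msg["Subject"],
        # Addresses mentioned in the body text, found with the regex above.
        "mentioned": EMAIL_RE.findall(body),
    }


record = parse_message(SAMPLE)
print(record)
```

To process the real download, you would read each message file from the extracted archive and pass its text to `parse_message`; the resulting records could then be joined on sender or recipient.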
Google BigQuery sample tables, hosted by Google: https://cloud.google.com/bigquery/sample-tables These include weather data, a timeline of actions such as pull requests and comments on GitHub repositories (with a nested or flat schema), US births 1969-2008, Shakespeare (the number of times each word appears), and Wikipedia articles (over 300,000,000 rows).
One of my favorites is Data.gov, which has tons of public data from all sectors, in sets of different sizes and in different formats, including API connections. This URL, http://www.data.gov/open-gov/, lists local governments in the US that publish open data. Their coverage at the local level varies in completeness.
This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open-source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions. https://cloud.google.com/bigquery/public-data/github