01-05-2017 03:57 PM - edited 07-09-2021 11:26 AM
Where can I find available "Big Data Sets" over the internet?
Big data is data whose size is usually beyond the ability of commonly used software tools to manage and process within a tolerable elapsed time. Some examples: a year-long credit card transaction history, the CDRs (call data records) of a telecom company for the last 9 months, or the behavioral credit data of a large financial institution.
One of the standard datasets for Hadoop is the Enron email dataset, comprising emails between Enron employees during the scandal. It's a great practice dataset for dealing with semi-structured data (file scraping, regexes, parsing, joining, etc.). It's ~400MB (compressed) and available for download at http://www.cs.cmu.edu/~enron/
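To give an idea of the parsing step mentioned above, here is a minimal sketch using Python's standard `email` module. It assumes each message is a plain-text file with ordinary mail headers followed by a body; the `SAMPLE` message and the `parse_message` helper are made up for illustration and are not taken from the dataset.

```python
import email

# Hypothetical sample in ordinary mail-header format; not a real Enron message.
SAMPLE = """\
Message-ID: <1.example@localhost>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: alice@example.com
To: bob@example.com, carol@example.com
Subject: Meeting

Body text here.
"""


def parse_message(raw_text):
    """Turn one raw message into a flat dict that is easy to join or load."""
    msg = email.message_from_string(raw_text)
    # Split the To header on commas; strip() also removes the whitespace
    # left over when a long header is folded across several lines.
    recipients = [a.strip() for a in (msg.get("To") or "").split(",") if a.strip()]
    return {
        "from": msg.get("From"),
        "to": recipients,
        "subject": msg.get("Subject"),
        "date": msg.get("Date"),
        # Only single-part messages are handled in this sketch.
        "body": msg.get_payload() if not msg.is_multipart() else "",
    }


if __name__ == "__main__":
    print(parse_message(SAMPLE))
```

From there, each dict can be written out as one row (for example as JSON lines) and loaded into Hive, Spark, or whatever tool you are practicing with.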
Million Song Dataset, a collection of audio features and metadata for a million contemporary popular music tracks: http://labrosa.ee.columbia.edu/millionsong/
Additional datasets linked to it:
SecondHandSongs dataset -> cover songs
musiXmatch dataset -> lyrics
Last.fm dataset -> song-level tags and similarity
Taste Profile subset -> user data
thisismyjam-to-MSD mapping -> more user data
tagtraum genre annotations -> genre labels
Top MAGD dataset -> more genre labels
You can either download the entire dataset (280 GB) or a subset of 10,000 songs (1.8 GB) for a quick taste.
NYC taxi data sets, 1.1 billion records: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
Related write-up: "Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance", an open-source exploration of the city's neighborhoods, nightlife, airport traffic, and more, through the lens of publicly available taxi and Uber data.
Airline data set, 1987-2008: https://github.com/h2oai/h2o-2/wiki/Hacking-Airline-DataSet-with-H2O
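If you just want a quick look at a file like the airline data before loading it into a cluster, you can stream it row by row instead of reading it all into memory. The sketch below assumes a CSV export with a header row; the column names `UniqueCarrier` and `ArrDelay`, the "NA" marker for missing values, and the tiny in-memory sample are assumptions for illustration, so check them against the file you actually download.

```python
import csv
import io
from collections import defaultdict


def mean_delay_by_carrier(rows, carrier_col="UniqueCarrier", delay_col="ArrDelay"):
    """Average the delay column per carrier while reading one row at a time."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        try:
            delay = float(row.get(delay_col))
        except (TypeError, ValueError):
            continue  # skip missing or non-numeric values such as "NA"
        totals[row[carrier_col]] += delay
        counts[row[carrier_col]] += 1
    return {carrier: totals[carrier] / counts[carrier] for carrier in totals}


# Tiny made-up sample standing in for a real file.
SAMPLE_CSV = "UniqueCarrier,ArrDelay\nAA,10\nAA,20\nUA,NA\nUA,-4\n"

if __name__ == "__main__":
    # For a real file, replace io.StringIO(...) with open(path, newline="").
    reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
    print(mean_delay_by_carrier(reader))
```

Because only the running totals are kept, memory use stays small no matter how many rows the file has.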
Google BigQuery sample tables, hosted by Google: https://cloud.google.com/bigquery/sample-tables
They include weather data, a timeline of actions such as pull requests and comments on GitHub repositories (with a nested or flat schema), US births 1969-2008, Shakespeare (the number of times each word appears), and Wikipedia articles (over 300,000,000 rows).
Lending Club: https://www.lendingclub.com/info/download-data.action
For India - www.data.gov.in
There is also an ongoing discussion where we list all the data sets we find. Here is the link:
On January 23, 2020, Google released Dataset Search, a free tool for searching 25 million publicly available datasets.
Google blog post: https://www.google.com/amp/s/blog.google/products/search/discovering-millions-datasets-web/amp/
Related post: https://towardsdatascience.com/google-just-published-25-million-free-datasets-d83940e24284
Dataset Search: http://g.co/datasetsearch
You can also check this Data Science Company