
SOLVED

Available "Big data sets" over the internet...

Atabarezz
13 - Pulsar

I'm opening this topic for everyone to list some Big data* sets available over the net.

 

  • Feel free to list competition/datathon data sets
  • Results of web scraping
  • Social media data
  • Anything bigger than 1 million records (beyond Excel and Access)

 

Best

 

Altan

 

* Big data is data whose size is usually beyond the ability of commonly used software tools to manage and process within a tolerable elapsed time. Some examples: a year-long credit card transaction history, the CDRs (call data records) of a telecoms company for the last 9 months, or the behavioral credit data of a large financial institution.
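For anyone who wants to check whether a downloaded file clears the "bigger than 1 million records" bar above, here is a minimal Python sketch. It assumes the data set is a CSV file with a header row; the function names `count_records` and `exceeds_threshold` are made up for illustration and are not part of any tool mentioned in this thread.

```python
import csv
import io


def count_records(lines, has_header=True):
    """Count CSV records by streaming, so the file is never loaded whole."""
    reader = csv.reader(lines)
    if has_header:
        next(reader, None)  # skip the header row, if there is one
    return sum(1 for _ in reader)


def exceeds_threshold(lines, threshold=1_000_000, has_header=True):
    """True when the record count is above the threshold (assumed: 1 million)."""
    return count_records(lines, has_header) > threshold


# Tiny in-memory sample standing in for a real download.
sample = io.StringIO("id,amount\n1,9.50\n2,12.00\n3,4.25\n")
print(count_records(sample))  # 3
```

For a real file, pass an open handle instead of the sample, for example `open(path, newline="")`, where `path` is wherever you saved the download.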

 

--------------

[Edited by a Moderator]

We've compiled the responses to this thread into the following Knowledge Base article: Available "Big Data Sets" on the Web

--------------

62 REPLIES
wymanb
5 - Atom

Here is a link to a list of data sources that I compiled a while back.  Hope it helps!

 

https://www.linkedin.com/pulse/need-data-bob-wyman?trk=mp-author-card

tom_montpool
12 - Quasar

The Government of Canada has an Open Data portal -- http://open.canada.ca/en/open-data -- it takes some digging to find the gems, but there are some.

 

There's also some open mapping data at -- http://open.canada.ca/en/open-maps.

 

I don't know how many of these qualify as "Big data sets"...but there are a few.

 

Cristian
9 - Comet

 

This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

 

https://cloud.google.com/bigquery/public-data/github

Cristian
9 - Comet

 

15.49TB of research data available.
http://academictorrents.com/

 

A scalable, secure, and fault-tolerant repository for data, with blazing fast download speeds.

 

Regards,

Cristian

 

asslam
5 - Atom

Australia, New South Wales Open data 

http://data.nsw.gov.au/

 

A mixture of data sets from different government departments. As Jason says, not all would qualify as big data.

MKosmicki
8 - Asteroid

The taxi dataset is what was used for IronViz at the Tableau conference in Nov 2016.

BrianO
Alteryx Alumni (Retired)

http://usafacts.org

 

This site just opened up and has tons of data. It looks like the ability to download each set is "coming soon" as the site is in beta at the time of this posting.

RithiS
Alteryx

https://public.enigma.com/ - Enigma Public describes itself as "the world’s broadest collection of public data."

gnans19
11 - Bolide

https://data.world/

 

Interesting datasets to enrich our data.

Cristian
9 - Comet

Vessel Traffic Data

https://marinecadastre.gov/ais/

 

11 billion rows of public ship AIS data to explore, spanning 2009 to 2014.
