
SOLVED

Available "Big data sets" over the internet...

Atabarezz
13 - Pulsar

I'm opening this topic for everyone to list some Big data* sets available over the net.

 

  • Feel free to list competition/datathon data sets
  • Results of web scraping
  • Social media data
  • Anything bigger than 1 million records (beyond Excel and Access)

 

Best

 

Altan

 

* Big data is data whose size is usually beyond the ability of commonly used software tools to manage and process within a tolerable elapsed time. A year-long credit card transaction history, the CDRs (call data records) of a telecoms company for the last 9 months, or the behavioral credit data of a large financial institution are some examples...

 

--------------

[Edited by a Moderator]

We've compiled the responses to this thread into the following Knowledge Base article: Available "Big Data Sets" on the Web

--------------

62 REPLIES
JessHansen
7 - Meteor

... and it has a rather easy-to-access API to get the data into Alteryx

Atabarezz
13 - Pulsar

Search New York Times articles from 1851 to today... Wow!

  • retrieving headlines,
  • abstracts and
  • links to associated multimedia.


 

You can also search:

  • book reviews,
  • movie reviews,
  • NYC event listings, 
  • top stories with images and more.

Here is the link: https://developer.nytimes.com/
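
For anyone who wants to try this outside a workflow first, here is a minimal Python sketch of searching articles and pulling out headlines, abstracts and links. The endpoint URL, the `api-key` parameter and the response field names (`response`, `docs`, `headline.main`, `abstract`, `web_url`) are assumptions based on my understanding of the Article Search API, so please check them against the developer portal. `build_search_url`, `extract_articles` and the sample payload are hypothetical names and data made up for illustration, and no network call is made.

```python
from urllib.parse import urlencode

# Assumed Article Search endpoint; confirm it on developer.nytimes.com.
BASE_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_search_url(query, api_key, page=0):
    """Build the request URL for one page of search results."""
    params = {"q": query, "page": page, "api-key": api_key}
    return BASE_URL + "?" + urlencode(params)


def extract_articles(payload):
    """Pull headline, abstract and link out of a parsed JSON response.

    The nesting used here (response -> docs -> headline.main, abstract,
    web_url) is an assumption about the response layout.
    """
    docs = payload.get("response", {}).get("docs", [])
    return [
        {
            "headline": (doc.get("headline") or {}).get("main"),
            "abstract": doc.get("abstract"),
            "url": doc.get("web_url"),
        }
        for doc in docs
    ]


# Hand-written sample in the assumed shape, so the sketch runs offline.
sample = {
    "response": {
        "docs": [
            {
                "headline": {"main": "Sample headline"},
                "abstract": "Sample abstract.",
                "web_url": "https://www.nytimes.com/sample",
            }
        ]
    }
}

print(build_search_url("big data", "YOUR_KEY"))
for article in extract_articles(sample):
    print(article["headline"], "|", article["url"])
```

In Alteryx, the same URL could be fed to a Download tool and the JSON parsed from there; the Python version is only meant to show the shape of the request and response.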

 

 

Merry Xmas!

 

Altan @Atabarezz

altan.atabarut@altdata.co

 

 

asabau
8 - Asteroid

Here is a good source for Financial data. Some of it is free and nice to play around with.

 

https://www.quandl.com/

LordNeilLord
15 - Aurora

FiveThirtyEight have opened up all of their datasets:

 

https://data.fivethirtyeight.com/

Cristian
9 - Comet
GarthM
Alteryx Alumni (Retired)

I didn't see this in previous posts, so here it is:

 

SQuAD Dataset

(Stanford Question Answering Dataset)

 

apologies if this is a duplicate

Atabarezz
13 - Pulsar

Definitely not a duplicate and an excellent source... Thanks

 

In brief: the Stanford Question Answering Dataset is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage.

 


 

 

With 100,000+ question-answer pairs on 500+ articles, SQuAD is significantly larger than previous reading comprehension datasets.

Basically, you can build your own Siri, or better, a Watson for Jeopardy, out of this...
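
To make the "answer is a span of the passage" idea concrete, here is a small Python sketch that flattens a SQuAD-style JSON into one row per question-answer pair and checks that each answer really is a slice of its passage. The key names (`data`, `paragraphs`, `context`, `qas`, `answers`, `text`, `answer_start`) follow the SQuAD v1.1 layout as I understand it. `flatten_squad` is a hypothetical helper, and the sample passage is made up rather than taken from the dataset.

```python
def flatten_squad(dataset):
    """Turn nested SQuAD-style JSON into one flat row per answer."""
    rows = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    text = answer["text"]
                    rows.append({
                        "title": article["title"],
                        "question": qa["question"],
                        "answer": text,
                        "answer_start": start,
                        # True when the answer is exactly the span that
                        # begins at answer_start in the passage.
                        "span_matches": context[start:start + len(text)] == text,
                    })
    return rows


# Made-up passage in the assumed v1.1 shape (not real SQuAD content).
CONTEXT = "The river flows north to the sea."
sample = {
    "version": "1.1",
    "data": [{
        "title": "Example_Article",
        "paragraphs": [{
            "context": CONTEXT,
            "qas": [{
                "id": "q1",
                "question": "In which direction does the river flow?",
                "answers": [{"text": "north",
                             "answer_start": CONTEXT.index("north")}],
            }],
        }],
    }],
}

for row in flatten_squad(sample):
    print(row)
```

A flat table like this is easier to bring into a workflow than the nested JSON.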

 


Atabarezz
13 - Pulsar

 

I hope you have heard of process mining. It has essentially the same goal as data mining: to analyze data from different perspectives and summarize it into insights that can be used when making business decisions.

 

But this time the context is the business processes of an organization. In process mining, event logs (data that already exists in a company's information systems) are used to visualize and benchmark what is actually happening in the company's processes and how they are executed in real life.

 

Almost all IT systems store data in databases and create logs that can be described in process mining terms as event data.

At the link below you can find many different event log data sets and start doing your process mining tasks using @Alteryx

 

http://data.4tu.nl/repository/collection:event_logs_real

 

figure2-english_small

 Initial-process-map-markup_small

 

 

The picture above shows an inflection point with a huge processing time... potentially a "bottleneck"
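
As a rough illustration of how such a bottleneck can be found from an event log, here is a minimal Python sketch. It assumes each event is a (case id, activity, timestamp) triple, which is a common layout for event logs but not necessarily the one used by the data sets linked above. The function names and the sample events are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def transition_durations(events):
    """Average hours between consecutive activities, per transition.

    events: iterable of (case_id, activity, iso_timestamp) triples.
    """
    by_case = defaultdict(list)
    for case_id, activity, timestamp in events:
        by_case[case_id].append((datetime.fromisoformat(timestamp), activity))

    hours = defaultdict(list)
    for steps in by_case.values():
        steps.sort()  # order each case's events by time
        for (t0, a0), (t1, a1) in zip(steps, steps[1:]):
            hours[(a0, a1)].append((t1 - t0).total_seconds() / 3600)

    return {pair: sum(vals) / len(vals) for pair, vals in hours.items()}


def bottleneck(averages):
    """Return the transition with the longest average waiting time."""
    return max(averages, key=averages.get)


# Made-up event log: two cases going through the same three steps.
events = [
    (1, "Receive", "2024-01-01T09:00:00"),
    (1, "Approve", "2024-01-01T10:00:00"),
    (1, "Ship", "2024-01-03T10:00:00"),
    (2, "Receive", "2024-01-02T09:00:00"),
    (2, "Approve", "2024-01-02T11:00:00"),
    (2, "Ship", "2024-01-05T11:00:00"),
]

averages = transition_durations(events)
print(averages)
print("Slowest step:", bottleneck(averages))
```

Averaging is the simplest choice here; with real logs, medians or percentiles are often more robust to a few extreme cases.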

 

 

sschakra
5 - Atom

Hi there - I am not able to access this link:

 

http://www.gedis-studio.com/online-call-detail-records-cdr-generator.html

 

I'm looking for a CDR generator and was hoping this would help. Please guide. Many thanks!

 
