Where can I find available "Big Data Sets" over the internet?
Big data is data whose size is beyond the ability of commonly used software tools to manage and process within a tolerable elapsed time. A year-long credit card transaction history, the last 9 months of a telecom company's CDRs (call detail records), or the behavioral credit data of a large financial institution are some examples...
Amazon (AWS) has a Large Data Sets Repository
Data.gov has close to 190k public data sets
One of the standard datasets for Hadoop is the Enron email dataset comprising emails between Enron employees during the scandal. It's a great practice dataset for dealing with semi-structured data (file scraping, regexes, parsing, joining, etc.). It's ~400MB (compressed) and available for download at http://www.cs.cmu.edu/~enron/
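If you want to experiment with the Enron set outside of Hadoop, a minimal Python sketch like the one below (assuming you've downloaded and extracted the maildir archive to a local folder) shows how little it takes to start parsing the semi-structured messages with the standard library's email module:

```python
import email
from pathlib import Path

# Hypothetical local path to the extracted Enron maildir archive
MAILDIR = Path("maildir")

# Walk every file in the tree and parse it as an RFC 822 message
for path in MAILDIR.rglob("*"):
    if not path.is_file():
        continue
    with open(path, "r", encoding="latin-1") as f:
        msg = email.message_from_file(f)
    # Pull a few headers; absent headers come back as None
    print(msg["From"], "->", msg["To"], "|", msg["Subject"])
```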
The Million Song Dataset: a collection of audio features and metadata for a million contemporary popular music tracks: http://labrosa.ee.columbia.edu/millionsong/ . Complementary datasets include:
SecondHandSongs dataset -> cover songs
musiXmatch dataset -> lyrics
Last.fm dataset -> song-level tags and similarity
Taste Profile subset -> user data
thisismyjam-to-MSD mapping -> more user data
tagtraum genre annotations -> genre labels
Top MAGD dataset -> more genre labels
You can either download the entire dataset (280 GB) or a subset of 10,000 songs (1.8 GB) for a quick taste.
GDELT set: http://www.gdeltproject.org/data.html
NY City taxi data sets, 1.1BN records: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml . See also "Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance," an open-source exploration of the city's neighborhoods, nightlife, airport traffic, and more, through the lens of publicly available taxi and Uber data.
Airline data set 1987-2008: https://github.com/h2oai/h2o-2/wiki/Hacking-Airline-DataSet-with-H2O
Google BigQuery sample tables, hosted by Google: https://cloud.google.com/bigquery/sample-tables . These include weather data; a timeline of actions such as pull requests and comments on GitHub repositories, with a nested or flat schema; US births 1969-2008; Shakespeare word counts (the number of times each word appears); and Wikipedia revisions, with over 300 million rows.
Lending Club: https://www.lendingclub.com/info/download-data.action
Here is a Telecom Italia dataset, the result of a computation over the Call Detail Records (CDRs) generated by the Telecom Italia cellular network over the city of Milan. You may have to sign in and activate your account, but it's totally free: https://dandelion.eu/datagems/SpazioDati/telecom-sms-call-internet-mi/description/
Data Science Central: http://www.datasciencecentral.com/profiles/blogs/big-data-sets-available-for-free
KDnuggets is a well-respected analytics blog; they have put together a very nice and deep list: http://www.kdnuggets.com/datasets/index.html
UK Data https://data.gov.uk/data
Google's Public Data Directory: http://www.google.com/publicdata/directory
For the Spatial and GIS folks: http://gisgeography.com/best-free-gis-data-sources-raster-vector/
The mother of big datasets - Reddit. 1.7bn JSON objects; 250GB compressed. https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment
Loads of really great links from here as well: https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public
IMDb: http://www.imdb.com/interfaces . A subset of the IMDb plain-text data files is available from their FTP sites: ftp.fu-berlin.de (Germany) and ftp.funet.fi (Finland).
One of my favorites is Data.gov, which hosts tons of public data from all sectors, in different set sizes and different formats, including API connections. This URL, http://www.data.gov/open-gov/ , lists each of the local governments in the US; they have varying degrees of completeness at the local level.
The Government of Canada has an Open Data portal -- http://open.canada.ca/en/open-data -- it takes some digging to find the gems, but there are some.
There's also some open mapping data at -- http://open.canada.ca/en/open-maps.
This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open-source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions. https://cloud.google.com/bigquery/public-data/github
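As a rough illustration, here is a minimal Python sketch of querying that dataset with the google-cloud-bigquery client (assuming the library is installed and your Google Cloud credentials are configured; the regex and the choice of the smaller sample_contents table are just for illustration):

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()  # assumes credentials are configured

# Count files in the smaller sample_contents table whose contents
# match a regular expression
query = """
    SELECT COUNT(*) AS hits
    FROM `bigquery-public-data.github_repos.sample_contents`
    WHERE REGEXP_CONTAINS(content, r'TODO')
"""
for row in client.query(query).result():
    print(row.hits)
```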
15.49TB of research data available. http://academictorrents.com/
Australia, New South Wales Open data http://data.nsw.gov.au/
USAFacts: Our Nation, in numbers. Federal, state, and local data from over 70 government sources.
What are some "Small Data Sets" available over the internet?
Small data is data small enough in size for human comprehension. A few thousand lines of credit data, example marketing-segmentation data, or a firm's B2B client contact history are some examples...
kaggle.com: "Kaggle is a platform for predictive modeling and analytics competitions on which companies and researchers post their data and statisticians and data miners from all over the world compete to produce the best models. This crowdsourcing approach relies on the fact that there are countless strategies that can be applied to any predictive modeling task and it is impossible to know at the outset which technique or analyst will be most effective." There are multiple small datasets available that you can test your skills on:
https://www.kaggle.com/c/informs2010 - the goal of this contest is to predict short-term movements in stock prices.
https://www.kaggle.com/c/axa-driver-telematics-analysis - use telematic data to identify a driver signature.
https://www.kaggle.com/c/sf-crime - predict the category of crimes that occurred in the city by the bay.
You can find 202 more under the following link: https://www.kaggle.com/competitions/search?DeadlineColumnSort=Descending
Kaggle has also started a section called Kaggle Datasets, which hosts public datasets you are free to use on their own, since datasets for the competitions were often restricted for use outside the competition: https://www.kaggle.com/datasets
Kaggle also has scripts for processing the given data sets: https://www.kaggle.com/scripts . These are usually in R or Python. It can be instructive to look at them and discern which parts can be pulled into standard Alteryx tools and which parts are better left to a custom R call, for instance. The nice thing is that, once you've finished, you can submit your output to the relevant Kaggle competition (even after the fact) to see how it stacks up against the competition.
"Small Data" set to test your skills on Duplicate Detection, Record Linkage, and Identity Uncertainty http://www.cs.utexas.edu/users/ml/riddle/data.html
Here is an addition from Europe: http://open-data.europa.eu/en/data/ . "The European Union Open Data Portal is the single point of access to a growing range of data from the institutions and other bodies of the European Union (EU). Data are free for you to use and reuse for commercial or non-commercial purposes. By providing easy and free access to data, the portal aims to promote their innovative use and unleash their economic potential. It also aims to help foster the transparency and the accountability of the institutions and other bodies of the EU."
While the Join tool is easily one of the most used tools in Alteryx, it can also be one of the most misunderstood. This is all the more likely if a new user hasn't previously used joins in another data-manipulation platform, or is joining big tables without keeping track of the records inside the fields being joined on.
For most tools that already have "dynamic" in the name, it would be redundant to call them one of the most dynamic tools in the Designer. That's not the case for Dynamic Input. With basic configuration, the Dynamic Input Tool allows you to specify a template (a file or database table) and input any number of tables that match that template format (shape/schema) by reading in a list of other sources or modifying SQL queries. This is especially useful for periodic data sets, but the use of the tool goes far beyond its basic configuration. To aid in your data blending, we've cataloged a handful of uses that make the Dynamic Input Tool so versatile.
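If you're curious what the tool's core behavior looks like outside of Alteryx, here is a minimal Python sketch of the same idea (the file names are hypothetical): read a list of sources that share one template schema and stack them into a single table.

```python
import glob
import pandas as pd

# Hypothetical monthly extracts that all share the same schema
paths = sorted(glob.glob("sales_2016_*.csv"))

# Read each file and stack the results into one table, tagging
# every row with the file it came from
frames = [pd.read_csv(p).assign(source_file=p) for p in paths]
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)
```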
The Fuzzy Match Tool provides some pretty amazing flexibility for string joins with inexact values, usually in the case of names, addresses, phone numbers, or zip codes, because many of the pre-configured match styles are designed around the formats of those types of strings. However, taking advantage of the custom match style, and carefully configuring the tool for the human-entered keyword strings in your data, also lets you use its loose string matching to match those values to cleaner dictionary keyword strings. Done properly, this can help you take otherwise unusable strings and, matching word by word, recombine your human-entered data into a standardized format that can be used in more advanced analyses.
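For a sense of how loose matching against a dictionary of clean keywords works, here is a minimal Python sketch using the standard library's difflib (the dictionary, typos, and cutoff are all hypothetical; Alteryx's Fuzzy Match uses its own match styles and algorithms):

```python
import difflib

# Hypothetical dictionary of clean, standardized keywords
dictionary = ["refund", "shipping", "warranty", "installation"]

def standardize(word, cutoff=0.7):
    """Map a human-entered word to its closest dictionary keyword."""
    matches = difflib.get_close_matches(word.lower(), dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

# Human-entered keywords with typos
raw = ["refnd", "shiping", "waranty"]
print([standardize(w) for w in raw])  # ['refund', 'shipping', 'warranty']
```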
As long as you know where to look, data has all the answers. Sometimes, though, those answers aren't as clear as day. More often than not, they need to be communicated in an effective format, one that can let the data talk and highlight the important motifs for you. Another favorite of the Reporting Tool Category, the Charting Tool can do just that by adding expressive visuals to any report or presentation. Offering an exhaustive list of charts to choose from (area, stacked area, column, stacked column, bar, stacked bar, line, tornado, Pareto, box and whisker, scatter, bubble, polar, radar, pie), the Charting Tool gives you the ability to add descriptive visuals, with legends and even watermarks, to your reporting workflows that will help you find the answers in your data.
Data Integrity refers to the accuracy and consistency of data stored in a database, data warehouse, data mart or other construct, and it is a fundamental component of any analytic workflow. In Alteryx, creating a macro to compare expected values to actual values in your data is quite simple and provides a quality control check before producing a visual report. Let me show you how to build this.
The two inputs represent the actual and expected values in your data. These data streams are passed through a Record ID tool to keep positional integrity and then passed on to the Transpose tool to create two columns. The first column contains the field names and the second column shows the values within each field. This data is then passed on to a join, matching on Record ID and the Name of the field, in order to compare each value. Lastly, if the data does not match from expected to actual, a custom message will appear in the results messages alerting the user where the mismatch happened within the dataset. The image below shows the error message produced if values differ across datasets.
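For readers who want the logic without the canvas, a minimal pandas sketch of the same transpose-join-compare pattern might look like this (the two toy tables are hypothetical):

```python
import pandas as pd

# Hypothetical expected and actual tables with the same schema
expected = pd.DataFrame({"region": ["East", "West"], "sales": [100, 250]})
actual = pd.DataFrame({"region": ["East", "West"], "sales": [100, 245]})

def transpose(df, label):
    """Add a record ID, then unpivot to (RecordID, Field, value)."""
    out = df.reset_index().rename(columns={"index": "RecordID"})
    return out.melt(id_vars="RecordID", var_name="Field", value_name=label)

# Join the two streams on RecordID + Field and compare the values
merged = transpose(expected, "Expected").merge(
    transpose(actual, "Actual"), on=["RecordID", "Field"])
mismatches = merged[merged["Expected"] != merged["Actual"]]
for _, row in mismatches.iterrows():
    print(f"Mismatch at record {row.RecordID}, field '{row.Field}': "
          f"expected {row.Expected}, got {row.Actual}")
```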
Web scraping, the process of extracting information (usually tabulated) from websites, is an extremely useful way to gather web-hosted data that isn't supplied via APIs. In many cases, if the data you are looking for is stand-alone or captured completely on one page (no need for dynamic API queries), scraping is even faster than developing a direct API connection to collect it.
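When the target page really is a single static table, the scrape can be just a few lines. Here is a minimal Python sketch (the URL is hypothetical; pandas.read_html needs lxml or beautifulsoup4 installed to parse the page):

```python
import pandas as pd
import requests

# Hypothetical page containing a static HTML table
url = "https://example.com/stats.html"

html = requests.get(url, timeout=30).text
tables = pd.read_html(html)  # parses every <table> on the page
df = tables[0]               # keep the first table
print(df.head())
```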
The Directory Tool gives you a data-stream input that contains information about the files and folders (file name, file date, last modified, etc.) for the location of your choice, which you can then use for more complex interactions with the file system. Basically, the Directory Tool could also finally help me track down my keys - not just where I put the keys in the house, but also how long they've been there, and when they were last moved.
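Outside of Alteryx, the same file-and-folder inventory is easy to sketch in Python with pathlib (the folder path is hypothetical):

```python
import datetime
from pathlib import Path

# Hypothetical folder to inventory
folder = Path("C:/data/incoming")

for f in folder.glob("*"):
    if f.is_file():
        stat = f.stat()
        modified = datetime.datetime.fromtimestamp(stat.st_mtime)
        print(f.name, stat.st_size, modified.isoformat(sep=" "))
```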
The Sample Tool allows you to selectively pass patterns, block excerpts, or samples of your records (or groups of records) in your dataset: the first N, last N, skipping the first N, 1 of every N, a random 1-in-N chance for each record to pass, and the first N%. These options come in the clutch pretty often in data preparation - that's why you'll find it in our Favorites Category, and for good reason. It's a great tool for sampling your data sets, and it has plenty of creative uses beyond that as well.
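Those sampling modes map cleanly onto one-liners in, say, pandas; here is a minimal sketch with a toy table (note that the random option only approximates a per-record 1-in-N chance):

```python
import pandas as pd

df = pd.DataFrame({"value": range(100)})  # toy dataset
n = 10

first_n = df.head(n)                      # first N records
last_n = df.tail(n)                       # last N records
skip_n = df.iloc[n:]                      # skip the first N records
one_of_n = df.iloc[::n]                   # 1 of every N records
random_n = df.sample(frac=1 / n)          # ~1-in-N random sample
first_pct = df.head(int(len(df) * n / 100))  # first N% of records
```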
You know what really stinks? Working with addresses that aren’t standardized or verified. Whether human-input, or one of the many address formatting standards in the U.S., being stuck with an address you can’t either (1) identify or (2) ensure it exists can be a real pain in the… well…
CASS is here to help!
The ConsumerView Matching macro enables users to match their customer file to the Experian ConsumerView data. Starting with customer information such as name and address you can leverage the ConsumerView macro in Alteryx to append a variety of information about your customers such as household segmentation, home purchase price, presence of children in a home, estimated education and income levels, length of residence, and many more!
The Field Info Tool is another one of the gems hidden in the Developer Tool Category. However, don't be intimidated: this is a tool for all of us! The purpose of the Field Info Tool is to give you information about the fields in your data in a way that you can use downstream as part of your workflow. There are no settings to configure, so just drop it on your canvas and you're good to go!
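As a point of comparison, producing the same kind of field metadata in Python takes only a few lines; this sketch (with a hypothetical toy input) builds a one-row-per-column summary you could feed into downstream logic:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ada", "Grace"], "age": [36, 45]})  # toy input

# Build a field-info table: one row per column, usable downstream
field_info = pd.DataFrame({
    "Name": df.columns,
    "Type": [str(t) for t in df.dtypes],
    "MaxLength": [df[c].astype(str).str.len().max() for c in df.columns],
})
print(field_info)
```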
Believe it or not, data can be beautiful. Take your black-and-white data points and add some color to them with the suite of tools found in the Reporting Category https://help.alteryx.com/current/index.htm#Getting_Started/AllTools.htm#Report_Presentation_Tools ! If you're looking to create reports, presentations, images, or simply output data with a bang, you can use the Render Tool https://help.alteryx.com/current/PortfolioComposerRender.htm paired with other Reporting Tools to create HTML files (*.html), Composer files (*.pcxml), PDF documents (*.pdf), RTF documents (*.rtf), Word documents (*.docx), Excel documents (*.xlsx), MHTML files (*.mht), PowerPoint presentations (*.pptx), PNG images (*.png), and even Zip files (*.zip) - packed with formatting and visual aesthetics that'll make any data-geek's mouth water.
When you’re frequently writing and rewriting data to Excel spreadsheets that you use for Excel graphs and charts, it can quickly become a hassle to make and remake your reporting objects to keep them up-to-date so you’re visualizing the most recent data. A best practice to keep the hassle out of the process exists, though! If you keep your plots isolated to their own spreadsheet, referencing cell values in another sheet used to capture your data, you can simply overwrite the source data sheet and your plots will update automatically upon launching Excel. In the example below (attached in the v10.6 workflow Dynamically Update Reporting from Excel Spreadsheets.yxzp) we’ve included the workaround to make your Excel outputs seamless.
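If you maintain the source data as its own workbook that the chart workbook references externally, the overwrite step is a one-liner from Python as well. Here is a minimal sketch of that variant (the file name and data are hypothetical; writing .xlsx via pandas requires openpyxl):

```python
import pandas as pd

# Hypothetical refreshed data; the chart workbook references this
# file's cells, so its plots update the next time it is opened
df = pd.DataFrame({"month": ["Jan", "Feb", "Mar"], "sales": [120, 135, 150]})

# Overwrite the source-data workbook wholesale
df.to_excel("source_data.xlsx", sheet_name="data", index=False)
```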
The Auto Field Tool: a tool so easy you don't have to do anything - just put it on your canvas and voilà: automatically optimized data types. If you're running into data-type-related issues and errors in your workflows, or just looking to add some speed or reduce the disk space your data is hoarding, look no further than the Preparation Tool Category's Auto Field Tool, which reads through all the records of an input and sets each field type to the smallest possible size relative to the data contained within the column.
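The idea translates directly to other stacks too; here is a rough pandas analogue (toy data; Alteryx's actual type inference is its own implementation) that shrinks each numeric column to the smallest dtype that still holds every value:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, 0.75, 1.0]})  # toy input

# Downcast each numeric column to the smallest sufficient dtype
for col in df.columns:
    if pd.api.types.is_integer_dtype(df[col]):
        df[col] = pd.to_numeric(df[col], downcast="integer")
    elif pd.api.types.is_float_dtype(df[col]):
        df[col] = pd.to_numeric(df[col], downcast="float")
print(df.dtypes)  # e.g. id -> int8, score -> float32
```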
This article provides step by step instructions for setting up a standard data install. There is already a fantastic article about Network Installs you can check out if you're interested.
A standard data install is pretty straightforward. Plug in the external drive and double-click the DataInstall.exe file to launch the installer.
When the welcome screen comes up, go ahead and click Next:
Read and Accept the license agreement on the next screen and click Next again:
Select the data sets you would like to install. If you want all of them just click the All button on the right. Otherwise, you can select individual datasets by selecting the check box next to them:
Next, choose any previously installed datasets that you would like to remove by selecting them in the tree structure similar to the previous screen. You don't have to choose anything here if you want to keep everything, however, keep in mind that the data bundle is very large and you may not have enough space to keep multiple vintages installed locally.
Finally, browse to the file path you would like to install the data to. The default path will be auto-populated but if you'd like to install it somewhere else just update the path:
Once you set your path, click Finish and sit back and relax while your data installs.