Where can I find available "Big Data Sets" over the internet?
Big data is data whose size is typically beyond the ability of commonly used software tools to manage and process within a tolerable elapsed time. A year-long credit card transaction history, the Call Detail Records (CDRs) of a telecom company for the last 9 months, or the behavioral credit data of a large financial institution are some examples...
Amazon (AWS) has a Large Data Sets Repository
Data.gov has close to 190k public data sets
One of the standard datasets for Hadoop is the Enron email dataset comprising emails between Enron employees during the scandal. It's a great practice dataset for dealing with semi-structured data (file scraping, regexes, parsing, joining, etc.). It's ~400MB (compressed) and available for download at http://www.cs.cmu.edu/~enron/
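To give a feel for the kind of semi-structured parsing the Enron dataset calls for, here is a minimal sketch using Python's standard-library email parser. The message text below is invented for illustration, not taken from the actual dataset.

```python
from email.parser import Parser

# A minimal raw message in the style of the Enron maildir files
# (this sample text is illustrative, not from the actual dataset).
raw = """Message-ID: <12345.example@enron.com>
Date: Mon, 14 May 2001 16:39:00 -0700
From: jane.doe@enron.com
To: john.roe@enron.com
Subject: Q2 forecast

Please see the attached forecast numbers.
"""

msg = Parser().parsestr(raw)

# Header fields become dict-like lookups; the body is the payload.
sender = msg["From"]
subject = msg["Subject"]
body = msg.get_payload()

print(sender)   # jane.doe@enron.com
print(subject)  # Q2 forecast
```

Once headers and bodies are extracted this way, the usual regex and join work described above can be applied to the structured fields.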
Collection of audio features and metadata for a million contemporary popular music tracks: http://labrosa.ee.columbia.edu/millionsong/ . Companion datasets include:
SecondHandSongs dataset -> cover songs
musiXmatch dataset -> lyrics
Last.fm dataset -> song-level tags and similarity
Taste Profile subset -> user data
thisismyjam-to-MSD mapping -> more user data
tagtraum genre annotations -> genre labels
Top MAGD dataset -> more genre labels
You can either download the entire dataset (280 GB) or a subset of 10,000 songs (1.8 GB) for a quick taste.
GDELT set: http://www.gdeltproject.org/data.html
NYC taxi data sets, 1.1 billion records: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml . See also "Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance" - an open-source exploration of the city's neighborhoods, nightlife, airport traffic, and more, through the lens of publicly available taxi and Uber data.
Airline data set 1987-2008: https://github.com/h2oai/h2o-2/wiki/Hacking-Airline-DataSet-with-H2O
Google BigQuery sample tables - hosted by Google: https://cloud.google.com/bigquery/sample-tables . Weather data; a timeline of GitHub actions such as pull requests and comments, with a nested or flat schema; US births 1969-2008; Shakespeare word counts (the number of times each word appears); and Wikipedia revisions with over 300 million rows.
LENDING CLUB: https://www.lendingclub.com/info/download-data.action
Here is a Telecom Italia dataset derived from the Call Detail Records (CDRs) generated by the Telecom Italia cellular network over the city of Milan. You may have to sign in and activate your account, but it's totally free... https://dandelion.eu/datagems/SpazioDati/telecom-sms-call-internet-mi/description/
Data Science Central: http://www.datasciencecentral.com/profiles/blogs/big-data-sets-available-for-free
KDnuggets is a well-respected analytics blog; they have put together a very nice and deep list: http://www.kdnuggets.com/datasets/index.html
UK Data https://data.gov.uk/data
Google's Public Data Directory: http://www.google.com/publicdata/directory
For the Spatial and GIS folks: http://gisgeography.com/best-free-gis-data-sources-raster-vector/
The mother of big datasets - Reddit. 1.7bn JSON objects; 250GB compressed. https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment
Loads of really great links from here as well: https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public
http://www.imdb.com/interfaces - A subset of the IMDb plain text data files is available from their FTP sites: ftp.fu-berlin.de (Germany) and ftp.funet.fi (Finland).
One of my favorites is Data.gov, where there are tons of public data from all sectors, in sets of varying sizes and in different formats, including API connections. This URL, http://www.data.gov/open-gov/ , lists each of the local governments in the US. They have varying degrees of completion at the local level.
The Government of Canada has an Open Data portal -- http://open.canada.ca/en/open-data -- it takes some digging to find the gems, but there are some.
There's also some open mapping data at -- http://open.canada.ca/en/open-maps.
This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open-source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions. https://cloud.google.com/bigquery/public-data/github
15.49TB of research data available. http://academictorrents.com/
Australia, New South Wales Open data http://data.nsw.gov.au/
USAFacts: Our Nation, in numbers. Federal, state, and local data from over 70 government sources.
What are some "Small Data Sets" available over the internet?
Small data is data small enough in size for human comprehension. A few thousand lines of credit data, marketing segmentation example data, or a firm's B2B client contact history are some examples...
kaggle.com - "Kaggle is a platform for predictive modeling and analytics competitions on which companies and researchers post their data and statisticians and data miners from all over the world compete to produce the best models. This crowdsourcing approach relies on the fact that there are countless strategies that can be applied to any predictive modeling task and it is impossible to know at the outset which technique or analyst will be most effective." There are multiple small datasets available that you can test your skills on:
https://www.kaggle.com/c/informs2010 - The goal of this contest is to predict short-term movements in stock prices.
https://www.kaggle.com/c/axa-driver-telematics-analysis - Use telematic data to identify a driver signature.
https://www.kaggle.com/c/sf-crime - Predict the category of crimes that occurred in the city by the bay.
You may find 202 more under the following link: https://www.kaggle.com/competitions/search?DeadlineColumnSort=Descending
Kaggle has also started a section called Kaggle Datasets, which hosts public datasets you can use freely - the datasets for the competitions were often restricted for use outside the competition. https://www.kaggle.com/datasets
Kaggle also has scripts for processing the given data sets: https://www.kaggle.com/scripts , which are usually in R or Python. It can be instructive to look at those and discern which parts can be pulled into standard Alteryx tools, and which parts to leave to a custom R call, for instance. The nice thing is that, once you've finished, you can submit your output to the relevant Kaggle competition (even after the fact) to see how your output stacks up against the competition.
"Small Data" set to test your skills on Duplicate Detection, Record Linkage, and Identity Uncertainty http://www.cs.utexas.edu/users/ml/riddle/data.html
Here is an addition from Europe...http://open-data.europa.eu/en/data/ "The European Union Open Data Portal is the single point of access to a growing range of data from the institutions and other bodies of the European Union (EU). Data are free for you to use and reuse for commercial or non-commercial purposes. By providing easy and free access to data, the portal aims to promote their innovative use and unleash their economic potential. It also aims to help foster the transparency and the accountability of the institutions and other bodies of the EU."
Far more than just a window to your data, the Browse Tool has a catalog of features to best view, investigate, and copy/save data at any checkpoint where you place it. That insight into your data anywhere in your blending gives valuable feedback that often speeds workflow development and makes it easier to learn tools by readily visualizing their transforms. Be equipped, and browse through the catalog of useful applications below!
Time series forecasting is using a model to predict future values based on previously observed values. In a time series forecast, the prediction is based on history and we are assuming the future will resemble the past. We project current trends using existing data.
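The idea that a forecast projects history forward can be sketched with a naive moving-average forecast. This is a toy illustration with invented numbers, not the method any particular Alteryx tool uses.

```python
# A naive moving-average forecast: predict the next value as the mean of
# the last `window` observations (the sales figures below are invented).
def moving_average_forecast(history, window=3):
    """Forecast the next value from the trailing window of observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)

sales = [100, 104, 108, 112, 116, 120]
next_value = moving_average_forecast(sales, window=3)
print(next_value)  # 116.0  (mean of 112, 116, 120)
```

Real time series tools fit richer models (trend, seasonality, autocorrelation), but the underlying assumption is the same: the future will resemble the past.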
Typically the first step of Cluster Analysis in Alteryx Designer, the K-Centroids Diagnostics Tool assists you in determining an appropriate number of clusters to specify for a clustering solution in the K-Centroids Cluster Analysis Tool, given your data and specified clustering algorithm. Cluster analysis is an unsupervised learning algorithm, which means that there are no provided labels or targets for the algorithm to base its solution on. In some cases, you may know how many groups your data ought to be split into, but when this is not the case, you can use this tool to guide the number of target clusters your data most naturally divides into.
Clustering analysis has a wide variety of use cases, including harnessing spatial data for grouping stores by location, performing customer segmentation, or even detecting insurance fraud. Clustering analysis groups individual observations so that each group (cluster) contains data that are more similar to one another than to the data in other groups. Included with the Predictive Tools installation, the K-Centroids Cluster Analysis Tool allows you to perform cluster analysis on a data set using one of three algorithms: K-Means, K-Medians, and Neural Gas. In this Tool Mastery, we will go through the configuration and outputs of the tool.
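To make the K-Means idea concrete, here is a toy, pure-Python sketch on one-dimensional data: alternately assign each point to its nearest centroid, then move each centroid to the mean of its points. The data and starting centroids are made up for illustration; this is not the tool's actual implementation.

```python
# Toy 1-D K-Means: alternate assignment and centroid-update steps until
# the centroids stabilize (data and starting centroids are invented).
def kmeans_1d(points, centroids, iterations=10):
    for _ in range(iterations):
        clusters = {c: [] for c in centroids}
        for p in points:
            # Assignment step: each point joins its nearest centroid.
            nearest = min(centroids, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its points.
        centroids = [sum(ps) / len(ps) for ps in clusters.values() if ps]
    return sorted(centroids)

points = [1, 2, 3, 10, 11, 12]
print(kmeans_1d(points, centroids=[0.0, 5.0]))  # [2.0, 11.0]
```

The two centroids settle on the means of the two obvious groups; K-Medians and Neural Gas differ mainly in how the update step moves the centroids.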
The Field Summary Tool analyzes data and creates a summary report containing descriptive statistics of data in selected columns. It’s a great tool to use when you want to make sure your data is structured correctly before using any further analysis, most notably with the suite of models that can be generated with the Predictive Tools.
The humble histogram is something many people are first exposed to in grade school. Histograms are a type of bar graph that display the distribution of continuous numerical data. Histograms are sometimes confused with bar charts, which are plots of categorical variables.
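What a histogram computes can be shown in a few lines: counts of continuous values falling into equal-width bins. The height values below are invented for illustration.

```python
# A minimal histogram: count values falling into each bin. Bins are
# half-open [left, right), except the last, which includes its right
# edge (the convention NumPy's histogram also uses).
def histogram(values, bin_edges):
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            left, right = bin_edges[i], bin_edges[i + 1]
            if left <= v < right or (i == len(counts) - 1 and v == right):
                counts[i] += 1
                break
    return counts

heights = [150, 155, 161, 162, 168, 171, 174, 180]
print(histogram(heights, bin_edges=[150, 160, 170, 180]))  # [2, 3, 3]
```

A bar chart of a categorical variable, by contrast, has one bar per category rather than per numeric interval - which is exactly the confusion noted above.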
Welcome to the closing chapter of our voyage through the Pre-Predictive series! This has been a four-part journey introducing you to the thrilling world of data investigation. This section covers the plotting tools included in the Data Investigation Toolbox.
A common task that analysts can run into (and a good practice when analyzing data) is to determine whether the means of two sampled groups are significantly different. When this question arises, the Test of Means Tool is right for you! To demonstrate how to configure this tool and how to interpret the results, a workflow has been attached. The attached workflow (v. 11.7) compares the amount of money that customers spent across different regions in the US. The Dollars_Spent field identifies the amount of money an individual spent, and the Region field identifies the region that the individual resides in (NORTH, SOUTH, EAST, WEST).
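As a back-of-the-envelope sketch of what a two-sample test of means computes, here is Welch's t statistic in pure Python. The spending figures are invented, not the attached workflow's data, and the actual tool also reports a p-value from the t distribution.

```python
import statistics

# Welch's t statistic for two independent samples with possibly unequal
# variances (the regional spending numbers below are invented).
def welch_t(a, b):
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variance
    return (ma - mb) / ((va / len(a) + vb / len(b)) ** 0.5)

north = [52, 48, 55, 50, 51]
south = [45, 44, 47, 43, 46]
t = welch_t(north, south)
print(round(t, 2))  # 4.57 - a large t suggests the means differ
```

A t statistic this far from zero would, for samples of this size, correspond to a very small p-value, i.e. strong evidence the two regions' mean spending differs.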
The Field Info Tool is another one of the gems hidden in the Developer Tool Category – however don’t be intimidated, this is a tool for all of us! The purpose of the Field Info Tool is to give you the information about the fields in your data in a way that you can use down-stream as part of your workflow. There are no settings to configure, so just drop it on your canvas and you’re good to go!
The Association Analysis Tool allows you to choose any numerical fields and assesses the level of correlation between those fields. You can use the Pearson product-moment correlation, the Spearman rank-order correlation, or Hoeffding's D statistic to perform your analysis. You also have the option of doing an in-depth analysis of your target variable in relation to the other numerical fields. After you’ve run the tool, you will have two outputs:
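The difference between the first two measures can be sketched in pure Python (Hoeffding's D is omitted here): Pearson measures linear association on the raw values, while Spearman is simply Pearson applied to the ranks, so it rewards any monotone relationship. The sample values are made up.

```python
import statistics

# Pearson product-moment correlation on raw values.
def pearson(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Spearman rank-order correlation: Pearson applied to the ranks
# (this simple ranking assumes no tied values).
def spearman(x, y):
    rank = lambda v: [sorted(v).index(e) + 1 for e in v]
    return pearson(rank(x), rank(y))

x = [1, 2, 3, 4, 5]
y = [2, 4, 8, 16, 32]              # monotone but not linear
print(round(pearson(x, y), 3))     # 0.933 - strong but imperfect linear fit
print(round(spearman(x, y), 3))    # 1.0   - perfectly monotone
```

The gap between the two numbers is the tell: a perfectly monotone but curved relationship caps Spearman at 1 while Pearson falls short of it.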
We love helping users be successful with Alteryx, and this means providing a ton of great resources for getting started, learning more, and keeping you up to date with all the amazing stuff we're doing here at Alteryx… and the most compelling is Predictive!
Check out the Predictive District on the Gallery. There are great macros, apps, and sample workflows to demonstrate some nifty new tools. This post by DrDan on the Analytics Blog gives an overview of what's currently available – stay tuned for additions!
One of my favorites is the Predictive Analytics Starter Kit Volume 1. It enables you to learn the fundamentals of key predictive models with an interactive guided experience. It includes examples of Linear Regression, Logistic Regression, and A/B Testing, and demonstrates the steps necessary to develop the dataset needed for analysis, and then how to actually build these predictive models yourself.
With v10.6, we introduced the Prescriptive Tool Category, comprising the Optimization and Simulation tools, to assist with determining the best course of action or outcome for a particular situation or set of scenarios. The Engine Works Blog has an introduction to this toolset, plus an extensive use case demonstration.
If you need more Optimization and Simulation action, there are several sample workflows, including Fantasy Sports Lineups (hey, sports fans – blog post here!), a mixing problem, workforce scheduling, and more!
Speaking of use cases, the software itself contains a plethora of predictive sample workflows - and the installed Starter Kits show up here, too! Help > Sample Workflows > Predictive Analytics.
Of course, don't forget the Predictive Analytics help pages, for overviews and configuration tips.
Visit our Product Training page for On-Demand and Virtual webinars on everything Predictive – regression modelling, cluster analysis, time series… As always, please begin with Data Prep and Investigation! Can I mention the Field Summary Tool enough times?
Want to show off the interactive visualizations from the models you've built? This Knowledge Base post shows you how. Another Engine Works post outlines how to build your own Custom Interactive Visualizations (Part 1 and counting…)
For the most in-depth, resource-rich training on leveraging predictive analytics to answer your business questions, consider the Udacity Predictive Analytics for Business NanoDegree. It consists of seven courses focused on selecting the right methodology, data preparation, and data visualization as well as four courses that will equip you to use predictive analytics to answer your business problems.
But really, it all starts with the Community. Cruise the Knowledge Base posts, search for Predictive or other favorite keywords, follow the blogs… and for the love of Ned, just play with the software! It's how we learn :)
Occasionally you may see one of these errors from the Join Multiple tool. It is a result of Cartesian joins.
A Cartesian join is when you join every row of one table to every row of another table. You can also get one by joining every row of a table to every row of itself. A Cartesian join is very CPU intensive.
For example, if you have four files and each file has the same ID in it twice, that means it will join 2*2*2*2 times on the ID (the field on which you're joining is the key referenced in the error; in this example, it's Field1, and the duplicated value is 12345). The same can be caused by multiple nulls in each file.
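The row explosion described above can be illustrated in a few lines of Python. The "files" and field values below are invented to mirror the example: four inputs, each containing the same ID twice.

```python
from itertools import product

# Four toy "files", each holding the same ID twice (values invented to
# mirror the example above: Field1 = 12345, duplicated in every file).
files = [
    [("12345", "a1"), ("12345", "a2")],
    [("12345", "b1"), ("12345", "b2")],
    [("12345", "c1"), ("12345", "c2")],
    [("12345", "d1"), ("12345", "d2")],
]

# Joining all four on the ID matches every combination of rows that
# share the key - a Cartesian product across the files.
joined = [combo for combo in product(*files)
          if len({row[0] for row in combo}) == 1]

print(len(joined))  # 16, i.e. 2 * 2 * 2 * 2
```

With realistically sized inputs, that multiplication is what makes Cartesian joins so CPU-intensive - and why the Join Multiple tool offers the warning and error thresholds below.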
After your data prep and investigation, and when you know your data are correct, your choices on how to handle Cartesian joins include:
Allow multidimensional joins: The multidimensional join will occur with no error or warning reported.
Warn on multidimensional joins of more than 16 records: A warning will be reported in the Results window that a multidimensional join has occurred.
Error on multidimensional joins of more than 16 records: An error will be reported in the Results window that a multidimensional join has occurred and downstream processing will stop.
Mosaic BG Dominant and Mosaic BG Household Distribution counts are balanced to Experian’s census estimates. ConsumerView is a marketing file and therefore doesn’t need to be balanced to the census estimates.
Welcome to Part 3 (out of 4) of the Pre-Predictive series. In this article series, we are introducing you to the very exciting world of data investigation. This section covers the Association Analysis Tool, The Pearson Correlation Tool, and the Spearman Correlation Tool!
Welcome to Part 2 of the Pre-Predictive series! After a strong start but long hiatus, we will be resuming our tour of the Data Investigation Tools. This section will cover the Frequency Table, Contingency Table and Distribution Analysis Tools.
You want to impress your managers, so you decide to try some predictions on your data – forecasting, scoring potential marketing campaigns, finding new customers… That's great! Welcome to the addictive world of predictive analytics. We have the perfect platform for you to start exploring your data.
I know you want to dive right in and start testing models. It's tempting to just pull some data and start trying out tools, but the first and fundamentally most important part of all statistical analysis is the data investigation.
Your models won't mean much unless you understand your data. Here's where the Data Investigation Tools come in! You can get a statistical breakdown of each of your variables, both string and numeric, check for outliers (categorical and continuous), test correlations to slim down your predictors, and visualize the frequency and dispersion within each of your variables.
Part 1 of this article will give you an overview of the Field Summary Tool (never leave home without it!) Part 2 will touch on the Contingency and Frequency Tables, and Distribution Analysis; Part 3 will be the Association Analysis Tool, and the Pearson and Spearman Correlations; and Part 4 will be all the cool plotting tools.
Always, every day, literally every time you acquire a new data set, you will start with the Field Summary Tool. I cannot emphasize this enough, and I promise it will save you headaches.
There are three outputs to this tool: a data table containing your fields and their descriptive statistics, a static report, and the interactive visualization dashboard that provides a visual profile of your variables. From this output, you can select subsets to view, sort each of the panels, view and zoom in on specific values, and it even includes a visual indicator of data quality.
You'll get a nifty report with plots and descriptive statistics for each of your variables. Likely the most important part of this report is '% Missing' – ideally, you want 0.0% missing. If you are missing values, don't fret. You can remove these records or impute those values (another reason knowing your data is so important).
Also check 'Unique Values' – if you have a single unique value in one of your variables, that won't add anything useful to your model, so consider deselecting that variable.
The Remarks field is also very useful – it will suggest field-type changes, e.g. for fields with a small number of unique values that should perhaps be string fields. Or, if some values of your field have small value counts, you may consider combining some value levels together.
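The two checks described above - percent missing and unique-value counts - boil down to a couple of lines. The records below are invented for illustration; they are not the Field Summary Tool's implementation.

```python
# Invented sample records with one missing spend value.
records = [
    {"region": "NORTH", "spend": 52.0},
    {"region": "SOUTH", "spend": None},
    {"region": "NORTH", "spend": 48.5},
    {"region": "NORTH", "spend": 61.0},
]

def pct_missing(rows, field):
    """Percentage of rows where `field` is missing (None)."""
    missing = sum(1 for r in rows if r[field] is None)
    return 100.0 * missing / len(rows)

def unique_values(rows, field):
    """Distinct non-missing values of `field`."""
    return {r[field] for r in rows if r[field] is not None}

print(pct_missing(records, "spend"))            # 25.0
print(sorted(unique_values(records, "region"))) # ['NORTH', 'SOUTH']
```

A field with 25% missing would need imputation or record removal before modeling, and a field with a single unique value would be a candidate for deselection.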
The better YOU know your data, the more efficient and accurate your models will be. Only you know your data, your use case, and how your results are going to be applied. But we're here to help you get as familiar as you can with whatever data you have.
Stay tuned for subsequent articles – these tools will be your new best friends. Happy Alteryx-ing!
This article was put together to resolve a common data-cleansing issue and to showcase tools and techniques that newer users don't normally use. The goal of the article is to introduce newer users to these tools, open up their creativity, and hopefully take them to the next level!
In this use case, the data in the attached workflow is messy with capitalized strings all over the place. We want to format the data by removing some of the capitalization, but not all of it.
Note: If we wanted to capitalize the first letter of every word, we could use the Formula Tool and the TitleCase(String) function. This would turn BEAR the WEIGHT into Bear The Weight. See the difference?
The tools that we will be using in this exercise are the Record ID, Text to Columns, RegEx, Formula, Tile, and Cross Tab Tools.
The exercise will show you the importance of using the Record ID Tool, the flexibility of the Text to Columns and RegEx Tools, the under-used Tile Tool, the creativity of the Formula Tool, and the not-so-scary Cross Tab Tool when the data is configured properly.
We hope that these exercises and use cases open up your mind to the greatness of Alteryx!
Enjoy! The attached workflow is in version 10.5.