This short, but packed demonstration will show you why tens of thousands of data analysts from more than 1,800 companies rely on Alteryx daily to prep, blend, and analyze data, to deliver deeper business insights in hours, not weeks.
The Find Replace Tool is one of those tools that goes relatively unused and uncelebrated until you stumble into a data blending technique that would be extremely difficult without it – at which point, it becomes your favorite tool in the Designer. You can find it in the Join Category and it’ll make easy string substitutions in your data that would otherwise require herculean effort to work around. Today, we celebrate Find Replace as a hero.
Sampling weights, also known as survey weights, are positive values associated with the observations (rows) in your dataset (sample), used to ensure that metrics derived from a data set are representative of the population (the set of observations).
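As a rough illustration of why sampling weights matter, here is a minimal sketch of a weighted mean in plain Python. The income values, the weights, and the helper name `weighted_mean` are all made up for this example; they do not come from any Alteryx tool or dataset.

```python
def weighted_mean(values, weights):
    """Return the mean of values, where each value counts in proportion to its weight."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if any(w <= 0 for w in weights):
        raise ValueError("sampling weights must be positive")
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical sample: three respondents and the number of people in the
# population that each respondent is assumed to represent.
incomes = [30.0, 50.0, 100.0]
weights = [100.0, 50.0, 10.0]

unweighted = sum(incomes) / len(incomes)   # treats every row equally
weighted = weighted_mean(incomes, weights)  # accounts for representation

print(unweighted, weighted)
```

The unweighted mean treats each row as one person, while the weighted mean lets the first respondent stand in for 100 people, which pulls the estimate toward that respondent's income.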
How do you use the Arrange Tool in Alteryx?
The Arrange tool allows you to manually transpose and rearrange your data fields for presentation purposes. Data is transformed so that each record is turned into multiple records, and columns can be created by using field description data.
Set up the Arrange tool:
Key Fields: Select columns from your data stream.
Output Fields: Create and manipulate output fields. To create a new output field, click Column and select Add to open the Add Column window.
Column Header: Enter the name of the new column of data.
Fill in Description Column: Select Add New Description to create a column containing the description value of the selected fields.
Please find the example Arrange.yxmd attached.
Where can I find available "Big Data Sets" over the internet?
Big data is data whose size is usually beyond the ability of commonly used software tools to manage and process within a tolerable elapsed time. A year-long credit card transaction history, the CDRs (call detail records) of a telecoms company for the last 9 months, and the behavioral credit data of a large financial institution are some examples...
Amazon (AWS) has a Large Data Sets Repository
Data.gov has close to 190k public data sets
One of the standard datasets for Hadoop is the Enron email dataset comprising emails between Enron employees during the scandal. It's a great practice dataset for dealing with semi-structured data (file scraping, regexes, parsing, joining, etc.). It's ~400MB (compressed) and available for download at http://www.cs.cmu.edu/~enron/
Collection of audio features and metadata for a million contemporary popular music tracks: http://labrosa.ee.columbia.edu/millionsong/ . Companion datasets: SecondHandSongs dataset -> cover songs; musiXmatch dataset -> lyrics; Last.fm dataset -> song-level tags and similarity; Taste Profile subset -> user data; thisismyjam-to-MSD mapping -> more user data; tagtraum genre annotations -> genre labels; Top MAGD dataset -> more genre labels. You can either download the entire dataset (280 GB) or a subset of 10,000 songs (1.8 GB) for a quick taste.
GDELT set: http://www.gdeltproject.org/data.html
NY City taxi data sets 1.1BN records: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance An open-source exploration of the city's neighborhoods, nightlife, airport traffic, and more, through the lens of publicly available taxi and Uber data
Airline data set 1987-2008: https://github.com/h2oai/h2o-2/wiki/Hacking-Airline-DataSet-with-H2O
Google BigQuery sample tables, hosted by Google: https://cloud.google.com/bigquery/sample-tables These include weather data, a timeline of actions such as pull requests and comments on GitHub repositories with a nested or flat schema, US births 1969-2008, Shakespeare (the number of times each word appears), and Wikipedia articles with over 300,000,000 rows.
LENDING CLUB: https://www.lendingclub.com/info/download-data.action
Here is a Telecom Italia dataset, the result of a computation over the Call Detail Records (CDRs) generated by the Telecom Italia cellular network over the city of Milano. You may have to sign in and activate your account, but it's totally free... https://dandelion.eu/datagems/SpazioDati/telecom-sms-call-internet-mi/description/
Data Science Central: http://www.datasciencecentral.com/profiles/blogs/big-data-sets-available-for-free
KDnuggets is a well-respected analytics blog; they have put together a very nice and deep list: http://www.kdnuggets.com/datasets/index.html
UK Data https://data.gov.uk/data
Google's Public Data Directory: http://www.google.com/publicdata/directory
For the Spatial and GIS folks: http://gisgeography.com/best-free-gis-data-sources-raster-vector/
The mother of big datasets - Reddit. 1.7bn JSON objects; 250GB compressed. https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment
Loads of really great links from here as well: https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public
http://www.imdb.com/interfaces A subset of the IMDb plain text data files is available from their FTP sites as follows: ftp.fu-berlin.de (Germany) ftp.funet.fi (Finland)
One of my favorites is Data.gov, where there are tons of public data from all sectors, in different sizes and formats, including API connections. This URL, http://www.data.gov/open-gov/ , lists each of the local governments in the US. They have varying degrees of completion at the local level.
The Government of Canada has an Open Data portal -- http://open.canada.ca/en/open-data -- it takes some digging to find the gems, but there are some.
There's also some open mapping data at -- http://open.canada.ca/en/open-maps.
This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open-source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions. https://cloud.google.com/bigquery/public-data/github
15.49TB of research data available. http://academictorrents.com/
Australia, New South Wales Open data http://data.nsw.gov.au/
USAFacts: Our Nation, in numbers. Federal, state, and local data from over 70 government sources.
What are some "Small Data Sets" available over the internet?
Small data is data that is small enough for human comprehension. A few thousand lines of credit data, marketing segmentation example data, and the B2B client contact history of a firm are some examples...
kaggle.com "Kaggle is a platform for predictive modeling and analytics competitions on which companies and researchers post their data and statisticians and data miners from all over the world compete to produce the best models. This crowdsourcing approach relies on the fact that there are countless strategies that can be applied to any predictive modeling task and it is impossible to know at the outset which technique or analyst will be most effective." There are multiple available small datasets that you can test your skills on: https://www.kaggle.com/c/informs2010 - The goal of this contest is to predict short term movements in stock prices. https://www.kaggle.com/c/axa-driver-telematics-analysis - Use telematic data to identify a driver signature. https://www.kaggle.com/c/sf-crime - Predict the category of crimes that occurred in the city by the bay. You may find 202 more under the following link https://www.kaggle.com/competitions/search?DeadlineColumnSort=Descending
Kaggle has started a section called Kaggle Datasets, which hosts public datasets that you can use freely; the datasets for the competitions were often restricted for use outside the competition. https://www.kaggle.com/datasets
Kaggle also has scripts for processing the given data sets: https://www.kaggle.com/scripts , which are usually in R or Python. It can be instructive to look at those and discern which parts can be pulled into standard Alteryx tools, and which parts left to a custom R call, for instance. The nice thing is that, once you've finished, you can submit your output to the relevant Kaggle competition (even after the fact) to see how your output stacks up to the competition.
"Small Data" set to test your skills on Duplicate Detection, Record Linkage, and Identity Uncertainty http://www.cs.utexas.edu/users/ml/riddle/data.html
Here is an addition from Europe: http://open-data.europa.eu/en/data/ "The European Union Open Data Portal is the single point of access to a growing range of data from the institutions and other bodies of the European Union (EU). Data are free for you to use and reuse for commercial or non-commercial purposes. By providing easy and free access to data, the portal aims to promote their innovative use and unleash their economic potential. It also aims to help foster the transparency and the accountability of the institutions and other bodies of the EU."
Does "Dictionary Sort Order" always place lower case letters before capital letters?
Yes. In the Sort-Configuration menu there is an option to "Use Dictionary Order". When checked it will sort in alphabetical order with lower case first (e.g., a, A, b, B, c, C, etc.).
If you do not have "Use Dictionary Order" checked, it will sort all Upper case first and then all lower case (e.g., A, B, C, a, b, c, etc.).
Check "Use Dictionary Order".
Dictionary Sort Order
Visit the sort help article or the attached workflow for more details.
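For anyone who wants to see the two orderings side by side outside of Designer, here is a minimal Python sketch. It only imitates the behavior described above (lower case before upper case within each letter, versus all upper case first); the key function `dictionary_key` is a hypothetical helper written for this example and is not how the Sort tool is implemented internally.

```python
def dictionary_key(text):
    """Sort key: compare case-insensitively first, then put lower case before upper case."""
    # swapcase() flips the case, so 'a' compares as 'A' (smaller) and 'A' as 'a' (larger),
    # which places the lower-case form first when the letters are otherwise equal.
    return (text.casefold(), text.swapcase())


letters = ["B", "a", "C", "A", "c", "b"]

# Plain sort compares code points, so every upper-case letter comes first.
code_point_order = sorted(letters)

# Dictionary-style sort keeps letters together, lower case first.
dictionary_order = sorted(letters, key=dictionary_key)

print(code_point_order)  # ['A', 'B', 'C', 'a', 'b', 'c']
print(dictionary_order)  # ['a', 'A', 'b', 'B', 'c', 'C']
```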
Is there a way to stop the Cross Tab Tool from reordering all the input information alphabetically? Simply add a RecordID to your records and add the RecordID field as a grouping field in your Cross Tab Tool to keep the order!
To stop the Cross Tab Tool from reordering the input information alphabetically, add a RecordID to all the records of the input. Then add RecordID as the Grouping Field in the Cross Tab Tool.
Add RecordID. Then in the Cross Tab tool, group by RecordID.
With the Python Tool, Alteryx can manipulate your data using everyone’s favorite programming language - Python! Included with the tool are a few pre-built libraries that go beyond the native Python download. This allows you to extend your data manipulation even further than one could ever imagine. The libraries installed are listed here - and below I’ll go into a bit more detail on what these libraries are and why they are so useful.
Each library is well documented, and there’s usually an introduction or examples on their sites to get you started on how a basic function in their library works.
ayx – Alteryx API – simply enough, we’re using Alteryx, sooo yea, kind of a requirement for the translation between Alteryx and Python.
jupyter – Jupyter metapackage – If you’ve used a Jupyter notebook in the past, you’ll notice the interface for the Python Tool is similar. This interface allows you to run sections of code outside of actually running the workflow, which makes understanding and testing your data that much easier.
matplotlib – Python plotting package – Any charting, plotting, or graphical needs you would want will be in this package. This provides a great deal of flexibility for whatever you want to visualize.
NumPy – NumPy, array processing for numbers, strings, records, and objects – Native Python processes data in what some would call a cumbersome way. For instance, if you wanted to make a matrix, such as a 4x4 table, you would need to create a list within a list, which can slow processing a bit. However, NumPy has its own “array” type that stores the data in this matrix pattern and allows for faster processing. Additionally, it has a bunch of methods for handling numbers, strings, and objects that make processing a whole lot easier and a whole lot faster.
pandas – Powerful data structures for data analysis, time series, and statistics – This is your staple for handling data within Alteryx. Those who have used Python, but never pandas, will enter a whole new beautiful world of data handling and structure. Data manipulation within Python is faster, cleaner, and easier to code with. The best part about it is that the Python Tool will read in your Alteryx data as a pandas data frame! Understanding this library should be one of the first things to know when tackling the Python code.
requests – Python HTTP for Humans – for all the connector/Download Tool fans out there. If any of you are familiar with making HTTP requests (API calls and the like), then you should introduce yourselves to this package and explore how Python performs these requests.
scikit-learn – a set of Python modules for machine learning and data mining – Welcome to the world of machine learning in Python! This library is your go-to for statistical and predictive modeling and evaluation. Any crazy and wild methods you’ve learned for machine learning will most likely be found here and can really push the boundaries of data science.
scipy – Scientific Library for Python – all your scientific and technical computing can be found here. This library builds on NumPy and works well alongside the other packages installed here, like pandas and matplotlib. Functions for mathematical models and formulae are usually located within this library and can help provide that higher level analysis of your data.
six – Python 2 and 3 compatibility utilities – For those who are unfamiliar, Python comes in two major versions, 2.x and 3.x (with 3.x being the most recent). Even though Python 3 is supposed to be the latest and greatest, there are still users out there who prefer Python 2, and syntax differences make integration between the two a bit tricky. The six module provides functions that work in both, so everyone can remain calm and happy! Its documentation usually notes which version each function most closely aligns with, so a user can get a better idea of its functionality.
SQLAlchemy – Database Abstraction Library – SQL in Python! Covers all your database needs from connecting to and extracting data, allowing it to interact with your Python code and thus, Alteryx itself.
statsmodels – statistical computations and models for Python – This library complements scikit-learn but focuses more on statistical tests and data exploration. Additionally, it utilizes R-style formulae with pandas data frames to fit models!
These are the libraries installed with the Python Tool, which can do almost any data function imaginable. Of course, if you’re looking to do something that these libraries don’t provide, there are myriad other Python libraries that I’m sure will help you with your use case. Most of these are also well documented in how to use so search away and let your mind float away in the beautiful cosmos created by Python.
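To make the NumPy point from the list above a bit more concrete, here is a small sketch comparing a 4x4 table built as a list within a list against the same table as a NumPy array. The numbers are arbitrary and the variable names are made up for this example; it runs in any Python environment where NumPy is installed, not only inside the Python Tool.

```python
import numpy as np

# Native Python: a 4x4 table is a list of four row lists holding 0..15.
nested = [[row * 4 + col for col in range(4)] for row in range(4)]

# Column sums with plain lists need an explicit loop over the columns.
col_sums_list = [sum(row[col] for row in nested) for col in range(4)]

# NumPy: the same table as a single 2-D array.
matrix = np.arange(16).reshape(4, 4)

# Column sums are one vectorized call along axis 0 (down the rows).
col_sums_array = matrix.sum(axis=0)

print(col_sums_list)
print(col_sums_array.tolist())
```

Both approaches give the same column totals; the array version simply pushes the looping into NumPy, which is where the speed-up on larger tables tends to come from.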
When your Python libraries don't work the way they should in the Python Tool, restoring the tool to its original state could be the solution. This article walks through how to restore the Python libraries and the virtual environment associated with the Python Tool.