Welcome to Data Profiling in the Browse tool, a new feature in Alteryx 11.0. We are excited about it and we hope you will be too! Data Profiling in the Browse tool was created to help users better understand the quality of their data at any point within the workflow, making it easier to troubleshoot and fix issues that may arise when attempting to parse, join, or output data.
Alteryx is designed to use all of the resources it possibly can. To make Alteryx run as fast as possible, it tries to balance the use of as much CPU, memory, and disk I/O as it can. The good news is that most of this resource utilization can be controlled: you can limit the amount of memory that is used at the system, user, or module level.

The Sort/Join memory setting is not a maximum memory usage setting; it's more like a minimum. One part of Alteryx (sorts) that benefits from having a big chunk of memory will take that entire amount right from the start. It will be split between all the sorts in your module, but other tools will still use memory outside that sort/join block, and some of them (e.g. drive times with a long maximum time) can use a lot. If a sort can be done entirely in memory, it will go faster than if Alteryx has to fall back to temp files, which is why it's good to set this value higher. But if total memory usage on the system pushes it into virtual memory, data will be swapped to disk in a much less optimal way and performance will be much worse, which is why setting it too high is the bigger concern.

The Default Dedicated Sort/Join Memory Usage can be found in the Designer at Options > User Settings > Edit User Settings.

Best Practices on Memory Settings

32-bit machines*: Keep the setting on the lower, conservative side. No matter how much physical RAM is installed, a 32-bit process has at most 1 GB available; as soon as the setting exceeds that, the machine will cross over into virtual memory and be unable to recover. A 32-bit machine should never have a setting over 1000 MB, and 512 MB is a good setting. Set it low (128 MB), especially when using Adobe products simultaneously with Alteryx.

64-bit machines: Set this in the system settings to half your physical memory divided by the number of simultaneous processes you expect to run. If you have 8 GB of RAM and run 2 processes at a time, your Sort/Join memory should be set to 2 GB.
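The 64-bit sizing rule above can be written as a quick back-of-the-envelope calculation. A minimal sketch in Python (the function name and MB-based units are my own, not part of any Alteryx API):

```python
def recommended_sort_join_mb(physical_ram_mb: int, simultaneous_processes: int) -> int:
    """Suggest a Sort/Join memory setting: half of physical RAM,
    divided evenly among the processes expected to run at once."""
    return (physical_ram_mb // 2) // simultaneous_processes

# 8 GB of RAM, 2 simultaneous processes -> 2048 MB (2 GB)
print(recommended_sort_join_mb(8192, 2))
```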
You might set it lower if you expect to be doing a lot of memory-intensive work on the machine besides Alteryx. You can also set your Dedicated Sort/Join Memory Usage lower or higher on a per-module basis depending on how your computer is used: if you are doing memory-intensive non-sort work (e.g. large drive-times), lower it; if you are doing memory-intensive sort work, raise it.
*Please refer to this link for additional details on 32-bit support for Designer
Here are the two most recommended best practices for optimizing module speed: Disable your Browse tools
Disable your Browse tools via the Module Properties window. This checkbox allows your module to run faster and take up less memory and temp space by preventing Alteryx from generating the content (temporary .yxdb files) that would otherwise be displayed in your Browse tools.
Close the Output Window
If you find that your module produces warnings, close the Output Window to speed things up. When asked to display hundreds or thousands of lines of information, this window can slow down your module. To reopen the Output Window, go to View > Show Output Window.
This is one of the questions I hear most often from clients, and some treat it as a personal challenge to see how fast they can get their module to run. However, in the famous words of our CTO Ned Harding: "Although Alteryx can be very fast, since it is such a general tool, it is only as good as the module that you have authored."
There are multiple strategies for improving the speed of a workflow, from using a Select tool to reduce field sizes to adjusting the default Sort/Join memory; however, the first fundamental step is benchmarking.
The recommended process is to run the workflow three times to ensure the data has been cached.
If the data has not been cached, run times will be slower, so running the workflow three times makes for a fair test.
If you want to run without cached data, you will have to reboot your machine between runs.
Once you have the total time the workflow takes to complete/run you can now look at optimizing.
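If you script your runs, the three-run convention above is easy to automate. A minimal sketch (the `benchmark` helper is hypothetical; substitute your actual workflow launcher, such as a call to the Alteryx engine command line, for the placeholder callable):

```python
import time

def benchmark(run, repeats=3):
    """Time a workflow-run callable several times; the first run warms
    the cache, so later runs reflect cached performance."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        timings.append(time.perf_counter() - start)
    return timings

# Example: substitute a real workflow launcher for the lambda.
times = benchmark(lambda: sum(range(100000)))
print(len(times))  # 3
```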
Optimizing your workflow!
Alteryx is designed to use all of the resources it possibly can. In order to make Alteryx run as fast as possible, it tries to balance the use of as much CPU, memory, and disk I/O as possible.
Set your Dedicated Sort/Join Memory Usage lower or higher on a per-Workflow basis depending on the use of your computer.
Sort work refers to the sort tool and other similar tools in re-ordering your data. Join work refers to any of the join processes.
If you are doing memory-intensive non-sort work (e.g. large drive-times), lower it!
If you are doing memory-intensive sort work, raise it.
Go to the Workflow Configuration > Runtime tab > Dedicated Sort/Join Memory Usage > Use Specific Amount
The Sort/Join memory setting is not a maximum memory usage setting; it's more like a minimum. This allocated memory will be split between all the tools that sort in your workflow, but other tools will still use memory outside that sort/join block, and some of them (e.g. drive times with a long maximum time) can use a lot.
Where do I find the Sort/Join memory options?
To set a user level default dedicated Sort/Join Memory Usage, go to Options > User Settings > Edit User Settings > Defaults tab.
The global Default Dedicated Sort/Join Memory Usage at System level can be found at Alteryx > Options > Advanced Options > System Settings > Engine > Default sort/join memory usage (MB).
*For machine bit-version memory considerations, please see here.
Lean for more speed!
A best practice to optimize the performance of your workflows is to remove data that won’t be needed for downstream processing as quickly as possible. You can always bring that data back into the workflow later if necessary.
The select tool removes fields or columns from your data. Other tools such as join, join multiple, spatial match, find nearest, and to a certain degree Transform tools and Reporting tools have some select functionality that you can utilize within the tool to reduce the need to add additional select tools.
Another good way to optimize workflow performance is using the filter tool to remove unnecessary data.
The Filter tool queries records in your file that meet specified criteria, such as ZIP = 01001, and identifies those records in your data. You may choose to handle records that come from the True output differently than those from the False output by connecting additional tools to either side. This allows smaller amounts of data to be passed downstream.
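The True/False split the Filter tool performs can be sketched in a few lines of Python (the field name and ZIP values here are illustrative only):

```python
# A minimal sketch of the Filter tool's True/False split.
records = [
    {"Name": "A", "ZIP": "01001"},
    {"Name": "B", "ZIP": "02139"},
    {"Name": "C", "ZIP": "01001"},
]

# Records matching the criterion go to the True output,
# everything else to the False output.
true_output = [r for r in records if r["ZIP"] == "01001"]
false_output = [r for r in records if r["ZIP"] != "01001"]

print(len(true_output), len(false_output))  # 2 1
```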
Auto Field Tool
Optimize your workflow for speed by setting the field type to the most efficient type and smallest possible size.
Use the auto field tool right after your Input Data tool to assign the most efficient type and size to your fields.
Another benefit of using the auto field tool is that it will reduce the size of your output file.
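The idea behind the Auto Field tool, picking the narrowest type that still holds every value, can be sketched as follows. This is a simplified model under my own assumptions: only Alteryx's integer types (Byte, Int16, Int32, Int64) are considered, with Byte treated as unsigned 0-255 and the rest as signed.

```python
def smallest_numeric_type(values):
    """Pick the narrowest integer type that holds every value,
    loosely mirroring what the Auto Field tool does for numeric columns."""
    lo, hi = min(values), max(values)
    if 0 <= lo and hi <= 255:
        return "Byte"                      # unsigned 8-bit
    for name, bits in (("Int16", 16), ("Int32", 32), ("Int64", 64)):
        if -(2 ** (bits - 1)) <= lo and hi < 2 ** (bits - 1):
            return name                    # signed integer types
    raise ValueError("values exceed 64-bit integer range")

print(smallest_numeric_type([3, 200]))        # Byte
print(smallest_numeric_type([-5, 30000]))     # Int16
```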
Enable Performance Profiling
This option will show you a millisecond and percentage breakdown per tool in your workflow.
Having this breakdown will allow you to pinpoint the slower tools/processes in your workflow and use the methods suggested in this article to improve that tool/process.
Performance profiling can be found under Workflow > Runtime > Enable Performance Profiling.
Disable All Browse tools
The Browse tool quickly becomes a data artisan's best friend: it allows you to see and review the entire dataset at any given step of the workflow-building process. However, each of these Browse tools creates a temporary .yxdb file, and writing these files takes time and slows down processing.
There is an option to simply disable them so they can be easily re-enabled when needed. This setting can be found at Workflow > Runtime > Disable All Browse Tools.
Set your limits: Record Limit for the Inputs
When developing your Workflow, there is no need to bring in all your data during testing.
Use the Record Limit option in the Properties window for the Input tool to bring in just enough records for testing.
If you want to set limits for all input tools in your workflow, you can also do this under the Runtime tab under Workflow – Configuration.
The tool container allows the user to organize a workflow better by combining tools in logical groups.
Tool Containers can be disabled to run only certain portions of the workflow, effectively bypassing tools for a quicker run.
Designer now has the ability to cache data from relational databases through the input tool.
When checked, data is stored in an yxdb file on disk so that data sources are not hit repeatedly during workflow development.
Data can only be cached when running a workflow in an Alteryx Designer session. The setting is ignored when the workflow is run in the scheduler, in the gallery, or from the command line.
The connection progress is a great way to keep track of the number of records and the size of the data going from one tool to another. In addition, the thickness of the connection itself varies depending on the size of the data passing through it (great for troubleshooting).
The default setting for the Connection Progress is "Show Only When Running"; however, setting it to "Show" will let you investigate the size of the data at certain points at any time (Properties for the Canvas > Connection progress).
If you want more detail on any of the points mentioned above make sure to check out the great Tips and Tricks articles from Margarita Wilshire et al!
Tips & Tricks 2016
Tips & Tricks 2015
Tips & Tricks 2014
I have had several questions from clients over the last few weeks looking to use Alteryx as the ultimate ‘middle man’ between databases and their end output, without explicitly writing anything to memory in the process. Alteryx will happily oblige in this domain whilst also providing seamless data blending & advanced analytics in the processes.
Here are some potential ways you can achieve this goal:
Using our In-Database tools means you never actually bring data down into memory, but when you are ready you can use our Data Stream Out Tool to seamlessly pass the data into our analytic tool set or to your output location.
No Browse Tools
In addition to the above point if you choose not to add Browse Tools, Alteryx will only cache around 1 MB of data at each tool in a temp location. This temp location is then cleared when you close Alteryx. Therefore, it is only kept in memory for the duration of the development of the workflow rather than indefinitely. Your default temp location can be found in Workflow Properties.
Changing workflows or Applications
When writing out data you need not write to hard-coded paths - you can reference "%temp%" in the file path. This will write to the default temp location set in Workflow Properties, outlined above. You can then reference this file location in the next workflow by using a combination of the Directory, Sort, Sample, and Dynamic Input tools to read in that file. By default, Alteryx writes an Alteryx Engine file for each run, and using the above tools allows you to dynamically read in the latest file and data. The bonus is that these engine files get cleared out on a scheduled basis, so the cached data will not stick around indefinitely. The workflow depicting this (attached) was built in Alteryx 10.6.
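The Directory > Sort > Sample > Dynamic Input pattern described above boils down to "find the newest matching file". A minimal Python sketch of that idea (the helper name and default pattern are my own):

```python
import glob
import os

def latest_file(folder, pattern="*.yxdb"):
    """Return the most recently modified file matching the pattern --
    the same idea as the Directory -> Sort -> Sample (first 1) chain."""
    candidates = glob.glob(os.path.join(folder, pattern))
    if not candidates:
        return None
    return max(candidates, key=os.path.getmtime)
```

A Dynamic Input tool would then read whichever path this logic selects, always picking up the latest engine file.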
Output to the database or via one of our connectors which use an API
You can utilize the method mentioned above in 'No Browse Tools'; however, at the end of your workflow you can output directly to a database using one of our connectors via an API. Since Alteryx and Tableau work together, clients often use the Publish to Tableau Server macro to take data directly from an Alteryx workflow up to Tableau without keeping any data in memory. Again, the 1 MB of cached data will be removed with the Alteryx engine files, and Alteryx pushes the data via a 'POST' command directly to the Tableau Server.
Any tool that performs a Sort will increase processing time. Many tools in Alteryx utilize a sorting process as a facet of their inherent functionality or if a group by option is used. The Summarize tool does this, for example. Alteryx has the ability to remember Sorts, or said another way, it knows if data has already been sorted upstream. This is a behavior that you can take advantage of as yet another means of optimizing your module’s processing. In the example shown, the process in Green is more efficient, as sorting will only be performed on the data once. For the process in Red, sorting will have to be performed twice.
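The same principle applies outside Alteryx: any group-by that requires sorted input can reuse one upstream sort instead of sorting again. A small Python illustration using itertools.groupby, which, like grouping in the Summarize tool, relies on pre-sorted data:

```python
from itertools import groupby

rows = [("East", 5), ("West", 3), ("East", 2), ("West", 4)]

# Sort once; both the summarize step and any later ordered output
# can reuse this sorted stream instead of sorting a second time.
rows_sorted = sorted(rows, key=lambda r: r[0])

totals = {region: sum(v for _, v in grp)
          for region, grp in groupby(rows_sorted, key=lambda r: r[0])}
print(totals)  # {'East': 7, 'West': 7}
```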
Tools that Sort
Tools that use Sorts:
Tools that use Temp Files and/or Internal Sorts:
Block Until Done
Multi Row Formula
Distance (Guzzler Only)
Trade Area (Guzzler Only or Non-Overlapping)
Run Command (Temp files only)
Processes that use Temp Files and/or Internal Sorts:
Batch Macros (for secondary inputs)
Iterative Macros (for secondary inputs)
Behavior Create Profile
When bringing data into Alteryx, many users add a Select tool to check the data types and structure. Data types are very important because the operations/functions available in tools adjust to fit the data type being used!
You'll notice some of the menus have been rearranged in version 10+. So if you are looking to export a workflow package, you will find that under the Options menu, Export Workflow . By default, you will see all the assets that the workflow depends on to run properly, but sometimes you will want to include additional files to package with the workflow.
Analytic Starter Kits teach you how to prep, blend, and analyze your data faster than using tools like Excel or SQL coding; guide you through how to build sophisticated insights such as market basket analysis or predictive A/B testing; and show you how to output beautiful Tableau, Microsoft Power BI, Qlik, or Salesforce Wave visualizations to explore and understand your data and insights.
Copy: Data can be copied directly from a Browse tool; just highlight the cells by clicking and dragging, much like you would in an old-fashioned spreadsheet. In the top right-hand corner of the Browse tool's Properties, click the "Copy to Clipboard" button and four choices will appear. For this demonstration, we have chosen "Selected Cells with Headers"; the data is now copied to our clipboard.
Paste: To paste the data into a module, try the technique mentioned above. Simply right click on your canvas and click Paste. Bonus Thought: Suppose you are developing a large module, and you’ve employed Tool Containers as a part of your module’s design process. How might you use the above method of copy/pasting data to eliminate the need to Run the completed portions of your module, but still supply data to downstream processes that you are working on? Answer: Place a Browse at the output of a completed Tool Container process; copy the contents of the Browse and paste them onto the Canvas as a Text Input. Disable the Tool Container and use the Text Input as a substitute feed while you complete your module.
Truncated data is usually defined as numeric rounding or cut off, string shortening, or datum deletion - essentially any time information is lost. Since, as analysts, our insights are only as good as our data, we usually find ourselves trying to preserve the integrity of data as we’re processing it. That’s not to say we can’t also optimize our resources usage when our data will allow for it.
While keeping your data types as small as possible is important, and can serve to shorten run times, it is even more important to understand what data types are most accommodating to your data and in what situations they can be shortened without truncation. If you’ve ever seen Office Space, you understand just how important even fractions of a cent are – and accuracy doesn’t just apply to financial data.
To avoid truncation in your data you have to first explore your data types in the context of your analyses. What data types and sizes did you receive the data in and why? What format would we like to see our results in? Are there opportunities to reduce memory usage between those two? And finally - what operations will we perform on the fields in our workflows that may impact each data type and size? The answers to these questions will be unique to each dataset, but once they’re addressed you can use the same techniques to keep your data both optimized and accurate.
Start by identifying the data types that most closely fit your fields based on the questions above. If you want Alteryx’s best guess, try using the Auto Field tool to assign optimized data types and sizes automatically. While this tool is immensely helpful, be sure to check that the output is not truncating data or leaving it in a form less conducive to your downstream analyses – the Auto Field tool doesn’t know the answers to your questions above. You can have the best of both worlds by adjusting the assignments from the Auto Field tool, where necessary, by placing a Select tool (master it here) just afterwards. You’ll then have suggestions and be able to change the less accurate/accommodating assignments by hand. Some things to consider:
If you perform string operations later in your workflow that may increase string length, pay close attention to whether values are being truncated once the maximum string size is reached. String and WString (the latter accepts Unicode characters) types are of fixed length and will drop any characters that exceed their size. On the other hand, V_String and V_WString (Unicode) are of variable length and will adjust to accommodate strings after assignment.
Numerics may seem the most intuitive of the bunch, but pay very close attention to the precision of each type so as to avoid unintentional rounding. Fixed Decimal is the only type with an adjustable length – the rest may force your data to fit if not assigned the correct type.
Dates are not always input in the standard ISO format yyyy-mm-dd HH:MM:SS, and may require some converting in order to handle them as Date/Time types in the Designer (this is important when trying to sort or filter by date or apply Date/Time functions in formulas). Any other date format will be interpreted as a string, and treated as such, unless converted using the DateTimeParse() function, the DateTime Tool, the Parse Dates Macro, or the fan favorite BB Date Macro. If you need to return to your original formatting, or simply prefer another, you can always convert your dates back into another format string after they’ve been processed using the DateTimeFormat() function or the DateTime Tool.
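The DateTimeParse()/DateTimeFormat() round trip described above looks like this in Python (the input string and format specifiers are illustrative assumptions):

```python
from datetime import datetime

# Convert a non-ISO date string into the standard format -- the same
# job DateTimeParse() and DateTimeFormat() do in the Designer.
raw = "03/15/2017 14:30:00"            # assumed US-style input format
parsed = datetime.strptime(raw, "%m/%d/%Y %H:%M:%S")
iso = parsed.strftime("%Y-%m-%d %H:%M:%S")
print(iso)  # 2017-03-15 14:30:00
```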
How can I share my workflows if my recipients don’t have the same files they reference? Sending them all over separately and reconfiguring inputs separately seems like a lot of work. Halp!
You’re right, that does sound like a lot of work. Luckily, we have an export feature to help with this exact scenario:
Navigating in your Designer to Options >> Export Workflow will open a menu where you can include assets that are referenced in your workflow and export them into a single .yxzp file to be shared. Feel free to select whichever files you’d like to include – if a file is missing, try attaching an additional asset/file to the tool it’s associated with. At that point it should appear in your export options:
Note: If your workflow uses a database connection, your recipient will need to either 1) have a DSN set up (system preferred) for the connection string used or 2) use the same alias to establish a connection in their environment.
After you select the files to be included, select “Browse” to specify the location and name of the export:
You should be all set! Share this export with your files included.
To open an exported workflow, simply File >> Open Workflow >> Browse in Alteryx or double click the green .yxzp file like any other Alteryx file type:
You should then see a prompt stating that the file is an Alteryx Package; just hit “Yes” to begin the import:
You’ll then be given the option to change the directory that the export is extracted to; below that option you will see the exported files listed with their locations relative to the destination directory:
Select “Import” to proceed:
Once the export has successfully extracted, you will be given a notification that the import process has completed. Select “Yes” and your exported workflow should already be loaded and ready to run!
Where can I find available "Big Data Sets" on the internet?
Big data is data whose size is beyond the ability of commonly used software tools to manage and process within a tolerable elapsed time. A year-long credit card transaction history, the CDRs (Call Detail Records) of a telecom company for the last 9 months, or the behavioral credit data of a large financial institution are some examples...
Amazon (AWS) has a Large Data Sets Repository
Data.gov has close to 190k public data sets
One of the standard datasets for Hadoop is the Enron email dataset comprising emails between Enron employees during the scandal. It's a great practice dataset for dealing with semi-structured data (file scraping, regexes, parsing, joining, etc.). It's ~400MB (compressed) and available for download at http://www.cs.cmu.edu/~enron/
Collection of audio features and metadata for a million contemporary popular music tracks: http://labrosa.ee.columbia.edu/millionsong/
SecondHandSongs dataset -> cover songs
musiXmatch dataset -> lyrics
Last.fm dataset -> song-level tags and similarity
Taste Profile subset -> user data
thisismyjam-to-MSD mapping -> more user data
tagtraum genre annotations -> genre labels
Top MAGD dataset -> more genre labels
You can either download the entire dataset (280 GB) or a subset of 10,000 songs (1.8 GB) for a quick taste.
GDELT set: http://www.gdeltproject.org/data.html
NY City taxi data sets 1.1BN records: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance An open-source exploration of the city's neighborhoods, nightlife, airport traffic, and more, through the lens of publicly available taxi and Uber data
Airline data set 1987-2008: https://github.com/h2oai/h2o-2/wiki/Hacking-Airline-DataSet-with-H2O
Google BigQuery sample tables - hosted by Google: https://cloud.google.com/bigquery/sample-tables Weather data, a timeline of actions such as pull requests and comments on GitHub repositories (with a nested or flat schema), US births 1969-2008, Shakespeare word counts, and Wikipedia revisions with over 300 million rows.
LENDING CLUB: https://www.lendingclub.com/info/download-data.action
Here is a telecom Italia dataset as a result of a computation over the Call Detail Records (CDRs) generated by the Telecom Italia cellular network over the city of Milano; You may have to sign-in and activate your account but it's totally free... https://dandelion.eu/datagems/SpazioDati/telecom-sms-call-internet-mi/description/
Data Science Central: http://www.datasciencecentral.com/profiles/blogs/big-data-sets-available-for-free
KD Nuggets is a well-respected analytics blog; they have put together a very nice and deep list: http://www.kdnuggets.com/datasets/index.html
UK Data https://data.gov.uk/data
Google's Public Data Directory: http://www.google.com/publicdata/directory
For the Spatial and GIS folks: http://gisgeography.com/best-free-gis-data-sources-raster-vector/
The mother of big datasets - Reddit. 1.7bn JSON objects; 250GB compressed. https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment
Loads of really great links from here as well: https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public
http://www.imdb.com/interfaces A subset of the IMDb plain text data files is available from their FTP sites as follows: ftp.fu-berlin.de (Germany) ftp.funet.fi (Finland)
One of my favorites is Data.gov, where there are tons of public data from all sectors, in different set sizes and different formats, including API connections. This url, http://www.data.gov/open-gov/ , shows each of the local governments in the US. They have varying degrees of completion on the local level.
The Government of Canada has an Open Data portal -- http://open.canada.ca/en/open-data -- it takes some digging to find the gems, but there are some.
There's also some open mapping data at -- http://open.canada.ca/en/open-maps.
This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open-source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions. https://cloud.google.com/bigquery/public-data/github
15.49TB of research data available. http://academictorrents.com/
Australia, New South Wales Open data http://data.nsw.gov.au/
USAFacts: Our Nation, in numbers. Federal, state, and local data from over 70 government sources.
What are some "Small Data Sets" available over the internet?
Small data is data that is small enough for human comprehension. A few thousand lines of credit data, example marketing segmentation data, or the B2B client contact history of a firm are some examples...
kaggle.com: "Kaggle is a platform for predictive modeling and analytics competitions on which companies and researchers post their data and statisticians and data miners from all over the world compete to produce the best models. This crowdsourcing approach relies on the fact that there are countless strategies that can be applied to any predictive modeling task and it is impossible to know at the outset which technique or analyst will be most effective." There are multiple small datasets available that you can test your skills on:
https://www.kaggle.com/c/informs2010 - The goal of this contest is to predict short-term movements in stock prices.
https://www.kaggle.com/c/axa-driver-telematics-analysis - Use telematic data to identify a driver signature.
https://www.kaggle.com/c/sf-crime - Predict the category of crimes that occurred in the city by the bay.
You may find 202 more under the following link: https://www.kaggle.com/competitions/search?DeadlineColumnSort=Descending
Kaggle has started a section called Kaggle Datasets, which hosts public datasets you can use freely; the datasets for the competitions were often restricted for use outside the competition. https://www.kaggle.com/datasets
Kaggle also has scripts for processing the given data sets: https://www.kaggle.com/scripts , which are usually in R or Python. It can be instructive to look at those and discern which parts can be pulled into standard Alteryx tools and which parts are best left to a custom R call, for instance. The nice thing is that, once you've finished, you can submit your output to the relevant Kaggle competition (even after the fact) to see how your output stacks up against the competition.
"Small Data" set to test your skills on Duplicate Detection, Record Linkage, and Identity Uncertainty http://www.cs.utexas.edu/users/ml/riddle/data.html
Here is an addition from Europe...http://open-data.europa.eu/en/data/ "The European Union Open Data Portal is the single point of access to a growing range of data from the institutions and other bodies of the European Union (EU). Data are free for you to use and reuse for commercial or non-commercial purposes. By providing easy and free access to data, the portal aims to promote their innovative use and unleash their economic potential. It also aims to help foster the transparency and the accountability of the institutions and other bodies of the EU."
Data Integrity refers to the accuracy and consistency of data stored in a database, data warehouse, data mart or other construct, and it is a fundamental component of any analytic workflow. In Alteryx, creating a macro to compare expected values to actual values in your data is quite simple and provides a quality control check before producing a visual report. Let me show you how to build this.
The two inputs represent the actual and expected values in your data. These data streams are passed through a Record ID tool to keep positional integrity and then passed on to the Transpose tool to create two columns. The first column contains the field names and the second column shows the values within each field. This data is then passed on to a join, matching on Record ID and the Name of the field, in order to compare each value. Lastly, if the data does not match from expected to actual, a custom message will appear in the results messages alerting the user where the mismatch happened within the dataset. The image below shows the error message produced if values differ across datasets.
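The transpose-and-join comparison the macro performs can be sketched in plain Python (the record layout and message wording are illustrative, not the macro's actual output):

```python
def compare_records(expected, actual):
    """Field-by-field comparison of two record streams, reporting a
    message for each mismatch -- the same idea as the transpose/join
    macro described above."""
    mismatches = []
    for record_id, (exp, act) in enumerate(zip(expected, actual), start=1):
        for field in exp:
            if exp[field] != act.get(field):
                mismatches.append(
                    f"Record {record_id}, field '{field}': "
                    f"expected {exp[field]!r}, got {act.get(field)!r}")
    return mismatches

expected = [{"Sales": 100, "Region": "East"}]
actual = [{"Sales": 100, "Region": "West"}]
print(compare_records(expected, actual))
```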
A few months ago, I found the free space on my hard drive shrinking significantly on a daily basis, but for the life of me I couldn't figure out the cause. As it turned out, I was suffering from a Windows issue that I was not alone in dealing with. I was able to get the issue resolved by our IT department, but not before I spent a fair amount of time searching through my directories, compressing and deleting files, running Disk Cleanup, etc., none of which was particularly helpful. So, I decided to look to Alteryx for an answer.
I created a simple workflow (attached and created in 11.0) which uses the Directory tool to read the attributes of all of the files on my computer. Then, I grouped the files by directory and summarized the file size. After a sort, I instantly found the culprit. There were hundreds of CAB files (a Windows archive file format) that were 100 MBs each which were being created erroneously by Windows. I was able to share this with my IT associate which helped him greatly in diagnosing and subsequently fixing the issue.
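For readers who want a scriptable version, the Directory tool > Summarize > Sort chain described above is easy to sketch in Python (the helper name is my own):

```python
import os
from collections import defaultdict

def size_by_directory(root):
    """Walk a directory tree and total file sizes per folder, mirroring
    the Directory tool -> Summarize -> Sort workflow described above."""
    totals = defaultdict(int)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                totals[dirpath] += os.path.getsize(path)
            except OSError:
                pass  # skip unreadable files
    # Largest directories first, like the final Sort in the workflow.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```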
Most of you are unlikely to be experiencing the same issue (but if you are, you're welcome ). However, this workflow may come in handy for other reasons. If anything, just to help you manage your files and disk space by pointing out which files are hogging up all of your hard drive space!
Thanks for reading!