This site uses different types of cookies, including analytics and functional cookies (its own and from other sites). To change your cookie settings or find out more, click here. If you continue browsing our website, you accept these cookies.
To get a better understanding of how to properly leverage a machine’s resources to use Alteryx, it can be very helpful to understand how the Alteryx Engine functions. To clear up any haziness surrounding the term “Alteryx Engine”, this article covers what happens when you click the Run Button
Unbeknown to many, by default Alteryx saves your open workflows every 10 minutes, keeps the last three iterations of them, and keeps them around for 30 days since the last save. As you can imagine, many people are very happy to find this out when they frantically come to Support hoping for some way to resurrect a deleted or corrupted workflow. I have always found Alteryx to be an incredibly stable program, but it is nice knowing that my workflows will still be there in an unfortunate emergency, self-inflicted or otherwise.
From my experience, this is something that a lot of people either ignore or don't realize is there. What most people do realize is that for some reason, every one of their workflows has a similarly named copy, but with a .bak extension. The important thing to know is that when you least expect it, this file can become a lifesaver if something happens to your original workflow, so please don't delete them!
To recover autosaved files: Go to File > Open Autosaved Files
To change the settings: Go to the 'Advanced' tab under Options > User Settings > Edit User Settings
Open an Autosaved Workflow
To open an autosaved file, simply go to the menu File > Open Workflow > Open Autosaved Files:
The following screen will open:
You simply click on the iteration of the workflow that you want to restore and then you have the choice of either opening that workflow or saving it to a directory. If you open the workflow directly from this screen, Alteryx will open up a temporary version of that workflow that you will need to save to an appropriate directory of your choosing.
Edit the Autosave Defaults
To edit the Autosave settings, go to 'Advanced' tab under Options > User Settings > Edit User Settings.
The Autosave options are highlighted in yellow below. As you can see below, you can adjust the interval in minutes that the workflow is saved, how many iterations of the workflow to save and how many days to keep the file.
As you can see, the Autosave feature is very straightforward, while at the same time it can be the most important feature you ever used!
Check it out in our Help too!
Alteryx includes a printing feature which allows you to print out a workflow you've developed.
To Print, navigate to File > Print
To change the Print Settings: File > Print Settings
This is a simple feature that does not require much explanation, so I will make this quick.
The Print Screen
Print the workflow or select a printer with the drop-down arrow.
Set the zoom level of the preview window with the down arrow or fit to the page by clicking on the magnifying glass.
Choose how many tiles to print to; larger workflows may require multiple tiles.
Scroll through the workflow pages.
Change the scale so that the entire workflow fits on a single page or the original or custom scale.
Change the format to Landscape or Portrait.
The preview of your workflow.
This feature is very useful when reviewing the overall workflow parts with colleagues. Enjoy!
Everyone has their own way of organizing their window setup in Alteryx and there are a few tricks to make your canvas as user-friendly as possible.
My new favorite is the Ctrl+Tab and holding down Ctrl. Similar to toggling internet browser windows, this allows the user to toggle between workflows, but also tool windows. You can then place tool windows in the same position and then toggle between them. This will allow more room on the user's canvas since all of the tool windows will be in one spot!
Another great feature to open up canvas space is the Auto Hide. This will create tabs for you with each of the tool windows and hide them until you select the tab or use the feature above. When your window is docked you can hover over the pin (see below), click the pin, and your window will change to Undockable and Auto Hide. You should then see a tab on the screen with the window's name.
We have many ways to configure your tool windows in Alteryx, so don't forget about the other options you have when you select the down arrow on the tool window (see below). These options can be beneficial when optimizing your view.
Floating: Allows users to move the window around the canvas as you please. This also allows you to move the window to another monitor.
Dockable: Allows users to place their window in a certain position on the canvas and will not move when selected or viewed.
Tabbed Document: This configuration will place your tool window in the same position as if it was another workflow.
Auto Hide: Hides the tool window until the tab is selected.
Hide: Hides the tool window.
Hope these tips help keep your canvas view!
Welcome to Alteryx 11.0 new feature Data Profiling in the Browse Tool. We are excited about this and we hope you will be too! The Data Profiling in the Browse tool was created to assist the user in better understanding the quality of their data at any point within the workflow. This option will assist users in understanding the quality of their data, assist in troubleshooting and help fix issues that may arise when attempting to parse, join, or output their data.
Alteryx is designed to use all of the resources it possibly can. In order to make Alteryx run as fast as possible, it tries to balance the use of as much CPU, memory, and disk I/O as possible. The good news is that most of the resource utilization can be controlled. You can limit the amount of memory that is used on a system, user, or module level. The Sort/Join memory setting is not a maximum memory usage setting; it’s more like a minimum. One part of Alteryx (sorts) that benefits from having a big chunk of memory will take that entire amount right from the start. It will be split between all the sorts in your module, but other tools will still use memory outside that sort/join block. Some of them (e.g. drive times with a long maximum time) can use a lot. If a sorting can be done entirely in memory, it will go faster than if we have to fall back to temp files, so that’s why it’s good to set this higher. But if the total memory usage on the system pushes it into virtual memory, you’ll be swapping data to disk in a much less optimal way, and performance will be much worse and that’s why setting it too high is a bigger concern. The Default Dedicated Sort/Join Memory Usage can be found in the Designer at Options > User Settings > Edit User Settings Best Practices on Memory Settings 32-bit machines*: Setting should be on the lower, conservative side. No matter how much actual RAM is there, only has at maximum 1 GB available, as soon as it is set higher, the machine will cross over into virtual memory and be unable to recover. A 32-bit machine should never have a setting over 1000MB, and 512 is a good setting. Set it low (128 MB), especially when using Adobe products simultaneously with Alteryx. 64-bit machines: Set this in the system settings to half your physical memory divided by the number of simultaneous processes you expect to run. If you have 8 GB of RAM and run 2 processes at a time, your Sort/Join memory should be set to 2GB. You might set it lower if you expect to be doing a lot of memory intensive stuff on the machine besides Alteryx Set your Dedicated Sort/Join Memory Usage lower or higher on a per-module basis depending on the use of your computer, doing memory intensive non-sort work (i.e. large drive-times) then lower it, doing memory intensive sort-work then higher.
*Please refer to this link for additional details on 32-bit support for Designer
This question is one of the most commonly asked questions from clients I interact with and some treat it as a personal challenge to see how fast they can get their module to run. However, in the famous words of our CTO Ned Harding "Although Alteryx can be very fast, since it is such a general tool, it is only as good as the module that you have authored".
There are multiple strategies to improving the speed of a workflow from using a select tool to reduce field sizes or looking at the default sort join memory, however the first fundamental process is benchmarking.
The recommended process is to run the workflow three times to ensure the data has cached.
If the workflow has not cached the data then this can cause slower run times, so to ensure this is a fair test running the workflow three times should ensure all the data is cached.
If you want to run without cached data you will have to reboot your machine between runs.
Once you have the total time the workflow takes to complete/run you can now look at optimizing.
Optimizing your workflow!
Alteryx is designed to use all of the resources it possibly can. In order to make Alteryx run as fast as possible, it tries to balance the use of as much CPU, memory, and disk I/O as possible.
Set your Dedicated Sort/Join Memory Usage lower or higher on a per-Workflow basis depending on the use of your computer.
Sort work refers to the sort tool and other similar tools in re-ordering your data. Join work refers to any of the join processes.
If you are doing memory intensive non-sort work (i .e . large drive-times) then lower it!
If you are doing memory intensive sort-work then higher it.
Go to the Workflow Configuration > Runtime tab > Dedicated Sort/Join Memory Usage > Use Specific Amount
The Sort/Join memory setting is not a maximum memory usage setting; its more like a minimum, this allocated memory will be split between all the tools that sort in your workflow, but other tools will still use memory outside that sort/join block. Some of them (e .g . drive times with a long maximum time) can use a lot.
Where do I find the Sort/Join memory options?
To set a user level default dedicated Sort/Join Memory Usage, go to Options > User Settings > Edit User Settings > Defaults tab.
The global Default Dedicated Sort/Join Memory Usage at System level can be found at Alteryx > Options > Advanced Options > System Settings > Engine > Default sort/join memory usage (MB).
*******For machine bit version memory considerations please see here.
Lean for more speed!
A best practice to optimize the performance of your workflows is to remove data that won’t be needed for downstream processing as quickly as possible. You can always bring that data back into the workflow later if necessary.
The select tool removes fields or columns from your data. Other tools such as join, join multiple, spatial match, find nearest, and to a certain degree Transform tools and Reporting tools have some select functionality that you can utilize within the tool to reduce the need to add additional select tools.
Another good way to optimize workflow performance is using the filter tool to remove unnecessary data.
The filter tool queries records in your file that meet specified criteria and identifies these records in your data, such as ZIP = 01001 . You may choose to handle records that come from the True output differently than the false output by connecting additional tools to the workflow on either side. This will allow smaller amounts of data to be passed downstream.
Auto Field Tool
Optimize your workflow for speed by setting the field type to the most efficient type and smallest possible size.
Use the auto field tool right after your Input Data tool to assign the most efficient type and size to your fields.
Another benefit of using the auto field tool is that it will reduce the size of your output file.
Enable Performance Profiling
This option will allow you to see a milliseconds and percentage breakdown per tool in your workflow.
Having this breakdown will allow you to pinpoint the slower tools/processes in your workflow and use the methods suggested in this article to improve that tool/process.
Performance profiling can be found Workflow > Runtime > Enable Performance Profiling.
Disable All Browse tools
The Browse tool quickly becomes a data artisans best friend, it allows to see/review the entire data at any given step in the workflow building process, however, each of these browse tools creates a temporary yxdb and writing these files do take some time and slow down the processing.
There is an option to simply disable them so they can be easily enabled if need it. This setting can be found at Workflow > Runtime > Disable All Browse Tools.
Set your limits: Record Limit for the Inputs
When developing your Workflow, there is no need to bring in all your data during testing.
Use the Record Limit option in the Properties for the Input to bring enough records for testing.
If you want to set limits for all input tools in your workflow, you can also do this under the Runtime tab under Workflow – Configuration.
The tool container allows the user to organize a workflow better by combining tools in logical groups.
Tool Containers can be disabled to run only certain portions of the workflow, effectively bypassing tools for a quicker run.
Designer now has the ability to cache data from relational databases through the input tool.
When checked, data is stored in an yxdb file on disk so that data sources are not hit repeatedly during workflow development.
Data can only be cached when running a workflow in an Alteryx Designer session. The setting is ignored when the workflow is run in the scheduler, in the gallery, or from the command line.
The Connection progress is a great way to keep track of the number of records and the size of the data going from one tool to another. In addition to that, the thickness of he connection itself varies depending on the size of data passing through (great for troubleshooting).
The default setting for the Connection Progress is “Show Only When Running” however leaving this set as ‘Show” will allow you to investigate the size of the data at certain points permanently (Properties for the Canvas > Connection progress).
If you want more detail on any of the points mentioned above make sure to check out the great Tips and Tricks articles from Margarita Wilshire et al!
Tips & Tricks 2016
Tips & Tricks 2015
Tips & Tricks 2014
I have had several questions from clients over the last few weeks looking to use Alteryx as the ultimate ‘middle man’ between databases and their end output, without explicitly writing anything to memory in the process. Alteryx will happily oblige in this domain whilst also providing seamless data blending & advanced analytics in the processes.
Here are some potential ways you can achieve this goal:
Using our In-Database tools means you never actually bring data down into memory but when you are ready you can use our Data Stream Out Tooland seamlessly pass this into our analytic tool set or to your output location.
No Browse Tools
In addition to the above point if you choose not to add Browse Tools, Alteryx will only cache around 1 MB of data at each tool in a temp location. This temp location is then cleared when you close Alteryx. Therefore, it is only kept in memory for the duration of the development of the workflow rather than indefinitely. Your default temp location can be found in Workflow Properties.
Changing workflows or Applications
When writing out data you need not write to hard-coded paths - you can reference “%temp% in the file path. This will then write to the default location set in workflow properties outlined above. You can then reference this file location in the next workflow by using a combination of the Directory, Sort, Sample, and the Dynamic Input Tools to read in that file. Alteryx will as default write an Alteryx Engine file for each run. Using the above tools will allow you to dynamically read in the latest file and data. The bonus is that these engine files get cleared out on a scheduled basis so the cached data will not exist in memory over time. The workflow depicting this (attached) was built in Alteryx 10.6.
Output to the database or via one of our connectors which use an API
You can utilize the above method mentioned in ‘No Browse Tools’ however at the end of your workflow you can output directly to a database using one of our connectors via an API. As Alteryx and Tableau work together, often clients will use the Publish to Tableau Server Macro to take data directly from an Alteryx workflow and up to tableau without keeping any data in memory. Again the 1MB of cached data will be removed from the Alteryx engine files and Alteryx pushes the data via a ‘POST’ command directly to the Tableau Server.
Any tool that performs a Sort will increase processing time. Many tools in Alteryx utilize a sorting process as a facet of their inherent functionality or if a group by option is used. The Summarize tool does this, for example. Alteryx has the ability to remember Sorts, or said another way, it knows if data has already been sorted upstream. This is a behavior that you can take advantage of as yet another means of optimizing your module’s processing. In the example shown, the process in Green is more efficient, as sorting will only be performed on the data once. For the process in Red, sorting will have to be performed twice.
Tools that Sort
Tools that use Sorts:
Tools use Temp Files and/or Internal Sorts:
Block Until Done
Multi Row Formula
Distance (Guzzler Only)
Trade Area (Guzzler Only or Non-Overlapping)
Run Command (Temp files only)
Use Temp Files and/or Internal Sorts at the Process level:
Batch Macros (for secondary inputs)
Iterative Macros (for secondary inputs)
Behavior Create Profile
When bringing data into Alteryx a lot of users often add a select tool to check the data type and structure. Data types are very important because of the available operations/functions in tools can be adjusted to fit the data type being used!
You'll notice some of the menus have been rearranged in version 10+. So if you are looking to export a workflow package, you will find that under the Options menu, Export Workflow . By default, you will see all the assets that the workflow depends on to run properly, but sometimes you will want to include additional files to package with the workflow.
Analytic Starter Kits teach you how to prep, blend, and analyze your data faster than using tools like Excel or SQL coding; guide you through how to build sophisticated insights such as market basket analysis or predictive A/B testing; and show you how to output beautiful Tableau , Microsoft Power BI , Qlik or Salesforce Wave visualizations to explore and understand your data and insights.
Copy: Data can be copied directly from a Browse tool, just highlight the cells by clicking and dragging, much like you would in an old-fashioned spreadsheet. In the top right hand corner of the Browse tool’s Properties, click on the “Copy to Clipboard” button, four choices will appear. For this demonstration, we have chosen “Selected Cells with Headers”, the data is now copied to our clipboard.
Paste: To paste the data into a module, try the technique mentioned above. Simply right click on your canvas and click Paste. Bonus Thought: Suppose you are developing a large module, and you’ve employed Tool Containers as a part of your module’s design process. How might you use the above method of copy/pasting data to eliminate the need to Run the completed portions of your module, but still supply data to downstream processes that you are working on? Answer: Place a Browse at the output of a completed Tool Container process; copy the contents of the Browse and paste them onto the Canvas as a Text Input. Disable the Tool Container and use the Text Input as a substitute feed while you complete your module.
Truncated data is usually defined as numeric rounding or cut off, string shortening, or datum deletion - essentially any time information is lost. Since, as analysts, our insights are only as good as our data, we usually find ourselves trying to preserve the integrity of data as we’re processing it. That’s not to say we can’t also optimize our resources usage when our data will allow for it.
While keeping your data types as small as possible is important, and can serve to shorten run times, it is even more important to understand what data types are most accommodating to your data and in what situations they can be shortened without truncation. If you’ve ever seen Office Space, you understand just how important even fractions of a cent are – and accuracy doesn’t just apply to financial data.
To avoid truncation in your data you have to first explore your data types in the context of your analyses. What data types and sizes did you receive the data in and why? What format would we like to see our results in? Are there opportunities to reduce memory usage between those two? And finally - what operations will we perform on the fields in our workflows that may impact each data type and size? The answers to these questions will be unique to each dataset, but once they’re addressed you can use the same techniques to keep your data both optimized and accurate.
Start by identifying the data types that most closely fits your fields based on the questions above. If you want Alteryx’s best guess, try using the Auto Field Tool to assign optimized data types and sizes automatically. While this tool is immensely helpful, be sure to check that the output is not truncating data or leaving it in a form less conducive to your downstream analyses – the Auto Field Tool doesn’t know the answers to your questions above. You can have the best of both worlds by adjusting the assignments from the Auto Field Tool, where necessary, by placing a Select Tool (master it here) just afterwards. You’ll then have suggestions and be able to change the less accurate/accommodating assignments by hand. Some things to consider:
If performing string operations later in your workflow that may increase their length, pay close attention to see if they are being truncated after that maximum string value is reached. String and WString (accepts Unicode characters) types are set length and will drop any characters that exceed their size. On the other hand, V_string and V_WString (accepts Unicode characters) are of variable length, and will adjust to accommodate strings after assignment.
Numerics may seem the most intuitive of the bunch, but pay close, close attention to the precision of each type so as to avoid unintentional rounding. Fixed Decimal is the only type to have an adjustable length – the rest may force your data to fit if not assigned to the correct type.
Dates are not always input in the standard ISO format yyyy-mm-dd HH:MM:SS, and may require some converting in order to handle them as Date/Time types in the Designer (this is important when trying to sort or filter by date or apply Date/Time functions in formulas). Any other date format will be interpreted as a string, and treated as such, unless converted using the DateTimeParse() function, the DateTime Tool, the Parse Dates Macro, or the fan favorite BB Date Macro. If you need to return to your original formatting, or simply prefer another, you can always convert your dates back into another format string after they’ve been processed using the DateTimeFormat() function or the DateTime Tool.
How can I share my workflows if my recipients don’t have the same files they reference? Sending them all over separately and reconfiguring inputs separately seems like a lot of work. Halp!
You’re right, that does sound like a lot of work. Luckily, we have an export feature to help with this exact scenario:
Navigating in your Designer to Options >> Export Workflow will open a menu where you can include assets that are referenced in your workflow and export them into a single .yxzp file to be shared. Feel free to select whichever files you’d like to include – if a file is missing, try attaching an additional asset/file to the tool it’s associated with. At that point it should appear in your export options:
Note: If your workflow uses a database connection, your recipient will need to either 1) have a DSN set up (system preferred) for the connection string used or 2) use the same alias to establish a connection in their environment.
After you select the files to be included, select “Browse” to specify the location and name of the export:
You should be all set! Share this export with your files included.
To open an exported workflow, simply File >> Open Workflow >> Browse in Alteryx or double click the green .yxzp file like any other Alteryx file type:
You should then see a prompt stating that the file is an Alteryx Package; just hit “Yes” to begin the import:
You’ll then be given the option to change the directory that the export is extracted to; below that option you will see the exported files listed with their locations relative to the destination directory:
Select “Import” to proceed:
Once the export has successfully extracted, you will be given a notification that the import process has completed. Select “Yes” and your exported workflow should already be loaded and ready to run!
Where can I find a vailable "Big Data Sets" over the internet?
Big data is data that is usually with sizes beyond the ability of commonly used software tools to manage and process within a tolerable elapsed time. A year-long credit card transaction history or CDR (Call data record) of a telecoms company for the last 9 months, behavioral credit data of a large financial institution are some examples...
Amazon (AWS) has a Large Data Sets Repository
Data.gov has close to 190k public data sets
One of the standard datasets for Hadoop is the Enron email dataset comprising emails between Enron employees during the scandal. It's a great practice dataset for dealing with semi-structured data (file scraping, regexes, parsing, joining, etc.). It's ~400MB (compressed) and available for download at http://www.cs.cmu.edu/~enron/
Collection of audio features and metadata for a million contemporary popular music tracks http://labrosa.ee.columbia.edu/millionsong/ . SecondHandSongs dataset -> cover songs musiXmatch dataset -> lyrics Last.fm dataset -> song-level tags and similarity Taste Profile subset -> user data thisismyjam-to-MSD mapping -> more user data tagtraum genre annotations -> genre labels Top MAGD dataset -> more genre labels You can either download the entire dataset (280 GB) or a subset of 10,000 songs (1.8) for a quick taste.
GDELT set: http://www.gdeltproject.org/data.html
NY City taxi data sets 1.1BN records: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance An open-source exploration of the city's neighborhoods, nightlife, airport traffic, and more, through the lens of publicly available taxi and Uber data
Airline data set 1987-2008: https://github.com/h2oai/h2o-2/wiki/Hacking-Airline-DataSet-with-H2O
Google Big Table - hosted by Google: https://cloud.google.com/bigquery/sample-tables Weather, timeline of actions such as pull requests and comments on GitHub repositories with a nested or flat schema, US births 1969-2008, Shakespeare - number of times each word appears, Wikipedia articles over 300,000,000 million rows.
LENDING CLUB: https://www.lendingclub.com/info/download-data.action
Here is a telecom Italia dataset as a result of a computation over the Call Detail Records (CDRs) generated by the Telecom Italia cellular network over the city of Milano; You may have to sign-in and activate your account but it's totally free... https://dandelion.eu/datagems/SpazioDati/telecom-sms-call-internet-mi/description/
Data Science Centralhttp://www.datasciencecentral.com/profiles/blogs/big-data-sets-available-for-free
KD Nuggets is a well respected analytics blog, they have put together a very nice and deep list: http://www.kdnuggets.com/datasets/index.html
UK Data https://data.gov.uk/data
Google's Public Data Directory: http://www.google.com/publicdata/directory
For the Spatial and GIS folks: http://gisgeography.com/best-free-gis-data-sources-raster-vector/
The mother of big datasets - Reddit. 1.7bn JSON objects; 250GB compressed. https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment
Loads of really great links from here as well: https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public
http://www.imdb.com/interfaces A subset of the IMDb plain text data files is available from their FTP sites as follows: ftp.fu-berlin.de (Germany) ftp.funet.fi (Finland)
One of my favorites are Data.gov where there is tons of public data from all sectors, different size sets and in different formats including API connections. This url, http://www.data.gov/open-gov/ , shows each of the local governments in the US. They have varying degrees of completion on the local level.
The Government of Canada has an Open Data portal -- http://open.canada.ca/en/open-data -- it takes some digging to find the gems, but there are some.
There's also some open mapping data at -- http://open.canada.ca/en/open-maps.
This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open-source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions. https://cloud.google.com/bigquery/public-data/github
15.49TB of research data available. http://academictorrents.com/
Australia, New South Wales Open data http://data.nsw.gov.au/
USAFacts: Our Nation, in numbers. Federal, state, and local data from over 70 government sources.