Engine Works

Under the hood of Alteryx: tips, tricks and how-tos.
clmc9601

I have aggregated tips and resources for learning more about the strategies of generalizability, scalability, and maintainability. There’s a key question that you can use to analyze your everyday workflows through these perspectives. Without further ado, let’s cover scalability.

 

Scalability

 

Also known as: performance, optimization, efficiency, cost

 

Key Question

 

How would this workflow perform with more data?

 

Tips & Considerations

 

Below are several crucial considerations for workflow- and organization-level scalability. Although there are specific tool-based tips later in the article, these will likely make a larger impact on your overall data processing performance.

 

Focus on the delta

 

In mathematics and data, “delta” refers to the change or difference between two datasets. When you source your data, do you really need to input the entire dataset and reload it each time you click run? Processing only the incremental records is key to scalability. (Thanks to @SeanAdams for this tip!) This concept is described in this article written for Designer Cloud.

 

The “incremental records” are any new records that have come in since your last run, any changed records, and evidence of any deleted records. See if you can access an [Updated Date] field in your system data or add one to your data storage process.

 

If you can filter your SQL inputs on an [Updated Date] field, it's easy to limit your inputs to the delta. If you are processing files instead, add a filter early in the process to remove unchanged records.
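For example, a delta-style input query might look something like the sketch below. The table names, field names, and timestamps are just placeholders; adapt them to your own source system and to wherever you store your last successful run time (a control table, a workflow constant, or a config file).

    -- Hypothetical example: pull only records changed since the last successful run.
    SELECT *
    FROM orders
    WHERE updated_date > '2024-01-31 06:00:00';

    -- If you log run times in a control table, you can reference it directly:
    SELECT o.*
    FROM orders o
    WHERE o.updated_date > (SELECT MAX(last_run_time) FROM etl_run_log);

In Designer, a query like this could live in the Input Data tool's SQL editor or be expressed with the In-DB tools.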

 

image001.png

 

Regardless of whether you can identify the delta during or after input, outputting and loading only the delta will make your process significantly more scalable. If you want better delta processing in Designer, please like and comment on this idea by @SeanAdams.

 

Use an automation-friendly format

 

As you have likely observed, some data formats are easier than others to process in Alteryx and other data tools. Many visually appealing layouts are challenging to parse; they may require extra pivoting, removing nulls, and more. If you have influence over the shape of your starting and ending datasets, you can greatly improve the generalizability and scalability of your workflows.

 

Here are a few tips for automation-friendly data inputs and outputs:

 

  • Focus on data in rows rather than in columns. This is the concept of “long” (rows) vs “wide” (columns) datasets, as introduced well here. Both formats have their advantages and use cases.

 

  • Base conditions on values in your data, not on the column names themselves. For example, consider the following tables of the same data presented in two ways:

 

Chart.png

 

In the right table, the day-of-week values are available in a field rather than only in the column names, which makes it the better example of an automation-friendly format.

 

It also ties to the generalizability concept of conditional instead of positional logic. In the left example, your day of the week is only available in the column name, so you have to refer to each column name separately. In the right example, your day of the week is available as a value to use in conditions.

 

  • Remove any unnecessary columns and rows at the beginning of your workflow and throughout.

 

  • Be aware of data types, especially when writing to a database. Spatial data scales poorly.

 

Leverage databases

 

  • Generally, it’s a best practice to store data in databases rather than in large quantities of local files or manual workpapers in Excel.

 

 

  • Make use of InDB tools and/or SQL WHERE statements to load the smallest amount of data needed.
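For example, a minimal "load only what you need" query might look like the following, where the table and column names are placeholders; the same idea applies whether you write the WHERE clause yourself or express it with a Filter In-DB tool.

    -- Select only the columns used downstream and push the row filter into the database.
    SELECT order_id, order_date, amount
    FROM sales
    WHERE region = 'EMEA'
      AND order_date >= '2024-01-01';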

 

Consider scope for scalability

 

From the beginning, think through how you will use this workflow. Will you be using it in Desktop Designer? Will you be scheduling it on Server/Gallery? How much memory and temp storage does your machine or server have, and how many cores? What about the smallest machine this workflow will ever run on? Scaling a workflow for a single low-core machine is different from scaling it for an enterprise server environment. (Thanks to @grossal for this perspective!)

 

These questions should guide how you optimize your workflow, especially the decision between AMP (Engine 2) and the original E1 engine. AMP spreads work across the cores available on your machine; that said, if you have a large number of cores, AMP will not make efficient use of all of them. On Server, it usually makes more sense to run workflows without AMP due to the large number of cores available. Some features are AMP-only (Control Containers, SDK macros, etc.), so be very careful about using them in a performance-intensive workflow you intend to run on Server.

 

Optimize expensive tools

 

Performance Profiling is a useful feature in Alteryx Designer that shows you the percentage of your total runtime used by each tool in your workflow. “Performance-intensive” or “expensive” tools are the tools that hog your runtime, especially as your number of rows increases.

 

The best ways to increase scalability with these tools are to consolidate them wherever possible and to process the least amount of data through each. Do you really need six joins, or can you accomplish the same task with three joins simply by reordering them? Do you really need to append those fields to the entire dataset, or can you append them right at the very end after you have reduced your dataset to 10% of its original size? When your dataset is large, each additional tool you add will increase the runtime.

 

These tools are known to be performance-intensive, so use them with care:

  • Join
  • Append Fields
  • Batch and iterative macros
  • All spatial tools and functions
  • RegEx tool and regex functions
  • CASS/geocode

 

The following section will give more specific guidance for optimizing tools that scale poorly.

 

Helpful Tools

 

Some tools in Alteryx Designer are notorious for scaling poorly. Below are my favorite alternatives and workarounds for them.

 

 

  • Browse → Filter or Sample extracts for exploration only

The Browse tool should be used only for data exploration, in my opinion. Did you know that every time you use a Browse tool in Designer Desktop, you are actually writing a .yxdb file to the temp drive? This is immaterial at a small scale, but once your data is larger than 100 MB or so, Browse tools start to take a performance toll on your workflow and hard drive. Unless you genuinely need to see every cell of your data, I would recommend taking a value-based (Filter tool) or size-based (Sample tool) extract of your data and writing it to a file with an Output tool.

 

image002.png

 

Once this workflow is finalized and moves to production, all of these exploratory tools should be deleted.

 

  • Crosstab & Transpose → as few rows/columns as possible, or restructure

Crosstab and Transpose are powerful for pivoting data, but they are known to scale poorly. I personally prefer to restructure my workflow to avoid using these tools, especially Crosstab, unless absolutely necessary.

 

In order to improve scalability, I recommend processing as few rows and columns as possible through these tools. For rows, this can mean adding a Summarize tool beforehand to consolidate rows into the fewest possible instances. For columns, this can mean deselecting all unnecessary columns in both the Key Columns and Data Columns sections.

 

image003.png

 

  • Find Replace → Join and Formula tools for larger data

The Find Replace tool will produce a warning if you attempt to process more than 100,000 rows; the warning encourages you to use a join instead for larger data. Your replacement tools will depend on the configuration of your original Find Replace tool.

 

○ If you were appending fields based on an exact, entire string value, you can use a Join (recommended) or Append Fields.

 

○ If your original match was case-insensitive, add a Formula tool before both Join anchors to create a separate join field with standardized case using the Uppercase() or Lowercase() functions (the Join tool matches strings case-sensitively).

 

○ If it was based on part of a string, try joining on a shared join key and then adding a Filter or Formula tool (for example, with Contains()) to evaluate the partial string match.

 

○ If you were replacing instead of appending, add a Formula with Replace() or Regex_Replace() to do the replacement.
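To make those cases concrete, here are a few expression sketches using placeholder field names. The first belongs in a Formula tool before each Join anchor, the second in a Filter tool after the join, and the third in a Formula tool when you're replacing rather than appending:

    // Standardize case on both sides so the Join behaves like a case-insensitive match
    Uppercase([JoinKey])

    // Evaluate a partial string match after joining on a broader key
    Contains([Description], [SearchTerm])

    // Replace text directly instead of appending fields
    Replace([Comments], [FindText], [ReplaceText])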

 

  • Auto Field → Select tool with intentional data types

The Auto Field tool should be used exclusively for data exploration, in my opinion. It is notoriously slow because it analyzes every cell of every column to find the smallest possible data type. I would never recommend this tool for production workflows, and I personally don't use it at all.

 

In your production workflows, you likely should take more intentional control of your data types. I would recommend using a Select tool to set your data types, especially upon output. If you need a more dynamic approach, you could consider Multi-Field Formula or even various workarounds with CReW Dynamic Metadata to take metadata from a second input.

 

It's also important to consider how your data types might change over time. For example, say you have a field that currently contains 5-digit US zip codes. What if, in the future, you need the greater granularity of a 9-digit ZIP+4 code? What if you expand to Canadian or other international postal codes? Carefully consider the sizes of your columns and whether it makes sense to allow extra length or character width now so you don't have to modify them manually in the future.

 

 

  • Data Cleansing → CReW Cleanse or Multi-Field Formula

The Data Cleansing macro, especially in older versions of Alteryx, runs inefficiently: data is processed through tools even when the corresponding settings are disabled. The CReW Cleanse macro is a highly optimized version of this tool with added functionality, like parsing HTML tags. Comparing these two tools on increasingly large datasets is a great lesson in building scalable workflows. The CReW macro, like most macros, is set up to process data through the minimum number of tools required while balancing maintainability and generalizability.

 

If you only need a single feature from the Data Cleansing tool, an even lighter, more scalable approach is to apply just the one or two features you need with a Multi-Field Formula tool. Did you know there are formula functions for changing casing, removing certain characters (with or without regex), and more? The Multi-Field Formula tool also has a two-for-one benefit: it can apply an expression and update the data types of the selected columns at the same time.
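As a small sketch of that lighter approach, a Multi-Field Formula applied to your selected string fields might use an expression like one of these ([_CurrentField_] is the tool's placeholder for each selected column):

    // Trim surrounding whitespace and standardize casing in one pass
    TitleCase(Trim([_CurrentField_]))

    // Or strip a specific unwanted character without any regex
    Replace(Trim([_CurrentField_]), '"', '')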

 

image004.png

 

  • Batch and Iterative Macros → as few batches/iterations and small incoming data as possible, or restructure

Batch and iterative macros can only process one batch or iteration at a time, even with the AMP Engine. The implication is that regardless of your data size, the batches or iterations run sequentially rather than in parallel. As you can imagine, this makes these macros very slow with large sets of data.

 

If you have a particularly large or growing dataset, my best recommendation is to avoid using batch and iterative macros in the large-data sections of the workflow. See if you can restructure the workflow so the same calculations happen without a macro. Note that you'll have to weigh the importance of generalizability in this scenario, because batch macros frequently allow for significant generalizability.

 

If you do need a batch or iterative macro, I recommend optimizing for the smallest number of batches and the smallest number of records processed through each batch. Try using a Summarize tool to pass only unique rows into the macro; then you can join the results back to your main dataset. Experiment to see what is most performant for your workflow and your machine. There's no single right answer, and that's part of the fun of implementing workflow strategy!

 

  • Large appends and cartesian joins → put proper safeguards in place

The Append Fields tool has a control in place for a reason: its default configuration protects against unintentionally processing appends of more than 16 records from the Source anchor. Appends of this size can massively slow down your workflow. While stopping these cross multiplications when they arise is helpful, I personally prefer to prevent situations like this from occurring in the first place.

 

image005.png

 

My first recommendation is to be deliberate about the structure of your workflow. Do you truly need to append 16+ rows? If so, make sure the larger dataset is in the Target anchor, keep controls in place on your Source anchor, and limit the Source to the fewest records you truly need to append. Keep in mind that an append creates a pure cross-multiplication of your records.

 

With the regular Join tool, there is no such safeguard against cartesian joins (cross joins). Be very careful about what data is contained in your join keys, and deduplicate your join keys before processing data through the join; this is a strategic move to prevent cross joins. For example, if a key value appears 10 times in each input, the join returns 100 rows for that key alone. If you would prefer a configuration that errors on cross multiplication, the Join Multiple tool has an option for this exact purpose. This is especially important for joins of more than two inputs due to the compounding nature of cross joins; in other words, it's much easier for your data to explode when it has multiple inputs.

 

image006.png

 

 

  • RegEx → string functions where possible

Regex is very powerful, but with great power tends to come a higher performance cost. This applies to both the RegEx tool and regex functions in any expression editor. Using string functions instead of regex will greatly reduce your workflow's runtime when you are working with large volumes of data.

 

If you're a casual regex user doing parsing and replacement, you can likely accomplish the same task using a creative combination of string functions. Some of my new favorite string functions come from versions 23.2 and 24.1! Check out GetPart() and GetWord() especially. If you're parsing file paths, I'd highly recommend leveraging the File functions.
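For example, assuming you're on a version that includes these functions and using made-up field names, parsing without regex might look like:

    // Grab the second word of a free-text field (GetWord is 0-indexed)
    GetWord([ProductName], 1)

    // Pull the file name and extension out of a full path with File functions
    FileGetFileName([FullPath])
    FileGetExt([FullPath])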

 

For tokenize functionality, I usually add a single-character delimiter like a pipe “|”, then use the Text to Columns tool to split on that delimiter. It’s helpful for me to think about tokenize and Text to Columns as inverses. Tokenize keeps only the pattern specified, and Text to Columns splits on the individual character(s) specified. By surrounding my desired tokenize pattern with delimiters, I can use Text to Columns to accomplish the same goal and provide more visibility into where the splits will happen.
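As a quick sketch of that pattern with a made-up field: insert the pipe wherever you want the splits with a Formula tool, then point Text to Columns at the | delimiter.

    // Turn a two-character separator into a single-character delimiter for Text to Columns
    Replace([NameList], '; ', '|')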

 

Keep in mind that splitting to rows can be performance-intensive as well, especially if your pre-split dataset is large and has lots of columns.

 

If you must use regex, I have one final note: in my opinion, the best regex is simple regex, for both maintainability and generalizability reasons. My most commonly used regex expression is regex_replace([Field], '\D', ''), which removes all non-digit characters. I generally find it easier to write dynamic regex with expressions that remove characters (replacing matched patterns with empty strings) than with more complex substitutions or capture groups. The simpler the pattern, the better.
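In that same spirit, here are a couple of other simple removal patterns I'd consider safe starting points (field names are placeholders):

    // Keep only digits (the example above)
    REGEX_Replace([Phone], '\D', '')

    // Collapse any run of whitespace down to a single space
    REGEX_Replace([Address], '\s+', ' ')

    // Strip everything except letters and numbers
    REGEX_Replace([Code], '[^A-Za-z0-9]', '')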

 

Conclusion

 

Incorporating the principle of scalability into your workflow will keep it running smoothly and quickly even as your data grows. Optimizing and saving processing time can literally mean saving your company money.

 

You can likely even calculate the ROI on the time you spend learning about and implementing scalability! For example, you could combine the cost per workflow-run minute on Server (or the depreciation per processing minute on Desktop), the cloud compute cost of each query, the number of runtime minutes saved by your optimization, and any other costs you consider relevant. If you end up calculating this ROI, I would love to hear about it!

 

Determining when to invest in scalability is a judgment call; it is possible to spend too much time optimizing. I recommend focusing your efforts on the workflows that are most important, run most frequently, take the longest to run, and/or are the most expensive to run (perhaps measured by cloud computing costs).

 

Also, check out the workflow strategies of generalizability (coming soon) and maintainability (coming soon). These strategies are best utilized together! To return to the larger article containing practical tips to advance from user to superuser, click here. 

 
