
Engine Works

Under the hood of Alteryx: tips, tricks and how-tos.
DavidHa
Alteryx

 

One of the biggest challenges I see when working with Alteryx Server customers is helping them understand how many jobs can run at a time (the # of Simultaneous Workflows) and what number makes sense for their environment. This setting is documented as "Workflows allowed to run simultaneously."

  • The default value is 1, meaning only 1 job can run at a time on that specific Worker.
  • The general recommendation has always been (# of Physical Cores) / 2. 
  • So for a standard 4-core Server, the recommended value would be 2. 

 

The recommendation above is a starting point. There are more details about this setting, with additional recommendations, in the Worker System Settings Deep Dive article. What we often find is that the value is set too low, so jobs queue while resources are left on the table unused. Alternatively, the value is set too high, causing the system to become overloaded and even causing jobs to fail.

 

The following article is NOT a performance benchmarking paper. These are observations meant to show you the implications of this setting and to encourage you to perform your own testing and research with relevant workflows to understand the optimal # of Simultaneous Workflows for your environment.

 

 

The Environment

 

My laptop didn't make for the most realistic test environment, so I went to AWS and created two EC2 instances, each with 4 cores (8 vCPUs) and 16 GB of RAM. I then installed Alteryx Server 2019.4.4 and configured them per the diagram below, with one machine serving as the Controller & Gallery, and the other machine serving as a dedicated Worker.  This allows us to configure the dedicated Worker machine to allocate all resources to running jobs.

 

ENVIRONMENT.PNG



A quick word of caution when working with AWS: an EC2 instance listed as 4 vCPUs (such as the m4.xlarge) typically has only 2 physical cores. Alteryx has a 4-core minimum requirement, so I went with the c5.2xlarge. Information on AWS physical cores can be found here.

 

We can see from this output that the c5.2xlarge has 4 Cores with 8 "Logical Processors" or threads.

 

CORES.png
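If you'd like to check this on your own machine, here's a minimal sketch that queries the same information via WMIC. The exact command behind the screenshot isn't reproduced here, and this assumes WMIC is available on your Windows build.

```python
import subprocess

# Query physical cores vs. logical processors (threads) on Windows.
# The screenshot above shows this kind of output, though not
# necessarily from this exact command.
out = subprocess.check_output(
    ["wmic", "cpu", "get", "NumberOfCores,NumberOfLogicalProcessors"],
    text=True,
)
print(out)  # on a c5.2xlarge: NumberOfCores=4, NumberOfLogicalProcessors=8
```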



No tuning was performed. All Alteryx System Settings were kept at their defaults, with the exception of the Logging Level, which was set to Normal instead of High. The Worker setting "Workflows allowed to run simultaneously" and the Engine setting "Default Sort/Join Memory Usage" were modified for each test, as described in the test section. 

 

 

 

The Workflows

 

As mentioned in the introduction, this type of analysis only works if relevant workflows are used. So as a simple test, I used three different workflows to simulate various workflow patterns: Prep & Blend, Spatial, and Predictive.

 

 

Workflow #1 - Prep & Blend

 

The Prep & Blend workflow is a familiar one that joins two data sets then sorts and summarizes the output. 

PrepBlend_Workflow.PNG



 

This type of workflow is of particular interest because the Join, Summarize, and Sort tools must read in all of their data before processing can complete, meaning these workflows can consume large amounts of memory, and potentially generate a lot of disk I/O to the Engine Temp directory (swapping) if the memory needed exceeds the Sort/Join memory setting. How much memory is needed can be roughly determined by running the workflow in Designer and observing the largest value displayed:

 

Designer can give you a pretty good estimate of the max memory consumption a workflow could use.

 

 

 

 

Workflow #2 - Spatial

 

The Spatial workflow uses some of the Spatial tools, which can be CPU-intensive. 

Spatial_Workflow.PNG

 

 

 

Workflow #3 - Predictive

 

The Predictive workflow uses the R-based Predictive tools to build two models (Logistic Regression and Boosted), then uses the Model Comparison tool to determine the champion model.

Predictive_Workflow.PNG

 


The R-based Predictive tools are an interesting case since they launch additional processes outside of the Alteryx Engine process. These additional processes can consume extra CPU resources and Memory beyond any limits applied to the Engine via the Sort/Join Memory or Number of Threads settings. 

 

Here's an example to illustrate this, where I had 3 Predictive workflows all running concurrently. Each has a corresponding Rterm and Rscript process. The Rscript processes are consuming 36% of the CPU and 4.4 GB of Memory. 

 

R processes.png
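If you want to spot these processes yourself, here's a small sketch that lists the Rterm/Rscript processes and their memory footprint, similar to the Task Manager view above. It uses the third-party psutil package, which is an assumption on my part, not anything Alteryx ships.

```python
import psutil  # third-party: pip install psutil

# List the R processes the Predictive tools spawn alongside the Engine,
# with their resident memory, mirroring the screenshot above.
for proc in psutil.process_iter(["name", "memory_info"]):
    name = (proc.info["name"] or "").lower()
    if name.startswith(("rterm", "rscript")):
        rss_mb = proc.info["memory_info"].rss / 2**20
        print(f"{proc.info['name']}: {rss_mb:,.0f} MB")
```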

 

 

The Test

 

The test was to observe the Average Workflow Execution Time and the Time to Complete All 60 Workflows as the number of Simultaneous Workflows increased. Why 60? Two reasons:

  1. It's divisible by 1, 2, 3, 4, 5, and 6. This means that if Simultaneous Workflows (N) is 3, there are exactly 3 workflows running concurrently for the entire duration of the test, with no remainder of fewer than N jobs at the end. This matters because it keeps the results comparable across Simultaneous Workflows values from 1 to 6. 

  2. 60 jobs provide enough results to achieve a consistent, stable, repeatable workflow execution time and minimize variability.

 

To get 60 workflow executions with equal weighting across the Prep & Blend, Spatial, and Predictive workflow types, I queued each one in turn, then looped for a total of 20 iterations. All jobs were added to the queue at once with automation (a sketch of that automation follows the queue listing below). So the queue looked like this...

 

Predictive - job 20

Spatial - job 20

PrepBlend - job 20

...

Predictive - job 2

Spatial - job 2

PrepBlend - job 2

Predictive - job 1

Spatial - job 1

PrepBlend - job 1
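The original automation isn't shown, but here's a minimal sketch of how the queueing could be scripted against the Gallery API (v1), which signs requests with OAuth 1.0a. The Gallery URL, API key/secret, and App IDs below are placeholders you'd replace with your own.

```python
import requests
from requests_oauthlib import OAuth1  # Gallery API v1 signs with OAuth 1.0a

# Placeholder values -- substitute your Gallery URL, API key/secret,
# and the App IDs of the three test workflows.
GALLERY = "http://my-gallery/api/v1"
auth = OAuth1("MY_API_KEY", "MY_API_SECRET")
app_ids = ["PREPBLEND_ID", "SPATIAL_ID", "PREDICTIVE_ID"]

# Queue all 60 jobs up front: 20 iterations of the 3 workflow types.
for i in range(1, 21):
    for app_id in app_ids:
        resp = requests.post(
            f"{GALLERY}/workflows/{app_id}/jobs/",
            json={"questions": []},  # no Analytic App inputs needed
            auth=auth,
        )
        resp.raise_for_status()
        print(f"iteration {i}: queued {app_id} -> job {resp.json().get('id')}")
```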

 

 

At each Simultaneous Workflow count, I configured the Engine Sort/Join Memory setting based on the following recommended equation for a dedicated Worker, which is covered extensively in the Engine System Settings Deep Dive article.

 

formula1.PNG



The Total amount of RAM can be found via this Windows command:

 

MEMORY.png
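As a scriptable alternative to the screenshot (whose exact command isn't reproduced here), WMIC can report the same value:

```python
import subprocess

# Total physical memory in bytes via WMIC, converted to MB.
out = subprocess.check_output(
    ["wmic", "ComputerSystem", "get", "TotalPhysicalMemory"], text=True
)
total_mb = int(out.split()[-1]) // (1024 * 1024)
print(f"Total RAM: {total_mb:,} MB")  # ~16 GB on the c5.2xlarge
```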

 

So for the c5.2xlarge, the recommended Sort/Join Memory setting values would be:

 
Simultaneous Workflows |      1 |     2 |     3 |     4 |     5 |     6
Sort/Join Memory (MB)  | 11,704 | 5,852 | 3,901 | 2,926 | 2,341 | 1,951
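To make the scaling explicit: the full equation lives in the deep dive article (and the image above), but the table follows a simple pattern, with the single-workflow allocation (about 11.7 GB here) split evenly across the simultaneous workflows. A quick sketch, assuming that pattern:

```python
# The N=1 allocation from the table above; the full equation is in the
# Engine System Settings Deep Dive article.
SINGLE_WORKFLOW_MB = 11_704

# Each job's Sort/Join Memory is the single-workflow allocation split
# evenly across the N simultaneous workflows.
for n in range(1, 7):
    print(f"N={n}: Sort/Join Memory = {round(SINGLE_WORKFLOW_MB / n):,} MB")
```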

 

 

 

 

The Results

 

All results have been normalized to the Simultaneous Workflow = 1 result.

 

Exec Time.png

 

Total Time.png

 



Observations

 

  • We can see that when I increased the number of Simultaneous Workflows to 2, the Average Workflow Execution Time increased by almost 2%. However, the Time to Complete All 60 Workflows was cut almost in half, since two jobs were running in parallel instead of one. I'd call this a win.
  • When I increased Simultaneous Workflows to 3, the Average Workflow Execution Time increased considerably. This is mostly because the Prep & Blend workflow, which needs almost 10 GB of memory to run efficiently, only has access to 3.9 GB. Even with that increase in execution time, the Time to Complete All Workflows dropped another 12% thanks to the parallelization of a 3rd running job. Still winning. 
  • At 4 Simultaneous Workflows, the Average Workflow Execution Time continues to rise, but there is no longer a reduction in the Time to Complete All Workflows; in fact, it increases slightly. At this point, we've hit "the wall".
  • The increase continues at 5 Simultaneous Workflows, which is our clue to stop now. But press on we must...
  • At 6 Simultaneous Workflows, both the Average Workflow Execution Time and the Time to Complete All Workflows increase drastically. This is partly because two of each workflow type are running, which means two sets of additional R processes supporting the Predictive jobs. At this point, the machine is completely saturated: there's more work to do than threads available to run it, and not enough memory to support the memory-intensive Prep & Blend and Predictive workflows. 

 

The results show that for THIS environment, and THESE workflows, 3 Simultaneous Workflows was the most efficient for overall throughput. Going beyond that produces diminishing returns, overloads the system's resources, and makes individual job execution times longer than necessary.

 

 

Conclusion

 

What is clear from these results is that increasing the # of Simultaneous Workflows from the default value of 1 MAY increase the total number of jobs a Server can execute in a period of time (throughput). However, it will likely come at the cost of longer individual workflow execution times compared to running one job at a time. That is a trade-off that must be understood and evaluated. Setting the value too high can reduce overall throughput while increasing individual job execution times to the point that jobs run much longer than necessary.

 

The results show that for this environment and workload, the default recommendation of # of Simultaneous Workflows = (# of Physical Cores) / 2 is a great starting point. For a dedicated Worker like this, perhaps ((# of Physical Cores) / 2) + 1 is reasonable as well.  

 

The important takeaway is that each organization's environment, workflows, and data sizes will vary, and that conducting your own research, evaluations, and analysis will lead you to the configuration that works best for you!

 

 

Additional Reading

Worker System Settings Deep Dive

Engine System Settings Deep Dive

David Hare
Senior Manager, Solutions Architecture

David has the privilege to lead the Alteryx Solutions Architecture team helping customers understand the Alteryx platform, how it integrates with their existing IT infrastructure and technology stack, and how Alteryx can provide high performance and advanced analytics. He's passionate about learning new technologies and recognizing how they can be leveraged to solve organizations' business problems.
