Engine Works

stevea · 12-07-2015

Motivation

A question often asked by clients evaluating Server is how "big" of a machine they need to correctly host a Private Server instance. Since we’re software developers, our answer (of course) is “it depends,” and we provide guidelines for sizing the Server based on the expected number of concurrent users, the job run frequency, and average job size. These general guidelines for Server sizing are great for establishing an instance, but, as with shoes, optimally configuring a Server for maximum throughput is not a one-size-fits-all problem.

As load increases due to more demanding workflows and/or more users, Server customers will often return to us asking for advice about how to scale the instance to handle increased load. Once again, we tend to respond “it depends” (anyone notice a pattern here?), and provide some guidelines based on machine capacity, expected load, and how Alteryx itself (the “Engine”) consumes computer resources. The Engine can be a resource-hungry beast, and feeding it correctly is critical to optimizing Server performance.

It's also important to realize that the workflows themselves introduce significant uncertainty into this sizing puzzle. A workflow that runs quickly with one set of input data or parameters may run slowly with different data or different parameters.

Thus, scaling a Server correctly involves measuring performance using realistic workflows, and striking a balance between desired throughput, available hardware, and the Engine’s need for system resources.

Goals

My primary goal with this post is to demonstrate conceptually how tuning one aspect of the Server, the Queue Workers, may have a significant impact on total throughput, as measured by the total number of successfully completed jobs over some length of time. Each Queue Worker represents a single instance of the Engine; therefore, the number of Queue Workers on a machine is the total number of Engine instances that can run simultaneously. This single configuration option in the System Settings dialog (Worker Configuration) can make or break a Server instance.

Scaling a Server generally involves increasing the total Queue Worker (Engine) capacity, and there are two ways to achieve this goal:

Scaling "up" by increasing the number of logical Queue Workers
Scaling "out" by adding one or more physical Queue Workers

We’ll look at both options, using metrics to draw general conclusions about each one. It’s important to note that we’re most concerned with observing trends, so we won’t dwell too much on the details of each test run.

Caveat

Before continuing, it’s important to recognize that this post is intended as a demonstration of how various configurations might impact Server performance (positively or negatively). Clearly, your mileage will vary, but the general themes presented here can be used as guidelines for further exploration and tuning of your configuration.

That’s all just developer-speak for “it depends,” so let’s move on and get to the interesting bits.

Measuring Performance

A standard, 64-bit Alteryx 10.1 installation running on commodity Dell hardware generated the performance results presented here. The tuning workflow itself queues two test workflows on a Private Server, monitors execution progress, and collects results. Each workflow is queued five times on the Server, thus creating an initial work queue with a depth of 10 jobs. Multiple test runs confirm that the results are reproducible albeit with minor variance (but again, please remember that the Prime Directive of this post is conceptualizing the numbers).

The test Alteryx workflows themselves are intended to run for approximately 4-5 minutes each on the primary test machine. The first workflow load_join_calgary_out.yxmd is designed to stress I/O and memory, with a large join feeding Calgary loaders that generate multi-gigabyte output datasets. The second workflow Waterfalls.yxmd is designed to be CPU-intensive, loading small input datasets but making extensive use of spatial tools.
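To make the test setup a bit more concrete, here is a minimal Python sketch of the initial work queue and of how long it would take to drain if the workers never competed with each other. This is not the tuning workflow itself (that is an Alteryx workflow); the function names, the interleaved queue order, and the flat 5-minute job length are assumptions for illustration only.

```python
import heapq

# Workflow names come from the post; everything else here is hypothetical.
WORKFLOWS = ["load_join_calgary_out.yxmd", "Waterfalls.yxmd"]
REPEATS = 5


def build_queue(workflows, repeats):
    """Queue each workflow `repeats` times (interleaved order is an assumption)."""
    return [wf for _ in range(repeats) for wf in workflows]


def ideal_makespan(job_minutes, workers):
    """Wall-clock minutes to drain the queue if workers never contend for resources."""
    free_at = [0.0] * workers          # time at which each worker becomes idle
    heapq.heapify(free_at)
    for minutes in job_minutes:
        start = heapq.heappop(free_at)  # next job goes to the earliest-free worker
        heapq.heappush(free_at, start + minutes)
    return max(free_at)


queue = build_queue(WORKFLOWS, REPEATS)
job_minutes = [5.0] * len(queue)        # assumed: every job takes 5 minutes
for n in (1, 2):
    print(n, "worker(s):", ideal_makespan(job_minutes, n), "minutes")
```

The "ideal" in the name matters: this is the no-contention baseline, and the tests below are really about how far a real machine drifts from it as workers are added.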

For reference, example PDF output from two of the seven test cases presented in the post is attached. The Alteryx workflow used to drive the performance tests is a work in progress, and the initial release is also attached.

Results

For those who don’t like reading a lot of text, here are key take-home results in a single chart. Each test run is labeled with the test number, the number of Queue Workers, and the disk type used for the Engine temporary directory (spinning or SSD):

The red represents the overall workflow throughput, measured as (workflows / runtime), and the blue represents the Engine throughput, measured as (workflows / engine_runtime). For both metrics, a larger value represents increased throughput and is therefore "better." Our ideal Server setup is one in which workflow throughput and Engine throughput are both maximized, meaning we are processing the most workflows per minute with the least amount of Engine time. On a Server, the latter is critical, because the longer the Engine spends processing a particular workflow, the longer another workflow sits in the queue waiting to be processed.
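As a rough sketch, the two metrics can be computed as follows. I am assuming here that engine_runtime is the sum of the individual Engine run times across all jobs (which is consistent with tests 1 and 2 below, but is an interpretation rather than a definition taken from the harness), and the function names and sample numbers are made up for illustration.

```python
def workflow_throughput(jobs_completed: int, wall_clock_min: float) -> float:
    """Jobs per minute of elapsed (wall-clock) time for the whole test run."""
    return jobs_completed / wall_clock_min


def engine_throughput(engine_min_per_job: list[float]) -> float:
    """Jobs per minute of Engine time, assuming engine_runtime is the summed per-job time."""
    return len(engine_min_per_job) / sum(engine_min_per_job)


# Illustrative numbers only (not measurements): ten jobs, five Engine-minutes each.
engine_times = [5.0] * 10

# One worker: jobs run back to back, so wall-clock time equals total Engine time.
one_worker = (workflow_throughput(10, 50.0), engine_throughput(engine_times))

# Two workers with no contention: wall-clock time halves, Engine time per job does not change.
two_workers = (workflow_throughput(10, 25.0), engine_throughput(engine_times))

print("1 worker :", one_worker)
print("2 workers:", two_workers)
```

Under that assumption, workflow throughput tells you how quickly the queue drains, while Engine throughput tells you how expensive each job has become; contention shows up as the second number falling.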

Each test case is expanded upon below.

Test 1: Server with sub-optimal Queue Worker settings

The base-case machine has roughly the same specifications as the Alteryx recommendation, with a four-core (8 logical cores) Intel i7 at 3.2GHz, 32GB RAM, two SSDs and one spinning disk. Alteryx is installed on an SSD, and the Alteryx Core Data Bundle is installed on the spinning disk.

For this test, however, we’re ignoring the Alteryx-recommended configuration of two Queue Workers (number_of_physical_cores / 2), and instead using a single Queue Worker. With this configuration, the baseline throughput is approximately 0.21 for both the Engine and the workflow throughput, meaning we can process about 0.21 workflows/minute:

This test case also demonstrates the minimal overhead of the test harness itself, as the total workflow processing time is roughly the same as the Engine processing time. The equivalent Engine and workflow throughput values also indicate that the machine is not taxed, and this setup represents our base "unloaded" case, where the Engine has full access to the machine's resources, with no sign of resource contention.

This machine can be pushed further, so let's do so.

Test 2: Server with recommended Queue Worker settings

For this test, let’s increase the number of Queue Workers from one to two, matching the starting configuration recommended by Alteryx (number_of_physical_cores / 2). The workflow throughput scales roughly linearly from the first test case, with a ~2x performance improvement, while the Engine throughput is roughly the same:

These results suggest that the two Engine processes are competing minimally with each other for system resources and the machine is well within its physical limits. The Engine throughput is roughly equivalent to the throughput measured in the first test case, where the Queue Worker had full access to the machine's resources, and the workflow throughput doubled with the addition of the extra logical Queue Worker, so this is a maximal setup.

The results also indicate that the suggested setup of (number_of_physical_cores / 2) for the Queue Worker count is a good place to start when setting up a Server.
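The starting-point rule is simple enough to write down as a one-liner. The sketch below is only a restatement of (number_of_physical_cores / 2); the function name, the integer rounding, and the floor of one worker are my assumptions, not anything the System Settings dialog does for you.

```python
def recommended_queue_workers(physical_cores: int) -> int:
    """Starting-point Queue Worker count: half the physical cores.

    Rounding down and never returning less than one worker are
    assumptions added for this sketch.
    """
    if physical_cores < 1:
        raise ValueError("physical_cores must be at least 1")
    return max(1, physical_cores // 2)


# The base-case test machine has four physical cores (eight logical).
print(recommended_queue_workers(4))
```

Note that the rule counts physical cores. Many tools report logical cores instead (Python's os.cpu_count() does, for example), so on a hyperthreaded machine the reported number may be double the one you want to feed into this rule.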

Test 3: Add more workers!

At first glance, a machine with this much horsepower “should” be able to handle more load (right?), so let’s throw caution to the wind and run with four Queue Workers. This means one Alteryx Engine process per physical core.

As you can see, scaling “up” with twice the number of logical Queue Workers did decrease the total runtime slightly compared with the two-worker case. But we did so by increasing our server load significantly, reducing Engine throughput by 70%! The smaller, more CPU-bound test workflow ran at approximately the same speed as in the other tests, but the larger I/O-bound workflow ran more than 2x slower on average due to resource contention.
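One way to picture this kind of result is with a toy contention model. To be clear, this is not how the Engine schedules work and the numbers are not measurements; it is a hypothetical sketch in which the CPU portion of each job is unaffected by other workers, while the disk portion is stretched once more workers are active than the disk can serve at full speed. All names and parameter values (cpu_min, io_min, disk_slots) are invented for illustration.

```python
import math


def toy_run(jobs: int, workers: int, cpu_min: float, io_min: float,
            disk_slots: int = 2) -> tuple[float, float]:
    """Return (workflow_throughput, engine_throughput) in jobs per minute.

    Hypothetical model: disk time per job grows in proportion to how far
    the worker count exceeds `disk_slots`; CPU time per job stays fixed.
    """
    stretch = max(1.0, workers / disk_slots)
    engine_min_per_job = cpu_min + io_min * stretch
    batches = math.ceil(jobs / workers)          # jobs run in waves of `workers`
    wall_clock_min = batches * engine_min_per_job
    workflow_throughput = jobs / wall_clock_min
    engine_throughput = 1.0 / engine_min_per_job
    return workflow_throughput, engine_throughput


for n in (1, 2, 4, 6):
    wf, eng = toy_run(jobs=10, workers=n, cpu_min=2.0, io_min=2.5)
    print(f"{n} workers: workflow {wf:.2f}/min, engine {eng:.2f}/min")
```

With these made-up parameters, going from one worker to two doubles workflow throughput at no cost in Engine throughput, while going beyond two buys very little extra workflow throughput and makes every job more expensive. The shape of that curve, not the specific values, is the point.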

Test 4: Add more workers!

To drive home the point that scaling “up” may not necessarily be a great idea, let’s increase the number of Queue Workers again from four to six.

The total runtime is slightly slower than in the two-worker scenario, but the Engine time is now 2.5x slower. So, at the expense of tying up our Server and using more resources, we gained... nothing.

The resource contention between Engine processes is shown clearly in the Task Manager. There are seven AlteryxEngineCmd.exe processes, one of which (PID 9344) represents the monitoring workflow, and the other six of which represent the Queue Workers processing test workflows. Instead of working at full capacity, the six Queue Workers are starved for CPU:

On the plus side, we did consume more electricity during this test, and I’m grateful for the warmer office.

Test 5: Scale "out" by adding a new physical Queue Worker

Let’s scale this instance “out” by adding a dedicated Queue Worker machine, a four-core Intel i7 at 2.8GHz, 16GB RAM, with two drives (one SSD, one spinning disk). Although it is significantly less capable than our base-case test machine, it still roughly fits our suggested minimum server specifications.

Alteryx is installed on the SSD, and the Alteryx Core Data Bundle is installed on the spinning disk. It is configured with 8GB sort/join memory, spinning disk for the Engine temp files, and two Queue Workers (as recommended, number_of_physical_cores / 2).

Even with the slow Queue Worker machine, this setup is ~25% faster than our single-machine, 2-worker case, which is a nice improvement:

With a dedicated worker machine of equivalent specification to the base-case machine, I would expect a roughly linear speedup for each new Queue Worker.

Test 6: Upgrade hardware

Each of the test cases presented thus far uses a spinning disk for the Engine temporary file directory. Although the drives themselves are reasonably high-performance units, they still abide by the laws of physics, requiring physical movement to spin the drive and move the drive heads. With a disk-intensive workflow such as the Calgary example, a spinning disk may become a significant performance bottleneck, as each Engine process fights for the same slow resource for temporary files as well as Input and Output tools. This type of performance bottleneck is known as resource starvation, and manifests itself as high disk demand and low CPU throughput:

Ouch, it hurts to see all that wasted CPU capacity

A relatively cheap computer upgrade is a Solid State Drive (SSD), which replaces moving parts with electronics, virtually eliminating drive latency and dramatically improving data throughput (see, for example, this Samsung whitepaper).

If we repeat a few of our tests with the Engine temporary space on the SSD, the performance improvement is somewhat remarkable, as the SSD speeds up all workflows and allows multiple disk-intensive workflows to run concurrently at nearly full speed. Tests (6) and (7) in this chart represent the 2 Queue Worker (test 2) and 4 Queue Worker (test 3) scenarios, respectively:

The take-home point here is to consider using SSDs on your Server if your workflows tend to be disk-intensive.

Discussion

Using these concrete test cases, we can draw some general conclusions about scaling a Server. The results demonstrate that scaling a Server “up” by adding extra workers to an existing node may actually make the system slower as measured by total throughput. On the other hand, scaling a Server “out” by adding one or more dedicated Queue Workers makes the system faster as measured by total throughput, even if the additional hardware is considered to be somewhat inferior.

These results may seem counterintuitive, but they make sense when taking the Engine into consideration. The Engine is resource-intensive, especially with large datasets, and can tax a system more extensively than other less-demanding applications. Giving the Engine dedicated resources on a machine, meaning full access to CPUs, local hard drives and memory, can increase total throughput simply because there’s less resource contention between Engine processes.

Also, consider adding an SSD to your Alteryx machine, be it a laptop, workstation, or server. Putting commonly-used datasets onto the SSD (such as the Alteryx Core Data Bundle) as well as using it for the Engine temporary space may give a nice boost to your throughput.

Closing

I hope you enjoyed reading this post as much as I enjoyed composing it, and that you’re leaving with a better conceptual understanding of how to approach scaling a Server. Ultimately, your mileage will vary (remember, “it depends”!), so benchmark your Server using your workflows, and determine what works best for you.

And, as always, feel free to reach out to your contact at Alteryx for advice. We’re always happy to help.

Kudos to @Ned for feedback on the test and benchmark workflows, and @TaraM for feedback on the content.