Engine Works

DavidHa · 10-22-2019

Disclaimers

Although the spelling 'queuing' is also acceptable, this article will use the spelling 'queueing' because of its usage in academic circles, and more importantly, because it has 5 consecutive vowels. Leave a comment below if you know of another!
This is a looong article. Some may even think Queueing Theory is a "slow read." It's recommended at this point to clear your calendar, go grab a cup of strong coffee, and ask your admin to hold all phone calls. Okay, maybe that last one doesn't apply, but we can dream, right?

The Queued Jobs Problem

Alteryx Server environments typically start out being used by a single department or a small group of users. But as word of mouth spreads within the organization about all the amazing ways Alteryx can be used to ingest, transform, enrich, and analyze data, more users are added to the Server. More users means an increase in job submissions, which eventually leads to jobs waiting in a queued state for their chance to run.

An Alteryx Server job queue with 2 jobs running and 8 jobs waiting to run.

Perhaps this high queued job state is only seen for brief periods when many users submit or schedule their jobs to run around the same times (e.g., 8:00 am, lunch hour, etc.). But I've talked to many customers whose Server stays in this state all day long. If a Server remains in this condition, there are several factors to consider that can help reduce the number of queued jobs. These factors can be summarized as:

1. Reduce job run times.
In the above example, we have two running jobs. The quicker those two jobs complete, the sooner two queued jobs can transition into the running state, and the faster the Server cycles through the rest of the queue. Some points to consider:

Scheduling - Try to schedule long-running jobs in off-hours. A 2-hour job running overnight will likely have less impact than the same job running during peak business hours.
Profiling - The best way to understand why jobs are taking a long time to run is to execute them with Performance Profiling enabled, using the Alteryx Designer that is installed on the Alteryx Server machine. (If you have a multi-node environment, you should use the Designer on one of the Workers.) The Performance Profiling feature allows you to see how many seconds were spent processing each tool in the workflow.

Most likely a Tool or set of Tools will be the main culprit, and you can focus on optimizing it to reduce your job run times. In the example above, 98% of the workflow execution time (23 seconds) was spent reading data from Hadoop, which leads to the next point…

Data Connections - In many cases, reading data from the source system accounts for the large majority of the run time. If possible, use the In-Database tools to eliminate data movement, or work with your IT group to ensure you have the optimal data connection between the data source and the Alteryx Server environment.
Hardware - If tools like Sorts, Joins, Spatial, or Predictive tools are the bottleneck then it's time to look at upgrading the server hardware. These tools need as much memory and CPU as possible to efficiently process large data sets. Increasing the amount of RAM and CPU cores may provide reductions in job run times.

2. Increase the number of concurrently running jobs.
Our recommended starting point for the setting Workflows Allowed to run Simultaneously is half of the physical cores. So for customers running on the minimum system requirements of a single machine with 4 cores, the initial recommended value for Workflows Allowed to run Simultaneously would be 2. (Note, this is a starting point recommendation. There are factors that may warrant a lower or higher value, but 1/2 the physical cores is a good place to start out.)

In the original example with 8 queued jobs, if we had a second 4-core Worker with 2 Workflows allowed to run simultaneously, giving us a total of 4 jobs running concurrently instead of only 2, we would obviously cycle through the queue much faster. But how do we know how many concurrently running jobs we need? This is where Queueing Theory comes in and is the focus of this article.
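The arithmetic behind this starting point is simple enough to sketch in a few lines of Python. The function names and the list of Worker core counts below are made up for illustration (they are not part of any Alteryx API), and rounding odd core counts down is an assumption on my part:

```python
from typing import Sequence


def recommended_simultaneous_workflows(physical_cores: int) -> int:
    """Starting-point recommendation: half the physical cores, never less than 1.

    Rounding down for odd core counts is an assumption, not an official rule.
    """
    return max(1, physical_cores // 2)


def total_concurrent_jobs(worker_cores: Sequence[int]) -> int:
    """Sum the starting-point setting across every Worker in the environment."""
    return sum(recommended_simultaneous_workflows(cores) for cores in worker_cores)


print(recommended_simultaneous_workflows(4))  # single 4-core machine -> 2
print(total_concurrent_jobs([4, 4]))          # two 4-core Workers -> 4
```

Remember this only gives a starting value; it says nothing yet about whether that many concurrent jobs is enough for your workload.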

Queueing Theory Primer

Queueing Theory is the study of queues, used to predict how long queues will be (queue length) or how long items will spend in the queue (queue time). Operations teams use queueing theory for many reasons, such as predicting response times or determining the number of resources needed to provide a service, for example, how many customer support associates are needed to answer calls from customers. Let's look at this example with some key queueing theory terms added:

Arrivals are the customers actively calling into support for assistance. Once admitted, they are added to the queue and the queue length grows. As a customer support associate becomes available, the customer is serviced and the queue length decreases. The amount of time the customer spent in the queue is the queue time. The service time is the measure of how long it takes for the customer support associate to help resolve the customer's problem before they depart. The total time the customer spent in the system is the sum of the queue time and the service time.

In reality, we all apply queueing theory every day. When you are in rush hour traffic and your lane seems to be moving much slower, you discern that the lane beside you has a higher service rate, so you change lanes. Or when you are checking out at the grocery store, you see two lines with equal queue lengths, but one has a customer with an overflowing cart who is holding a 3-ring binder full of coupons. You know that line will have a high service time, so you choose the other line.

In Queueing Theory, the most commonly used formula is Little's Law, which states that the average number of customers in a system (L) is equal to the arrival rate of customers (λ) multiplied by the average time a customer spends in the system (W).

L = λ x W

As a simple example, if customers arrive at a store at a rate of 100 per hour, and they stay for an average of 30 minutes, then we should find approximately 50 customers in the store at any given time.

L = 100 x (30/60) = 50

The law can also be applied to sub-systems within the store, for example, the checkout line. If customers arrive at the checkout line at a rate of 50 per hour, and it takes them approximately 10 minutes from the time they enter the line to the time they are finished checking out, then we should find approximately 8 people standing in the checkout line at any given time.

L = 50 x (10/60) = 8.3333
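If you want to play with these numbers yourself, here is a minimal Python sketch of Little's Law. The function name is just a label I picked; the only thing to watch is that the arrival rate and the time in the system use the same time unit (hours here):

```python
def littles_law(arrival_rate: float, time_in_system: float) -> float:
    """Average number of items in the system: L = λ x W.

    arrival_rate and time_in_system must share a time unit
    (e.g. customers per hour and hours).
    """
    return arrival_rate * time_in_system


# The two store examples, with minutes converted to hours.
in_store = littles_law(arrival_rate=100, time_in_system=30 / 60)
in_checkout = littles_law(arrival_rate=50, time_in_system=10 / 60)

print(f"Customers in the store: {in_store:.1f}")
print(f"Customers in the checkout line: {in_checkout:.1f}")
```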

This law holds for any stable system, meaning one where arrivals do not outpace departures over the long run: banks, grocery stores, network switches, message-passing systems, and even Alteryx Server.

The M/M/c Queue

Queueing theory gets even more interesting when we start looking at the different types of queueing models. There are models for single queues, multiple queues, single servers handling queued requests, multiple servers handling queued requests, incoming requests arriving at fixed intervals, random intervals, etc… These models are almost always described in Kendall's notation, which follows the format of A/S/c.

A describes the time between arrivals to the queue.
S is the distribution of service times.
c is the capacity, or number of servers handling queued requests.

Alteryx Server matches the M/M/c queueing model which can be described as follows:

1st M - Jobs (workflows) are added to a single queue at random times (matching a Poisson process). Yes, there are fixed intervals for recurring scheduled jobs, but the high presence of on-demand jobs submitted by users in the Gallery makes arrivals closely follow a Poisson process.
2nd M - Job service times follow an exponential distribution. In most Alteryx Server environments the majority of jobs execute in a short amount of time. Some jobs take a bit longer, while there are usually very few long running jobs. The result is an exponential distribution of job runtimes. This can be verified by looking at the output of an Alteryx Server Usage Report which shows us when jobs were scheduled, started, and completed. The difference between the Started and Completed time is the job runtime.

When charted we can observe the job runtimes follow an exponential distribution.
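If you'd rather check this numerically than by eye, one rough sanity check is the coefficient of variation (standard deviation divided by mean) of the runtimes, which equals 1 for a true exponential distribution. The sketch below is only an illustration: the timestamp format and the handful of Started/Completed pairs are made up, and a real Usage Report export may look different.

```python
from datetime import datetime
from statistics import mean, pstdev

# Assumed timestamp layout; adjust to match your own report export.
FMT = "%Y-%m-%d %H:%M:%S"


def runtime_minutes(started: str, completed: str) -> float:
    """Job runtime = Completed time minus Started time, in minutes."""
    delta = datetime.strptime(completed, FMT) - datetime.strptime(started, FMT)
    return delta.total_seconds() / 60


def coefficient_of_variation(runtimes) -> float:
    """Standard deviation / mean; close to 1 suggests an exponential shape."""
    return pstdev(runtimes) / mean(runtimes)


# Made-up (Started, Completed) pairs, purely for illustration.
jobs = [
    ("2019-10-01 08:00:00", "2019-10-01 08:01:00"),
    ("2019-10-01 08:05:00", "2019-10-01 08:06:00"),
    ("2019-10-01 08:10:00", "2019-10-01 08:12:00"),
    ("2019-10-01 08:20:00", "2019-10-01 08:22:00"),
    ("2019-10-01 08:30:00", "2019-10-01 08:33:00"),
    ("2019-10-01 09:00:00", "2019-10-01 09:04:00"),
    ("2019-10-01 09:10:00", "2019-10-01 09:16:00"),
    ("2019-10-01 09:30:00", "2019-10-01 09:39:00"),
    ("2019-10-01 10:00:00", "2019-10-01 10:17:00"),
]

runtimes = [runtime_minutes(start, end) for start, end in jobs]
print(f"Average runtime: {mean(runtimes):.1f} minutes")
print(f"Coefficient of variation: {coefficient_of_variation(runtimes):.2f}")
```

A value near 1 is consistent with an exponential distribution but does not prove it; charting the runtimes is still worth doing.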

c - An Alteryx Server environment could have any number of servers handling queued jobs. This 'capacity' number is the sum of the "Workflows allowed to run simultaneously" across all Workers.

Having identified that Alteryx Server matches an M/M/c queue, we can take advantage of proven equations to derive queue length, wait times, utilization, and more. All we need for these equations is:

1. Our job arrival rate (λ). This can be jobs per hour based on standard business hours or on a 24-hour schedule, whichever makes sense for your organization.
2. Our average job runtime (x) over the same period used in #1.

Our Service Rate (µ) is the number of jobs a single simultaneous workflow can complete in 1 hour, calculated from the average job runtime (x) above. For example, if our average job runtime (x) is 5 minutes, our Service Rate is 12 jobs per hour.

µ = 60 / 5 = 12

(Note: hours and minutes do not have to be used here. Whatever units you decide to use, just make sure to standardize across both the arrival rate and service rate.)
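As a small sketch of that calculation (the function name and its `period` parameter are my own, not anything from Alteryx), the unit handling can be made explicit like this:

```python
def service_rate(avg_runtime: float, period: float = 60.0) -> float:
    """Jobs one simultaneous workflow can finish per period: µ = period / x.

    avg_runtime and period must be in the same unit. The default period of
    60 means "minutes in an hour", giving jobs per hour.
    """
    if avg_runtime <= 0:
        raise ValueError("average runtime must be positive")
    return period / avg_runtime


print(service_rate(5))          # 5-minute jobs -> 12.0 jobs per hour
print(service_rate(300, 3600))  # same jobs measured in seconds -> 12.0
```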

With an arrival rate, service rate, and number of servers, we can predict the queueing behavior of an Alteryx Server. The most basic equation calculates the system utilization:

ρ = λ / (c * µ)

λ = the job arrival rate.
µ = the job service rate.
c = the 'capacity' or number of servers - the total number of simultaneous workflows across all Workers.
ρ = system utilization - the utilization of concurrently running Workflows.
(a value of 1 indicates that all 'c' Simultaneous Workflows would be busy processing jobs)

Example - Jobs arrive at a rate of 10 per hour, and our average job runtime is 5 minutes. We already calculated that our service rate is 12. The Alteryx Server environment is configured with 2 total simultaneous workflows.

ρ = 10 / (2 * 12) = 0.4167

This tells us that, on average, our two simultaneous workflows will be busy roughly 42% of the time. A value ≥ 1 indicates the system is not sized large enough to handle the workload, since utilization cannot exceed 100% and jobs would arrive at least as fast as they can be processed.
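Here is the utilization equation as a minimal Python sketch, using the example numbers above. The function name is my own choice, and λ and µ are assumed to be in the same unit (jobs per hour):

```python
def utilization(arrival_rate: float, service_rate: float, servers: int) -> float:
    """System utilization: ρ = λ / (c * µ)."""
    return arrival_rate / (servers * service_rate)


rho = utilization(arrival_rate=10, service_rate=12, servers=2)
print(f"Utilization: {rho:.4f}")  # 0.4167

if rho >= 1:
    print("Undersized: jobs arrive at least as fast as they can be processed.")
```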

Knowing that system utilization (ρ) cannot be greater than 100%, we can set ρ to 1 and determine the minimum number of simultaneous workflows required with the simplified equation:

c = λ / µ

Example - Using the same numbers from above:

c = 10 / 12 = 0.8333

We always need to take the ceiling of this value, so at a minimum we would need 1 simultaneous workflow. In this case, a job would be running approximately 83% of the time, compared to the previous example where we had 2 simultaneous workflows and each was busy roughly 42% of the time.
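The same sizing step can be sketched in Python. The function name is hypothetical; it simply applies the ceiling described above:

```python
import math


def min_simultaneous_workflows(arrival_rate: float, service_rate: float) -> int:
    """Smallest whole number of simultaneous workflows: ceiling of λ / µ."""
    return max(1, math.ceil(arrival_rate / service_rate))


print(min_simultaneous_workflows(10, 12))  # 10 / 12 = 0.8333 -> 1
print(min_simultaneous_workflows(30, 12))  # 30 / 12 = 2.5    -> 3
```

One caveat worth keeping in mind: when λ / µ happens to be a whole number (say 24 jobs per hour against a service rate of 12), the ceiling leaves utilization at exactly 1, with no headroom at all, so in that case you would want at least one more simultaneous workflow.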

Cliffhanger

This is just the start. There are many other equations that can calculate the average number of queued jobs, how long the jobs will wait in the queue, the probability of queued jobs, and more. These equations are documented in academic papers such as this one, and are available to play with in online calculators such as this one. But, since my Caribou Blend is getting low, we'll pick up in a Part 2 article with a sample Workflow to calculate these values for us across a range of server (simultaneous workflows) values, so we can understand what types of queueing behavior we might see in our Alteryx Server given specified job arrival rates and job runtimes.
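For the impatient, here is a rough Python sketch of what those equations look like, assuming they are the textbook M/M/c results (the Erlang C formula plus Little's Law). This is not the Part 2 workflow, and the function and key names are my own:

```python
from math import factorial


def mmc_metrics(arrival_rate: float, service_rate: float, servers: int) -> dict:
    """Textbook M/M/c metrics; λ and µ must use the same time unit."""
    offered_load = arrival_rate / service_rate   # a = λ / µ
    rho = offered_load / servers                 # ρ = λ / (c * µ)
    if rho >= 1:
        raise ValueError("utilization >= 1: the queue would grow without bound")

    # Erlang C: probability that an arriving job finds every server busy.
    busy_term = offered_load ** servers / factorial(servers) / (1 - rho)
    idle_terms = sum(offered_load ** k / factorial(k) for k in range(servers))
    p_wait = busy_term / (idle_terms + busy_term)

    queue_length = p_wait * rho / (1 - rho)      # average number of queued jobs
    queue_time = queue_length / arrival_rate     # average wait, via Little's Law
    return {
        "utilization": rho,
        "p_wait": p_wait,
        "queue_length": queue_length,
        "queue_time": queue_time,
        "time_in_system": queue_time + 1 / service_rate,
    }


# Same example as before: 10 jobs per hour, service rate of 12, c from 1 to 3.
for c in (1, 2, 3):
    m = mmc_metrics(10, 12, c)
    print(f"c={c}: P(wait)={m['p_wait']:.3f}, "
          f"queued jobs={m['queue_length']:.3f}, "
          f"queue time={m['queue_time'] * 60:.2f} min")
```

Running it across a few values of c shows how quickly queueing drops off as simultaneous workflows are added.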

If you would like to have a discussion to ensure your Alteryx Server environment is sized optimally, please reach out to your Alteryx representative for an architecture review session and we'll be glad to support you.

In the meantime, Happy Alteryxing!