Alteryx Server environments typically start out being used by a single department or a small group of users. But as word of mouth spreads within the organization about all the amazing ways Alteryx can be used to ingest, transform, enrich, and analyze data, more users are added to the Server. More users means an increase in job submissions, which eventually leads to jobs waiting in a queued state for their chance to run.
Perhaps this high queued-job state appears only briefly, when many users submit or schedule their jobs to run around the same times (e.g., 8:00 am, lunch hour, etc.). But I've talked to many customers whose Server stays in this state all day long. If a Server remains in this condition, there are several factors to consider that can help reduce the number of queued jobs. These factors can be summarized as:
1. Reduce job run times.
In the above example, we have two running jobs. The quicker those two jobs complete, the sooner two queued jobs can transition into the running state, and the cycle continues through the queue. Some points to consider:
Most likely a Tool or set of Tools will be the main culprit, and you can focus on optimizing it to reduce your job run times. In the example above, 98% of the workflow execution time (23 seconds) was spent reading data from Hadoop, which is the next point…
2. Increase the number of concurrently running jobs.
Our recommended starting point for the setting Workflows Allowed to run Simultaneously is half of the physical cores. So for customers running on the minimum system requirements of a single machine with 4 cores, the initial recommended value for Workflows Allowed to run Simultaneously would be 2. (Note, this is a starting point recommendation. There are factors that may warrant a lower or higher value, but 1/2 the physical cores is a good place to start out.)
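As a rough illustration of that starting point, the sketch below simply halves the physical core count. The function name is hypothetical, and rounding down for odd core counts (with a floor of 1) is an assumption made here, not a documented Alteryx rule.

```python
def starting_simultaneous_workflows(physical_cores: int) -> int:
    """Starting value for Workflows Allowed to Run Simultaneously: half the physical cores."""
    if physical_cores < 1:
        raise ValueError("physical_cores must be at least 1")
    # Assumption: round down for odd core counts, but never go below 1.
    return max(1, physical_cores // 2)


# The 4-core minimum system requirement from the text gives a starting value of 2.
print(starting_simultaneous_workflows(4))
```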
In the original example with 8 queued jobs, if we had a second 4-core Worker with 2 Workflows allowed to run simultaneously, giving us a total of 4 jobs running concurrently instead of only 2, we would obviously cycle through the queue much faster. But how do we know how many concurrently running jobs we need? This is where Queueing Theory comes in and is the focus of this article.
Queueing Theory is the study of queues, used to predict how long queues will be (queue length) or the duration that items will be in the queue (queue time). Organizations' operations teams use queueing theory for many reasons, such as predicting response times or determining the number of resources needed to provide a service. For example, a team might ask how many customer support associates are needed to answer calls from customers. Let's look at this example with some key queueing theory terms added:
Arrivals are the customers actively calling into support for assistance. Once admitted, they are added to the queue and the queue length grows. As a customer support associate becomes available, the customer is serviced and the queue length decreases. The amount of time the customer spent in the queue is the queue time. The service time is the measure of how long it takes for the customer support associate to help resolve the customer's problem before they depart. The total time the customer spent in the system is the sum of the queue time and the service time.
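To make these terms concrete, here is a minimal sketch of a first-come, first-served support line. The function name and the sample numbers are hypothetical; it assumes arrivals are listed in time order and that every associate is interchangeable.

```python
import heapq


def simulate_fifo(arrival_times, service_times, servers):
    """Return (queue_time, service_time, total_time) for each customer.

    Assumes arrival_times is sorted and customers are served first come, first served.
    """
    free_at = [0.0] * servers  # time at which each associate next becomes available
    heapq.heapify(free_at)
    results = []
    for arrived, service in zip(arrival_times, service_times):
        earliest_free = heapq.heappop(free_at)
        started = max(arrived, earliest_free)  # the customer waits only if everyone is busy
        heapq.heappush(free_at, started + service)
        queue_time = started - arrived
        results.append((queue_time, service, queue_time + service))
    return results


# Four callers arriving one minute apart, 5-minute calls, 2 associates.
for row in simulate_fifo([0, 1, 2, 3], [5, 5, 5, 5], servers=2):
    print(row)
```

With two associates, the first two callers are answered immediately, while the third and fourth each wait in the queue until an associate frees up.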
In reality, we all apply queueing theory every day. When you are in rush hour traffic and your lane seems to be moving much slower, you discern that the lane beside you has a higher service rate, so you change lanes. Or when you are checking out at the grocery store, you see two lines with equal queue lengths, but one has a customer with an overflowing cart who is holding a 3-ring binder full of coupons. You know that line will have a high service time, so you choose the other line.
In Queueing Theory, the most commonly used formula is Little's Law, which states that the average number of customers in a system (L) is equal to the arrival rate of customers (λ) multiplied by the average time a customer spends in the system (W), that is, L = λ x W.
As a simple example, if customers arrive at a store at a rate of 100 per hour, and they stay for an average of 30 minutes, then we should find approximately 50 customers in the store at any given time.
L = 100 x (30/60) = 50
The law can also be applied to sub-systems within the store, for example, the checkout line. If customers arrive at the checkout line at a rate of 50 per hour, and it takes them approximately 10 minutes from the time they enter the line to the time they are finished checking out, then we should find approximately 8 people standing in the checkout line at any given time.
L = 50 x (10/60) = 8.3333
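The two calculations above can be reproduced with a small helper. The function name is hypothetical; the only requirement is that the arrival rate and the time in the system use the same time unit (hours here).

```python
def littles_law(arrival_rate: float, time_in_system: float) -> float:
    """Average number of items in the system: L = arrival rate * time in system.

    Both arguments must be expressed in the same time unit.
    """
    return arrival_rate * time_in_system


in_store = littles_law(100, 30 / 60)    # 100 customers/hour, 30 minutes each
at_checkout = littles_law(50, 10 / 60)  # 50 customers/hour, 10 minutes each
print(f"in store: {in_store:.1f}, at checkout: {at_checkout:.1f}")
```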
This law holds for any system that is stable over the long run: banks, grocery stores, network switches, message-passing systems, and even Alteryx Server.
Queueing theory gets even more interesting when we start looking at the different types of queueing models. There are models for single queues, multiple queues, single servers handling queued requests, multiple servers handling queued requests, incoming requests arriving at fixed intervals, random intervals, etc… These models are almost always described in Kendall's notation, which follows the format of A/S/c.
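Alteryx Server matches the M/M/c queueing model, which can be described as follows: the first M means jobs arrive at random, following a Poisson process; the second M means job runtimes (the service times) follow an exponential distribution; and c is the number of servers, which for Alteryx Server is the total number of workflows allowed to run simultaneously.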
When charted we can observe the job runtimes follow an exponential distribution.
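Besides charting, one quick sanity check is the coefficient of variation: for an exponential distribution the standard deviation equals the mean, so the ratio should be close to 1. The sketch below uses synthetic runtimes as a stand-in for real job history (an assumption made here); the function name is hypothetical.

```python
import random
import statistics


def coefficient_of_variation(runtimes):
    """Standard deviation divided by the mean; close to 1.0 for exponential data."""
    return statistics.pstdev(runtimes) / statistics.mean(runtimes)


random.seed(42)
# Synthetic stand-in for real job runtimes: exponential with a 5-minute mean.
sample = [random.expovariate(1 / 5) for _ in range(10_000)]
cv = coefficient_of_variation(sample)
print(f"mean={statistics.mean(sample):.2f} min, cv={cv:.2f}")
```

A ratio far from 1 would suggest that the runtimes are not well described by an exponential distribution, in which case M/M/c results should be treated as rough estimates.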
Having identified that Alteryx Server matches an M/M/c queue, we can take advantage of proven equations to derive queue length, wait times, utilization, and more. All we need for these equations is the arrival rate (λ), the service rate (µ), and the number of servers (c), where the service rate is derived from the average job runtime (x).
Our Service Rate (µ) is calculated from the number of jobs we can run in 1 hour using the average job runtime (x) above. For example, if our average job runtime (x) is 5 minutes, our Service Rate is 12 jobs per hour.
µ = 60 / 5 = 12
(Note: hours and minutes do not have to be used here. Whatever units you decide to use, just make sure to standardize across both the arrival rate and service rate.)
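The unit point can be shown with a small helper that takes the period as a parameter. The function and parameter names are hypothetical; the average runtime and the period just need to share a unit.

```python
def service_rate(avg_runtime: float, period: float = 60.0) -> float:
    """Jobs a single slot can complete per period.

    avg_runtime and period must share a unit (minutes by default).
    """
    if avg_runtime <= 0:
        raise ValueError("avg_runtime must be positive")
    return period / avg_runtime


per_hour_from_minutes = service_rate(5)          # 5-minute jobs, 60-minute period
per_hour_from_seconds = service_rate(300, 3600)  # same jobs expressed in seconds
print(per_hour_from_minutes, per_hour_from_seconds)
```

Both calls describe the same workload, so they give the same rate of 12 jobs per hour.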
With an arrival rate, service rate, and number of servers, we can predict the queueing behavior of an Alteryx Server. The most basic equation calculates the system utilization (ρ), which is the arrival rate divided by the combined service rate of all servers:
ρ = λ / (c * µ)
Example - Jobs arrive at a rate of 10 per hour, and our average job runtime is 5 minutes. We already calculated that our service rate is 12. The Alteryx Server environment is configured with 2 total simultaneous workflows.
ρ = 10 / (2 * 12) = 0.4167
This tells us that, on average, our two simultaneous workflows will be busy roughly 42% of the time. A value greater than 1 indicates the system is not sized large enough to handle the workload, since utilization cannot exceed 100%.
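The utilization example can be checked in a few lines, and the same helper makes it easy to compare different numbers of simultaneous workflows. The function name is hypothetical.

```python
def utilization(arrival_rate: float, service_rate: float, servers: int) -> float:
    """System utilization: arrival rate / (servers * service rate)."""
    return arrival_rate / (servers * service_rate)


# 10 jobs/hour arriving, 12 jobs/hour per slot, 1 to 4 simultaneous workflows.
for servers in range(1, 5):
    rho = utilization(10, 12, servers)
    print(f"{servers} simultaneous workflow(s): utilization {rho:.4f}")
```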
Knowing that system utilization (ρ) cannot be greater than 100%, we can set ρ to 1 and solve for the minimum number of simultaneous workflows (c) required with the simplified equation:
c = λ / µ
Example - Using the same numbers from above:
c = 10 / 12 = 0.8333
We always need to take the ceiling of this value, so at a minimum we would need 1 simultaneous workflow. In this case, a job would be running approximately 83% of the time, compared to roughly 42% of the time in the previous example with 2 simultaneous workflows.
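A minimal sketch of this sizing step follows. The function name and the `stable` flag are hypothetical. One caveat from standard M/M/c theory: when the arrival rate divided by the service rate is already a whole number, the ceiling leaves utilization at exactly 100% and the queue keeps growing, so the flag adds one more slot in that case.

```python
import math


def min_simultaneous_workflows(arrival_rate: float, service_rate: float,
                               stable: bool = False) -> int:
    """Minimum number of servers implied by c = arrival rate / service rate.

    stable=False takes the ceiling, as described in the text.
    stable=True requires utilization strictly below 100%.
    """
    offered_load = arrival_rate / service_rate
    if stable:
        return math.floor(offered_load) + 1
    return max(1, math.ceil(offered_load))


print(min_simultaneous_workflows(10, 12))               # example from the text
print(min_simultaneous_workflows(24, 12))               # exactly 2 slots of work
print(min_simultaneous_workflows(24, 12, stable=True))  # one extra slot to keep the queue bounded
```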
This is just the start. There are many other equations that can calculate the average number of queued jobs, how long the jobs will wait in the queue, the probability of queued jobs, and more. These equations are documented in academic papers and are available to experiment with in online calculators. But, since my Caribou Blend is getting low, we'll pick up in a Part 2 article with a sample Workflow to calculate these values for us across a range of server (simultaneous workflows) values, so we can understand what types of queueing behavior we might see in our Alteryx Server given specified job arrival rates and job runtimes.
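As a rough preview of those equations, the sketch below uses the standard Erlang C formula for an M/M/c queue to estimate the probability that a job has to queue, the average number of queued jobs, and the average queue time. It is an illustration written in Python rather than the Part 2 Workflow, and the function and key names are hypothetical.

```python
import math


def mmc_metrics(arrival_rate: float, service_rate: float, servers: int) -> dict:
    """Steady-state M/M/c metrics from the Erlang C formula."""
    offered_load = arrival_rate / service_rate
    rho = offered_load / servers
    if rho >= 1:
        raise ValueError("utilization must be below 100% for a stable queue")
    # Term for the states where every server is busy.
    busy = offered_load ** servers / math.factorial(servers) / (1 - rho)
    # Terms for the states where at least one server is idle.
    idle = sum(offered_load ** k / math.factorial(k) for k in range(servers))
    p_wait = busy / (idle + busy)            # probability an arriving job must queue
    avg_queued = p_wait * rho / (1 - rho)    # average number of queued jobs
    avg_wait = avg_queued / arrival_rate     # average queue time (Little's Law)
    return {"utilization": rho, "p_wait": p_wait,
            "avg_queued": avg_queued, "avg_wait": avg_wait}


# 10 jobs/hour arriving, 12 jobs/hour per slot, 1 to 4 simultaneous workflows.
for servers in range(1, 5):
    m = mmc_metrics(10, 12, servers)
    print(f"c={servers}: queued={m['avg_queued']:.3f}, "
          f"wait={m['avg_wait'] * 60:.2f} min, P(wait)={m['p_wait']:.3f}")
```

With the example numbers, going from 1 to 2 simultaneous workflows drops the average number of queued jobs from about 4.2 to under 0.2, which shows how sharply queueing falls once utilization moves away from 100%.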
If you would like to have a discussion to ensure your Alteryx Server environment is sized optimally, please reach out to your Alteryx representative for an architecture review session and we'll be glad to support you.
In the meantime, Happy Alteryxing!
David has the privilege to lead the Alteryx Solutions Architecture team helping customers understand the Alteryx platform, how it integrates with their existing IT infrastructure and technology stack, and how Alteryx can provide high performance and advanced analytics. He's passionate about learning new technologies and recognizing how they can be leveraged to solve organizations' business problems.