
Alteryx Server Discussions

SOLVED

Scheduled jobs - Seeking advice on how the next job to be processed is decided

Paul_Holden
9 - Comet

Hi,

 

I am seeking to better understand the scheduler / queue management process on our Alteryx server.

 

Currently we are running 4 simultaneous workflows on an 8-core (total 8 vCPU) VM server with 32GB RAM.

 

We have a significant spread in job execution times from a few seconds up to 6 hours (some workflows are hitting our specified execution time limit).

 

The specific behaviour that I do not understand is as follows:

 

We have a number of jobs scheduled to run every 10 minutes. These look for files in specific folders, process them if they are present, or just end if there is nothing to do. I'm seeing one of these schedules trigger in the early morning when several of our larger jobs are already queued or running. The new job is queued, as I would expect.

 

What then happens is that the job continues to sit at Queued status for several hours whilst other jobs complete. As an example, earlier this week I noted that a total of 133 different jobs were submitted by schedules (not just started executing but were actually created) and then completed before the "stuck" job was finally executed. At that point it had spent over 6 hours in the queue, and it took less than 2 seconds to execute as there was nothing to process.

 

We do not use schedule priority, all schedules are set with the default (?) low priority.

 

I do not understand why certain jobs are getting held up in this way.

 

The above describes the general picture I have seen since I began to look at the queue in detail a week or so ago. This morning I checked the queue at about 9am and there were two pages of queued jobs with scheduled start times between 2am and 9am, yet a job scheduled to run at 7:30am was running. So all of the other queued jobs have been skipped? I'm probably seeing more of this because I am now looking, but I still need to understand WHY some jobs are being prioritised by the engine in this way, as it is completely ruining the entire purpose of the regularly running schedules.

 

Versions
Client: 2018.3.51585
Server: 2018.3.4.51585
Server Binaries: 2018.3.4.51585
Service Layer
Master: 2018.3.4.51585

16 REPLIES
patrick_mcauliffe
14 - Magnetar

Jobs are first in, first out.

I've seen something similar to this due to a problematic data connection in the workflow (usually an unstable network drive connection).

 

When was the last time your server was rebooted?  Windows servers, in my experience, need to have a regularly scheduled reboot.

 

Loic
Alteryx

It is first come, first served, except if you use priorities and/or workflow/node affinity.

 

* The 4 simultaneous workflows are what's driving the size of your queue: the server can only execute 4 workflows at a time. The recommended number is the number of CPUs divided by 2, which is 4 for your 8 vCPUs. You could try increasing it to 5 to see if that reduces your queue size. Do not go higher.

 

It might be that your queue gets a bit too long at times and it just needs time to process. You might need to scale up (add vCPUs) or scale out (add an additional worker node).

 

* priorities - you can use these as a schedule option. It is a simple way to ensure that what is important goes through first when everything is in the queue at the same time. Customers usually define as High the workflows that have downstream dependencies or where the business is waiting for the run to be able to make decisions.

* workflow/node affinity: you need an additional worker node, say one identical to the one you have now. You can then create worker flags (for instance flag = w1 for worker1 and w2 for worker2). Adding the w1 or w2 flag to a schedule will make it run on that specific node. You could, for instance, have all workflows that take a lot of CPU, RAM and time run only on w2 and have all the others run on w1. That will increase your throughput by 100%, as you can now run 4+4=8 simultaneous workflows AND you have specialized one of the workers to run long-running workflows.

* you can also use QoS per node - see below.

 

https://help.alteryx.com/current/server/worker

 

  • Quality of Service: In an environment where multiple workers are deployed, Quality of Service determines which jobs are run by each worker. When a job request is handled by a worker, it compares the priority level of the job to the Quality of Service value for the worker. Jobs that have a value greater than or equal to the Quality of Service value for the worker will be handled by that worker. For example, if a worker has a Quality of Service of 0 and is available, the worker will handle any request. However, a worker with a Quality of Service of 3 will only handle jobs that have a value of 3 or higher. This allows resources to be reserved for higher priority requests. For normal operation with one machine configured as a worker, set quality of service to 0.
    • 0 = Low (normal workflow execution)
    • 1 = Medium
    • 2 = High
    • 3 = Critical
    • 4 = Chained application execution (all apps in the chain aside from the last)
    • 6 = Workflow validation requests
  • Job Assignment: A specific worker can be assigned to run a job. First, add a job tag for the worker, and then select that job tag when creating a schedule or running a workflow.
    • Run unassigned jobs: Select this option to use the worker to run jobs that have not been assigned a job tag.
    • Job tags: Add words that can be used to assign a specific worker to run a job. Separate multiple job tags with a comma. The same job tag can be added to multiple workers
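
The Quality of Service and job tag rules quoted above can be sketched in a few lines of Python. This is only an illustration of the documented comparison (job priority greater than or equal to the worker's Quality of Service value), not Alteryx code. The `Worker` class, the `can_handle` function and the way the two rules are combined are assumptions made for the example.

```python
from dataclasses import dataclass, field

# Priority values as listed in the quoted documentation.
PRIORITY = {"Low": 0, "Medium": 1, "High": 2, "Critical": 3}


@dataclass
class Worker:
    """Hypothetical model of a worker node's settings."""
    name: str
    qos: int = 0                                   # Quality of Service threshold
    job_tags: set = field(default_factory=set)     # tags this worker accepts
    run_unassigned: bool = True                    # "Run unassigned jobs" option


def can_handle(worker, job_priority, job_tag=None):
    """Return True if the worker is eligible for the job.

    Assumption: a job must pass BOTH the QoS check and the tag check.
    """
    if job_priority < worker.qos:
        return False          # worker is reserved for higher-priority jobs
    if job_tag is None:
        return worker.run_unassigned
    return job_tag in worker.job_tags


w1 = Worker("w1", qos=0, job_tags={"w1"})
w2 = Worker("w2", qos=3, job_tags={"w2"}, run_unassigned=False)

print(can_handle(w1, PRIORITY["Low"]))             # True
print(can_handle(w2, PRIORITY["High"]))            # False
print(can_handle(w2, PRIORITY["Critical"], "w2"))  # True
```

With settings like these, w2 would only pick up Critical jobs explicitly tagged for it, leaving w1 to take everything else.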
Paul_Holden
9 - Comet

Thanks Loic,

 

I have done some extensive reading on the options available for managing the queue via server resources and we are considering adding a second worker.

 

>>The 4 simultaneous workflows are what's driving the size of your queue. It can only execute 4 workflows at a time.

Agreed, but that doesn't explain why jobs are taking longer to process than they used to, which is what has now exposed the issue of jobs not running when expected, even considering queuing. It seems possible that we actually may have too many workflows active given our job profiles?

 

At the present time, though, the main issue is that I can't define the problem that we have, and I need to do that to justify the expense of additional resources. The reason is the issue I have stated: jobs are NOT being processed first come, first served (on our server), and therefore I'm struggling to define the impact of any particular job or jobs on the expected execution start time of another job. I should be able to say "your job is running X hours late" because "list of jobs" has to complete first. I can't define "list of jobs" because it seems to randomly include jobs that should run /after/ the delayed job.
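
One way to pin down that "list of jobs" is to look for jobs that were created after the delayed job but started before it. Under a strict first come, first served queue with a single priority level, that list should be empty. The sketch below is a minimal illustration in Python. The record layout (`id`, `created`, `started`) and the sample values are made up for the example and are not the Server Usage Report schema.

```python
from datetime import datetime


def ts(text):
    """Parse a 'yyyy-mm-dd HH:MM:SS' timestamp string."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


def queue_jumpers(jobs, delayed_id):
    """Return ids of jobs created after the delayed job that started before it."""
    delayed = next(j for j in jobs if j["id"] == delayed_id)
    return [
        j["id"]
        for j in jobs
        if j["created"] > delayed["created"] and j["started"] < delayed["started"]
    ]


# Made-up records for illustration only.
jobs = [
    {"id": "A", "created": ts("2020-07-16 02:45:00"), "started": ts("2020-07-16 09:10:00")},
    {"id": "B", "created": ts("2020-07-16 03:00:00"), "started": ts("2020-07-16 03:05:00")},
    {"id": "C", "created": ts("2020-07-16 07:30:00"), "started": ts("2020-07-16 07:31:00")},
    {"id": "D", "created": ts("2020-07-16 02:30:00"), "started": ts("2020-07-16 02:31:00")},
]

print(queue_jumpers(jobs, "A"))  # ['B', 'C']
```

A non-empty result for a Low priority job, with no job tags involved, would be evidence that the queue is not being drained in creation order.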

 

This morning, at approximately 09:00 I had four jobs running which were scheduled for 03:00, 03:30, 05:30 and 07:30 am 

 

At the same time I had 31 jobs queued with scheduled execution times running from 02:45 to 07:04

 

I am struggling to understand what is happening here.

 

I note patrick's comment and I am considering how to investigate that further but I'm struggling to see how this would impact 31 different workflows many of which have been working without any obvious issues for months now.

Loic
Alteryx

@Paul_Holden 

There are tools to help you understand what's happening. Download the "Server Usage Report" tool at https://licenses.alteryx.com/

This tool is an actual Alteryx workflow that will connect to your Alteryx Server MongoDB and run queries to be able to create an admin dashboard.

You can output to XLS and PDF, or to a Tableau dashboard (you will need to have Tableau Desktop in-house). You can schedule this to run daily or weekly to help you manage your Alteryx Server.

 

Go to the "Alteryx Server" section. Go to the specific version for your server. 

Capture.JPG

 

There will be instructions when you open the tool which is an Alteryx workflow. You can find more information here as well: 

https://community.alteryx.com/t5/Engine-Works/Alteryx-Server-Usage-Monitoring-amp-Reporting/ba-p/356...

https://help.alteryx.com/current/server/server-usage-report

 

Notes:

1- do a remote desktop connection to the Alteryx Server and use the Alteryx Designer that's installed there - that will ensure you can connect to the MongoDB

2- The MongoDB password is required. You need the regular password, NOT the admin password. It can be found in the "Server System Settings" application on the server. There should be a shortcut to it on the Desktop of the machine where Alteryx Server is installed.

 

Also, your Alteryx sales or pre-sales contacts can help you identify why there is a bottleneck or strange behavior. We have tools that use the output of the Server Usage Report, together with assumptions about future usage, to let you know how much capacity (cores) you need.

 

I can see that you opened a case with support. I am going to contact your pre-sales so they can help you out.

Paul_Holden
9 - Comet

Thanks Loic,

 

I've been using the Alteryx Server Usage Report to generate the data that caused me to raise the initial question.

 

I'll see what support/pre-sales come back with. My reason for raising this as a question here at the same time was that I felt that this might be situational, based on workload profile or specific workflow design, and so possibly more likely to be something that someone else in the community had experienced than something that would be understood by support without escalating to someone more familiar with the engine/scheduler.

 

I'm okay with closing down this thread if you think it is likely to generate more confusion than enlightenment in future readers?

 

BTW I couldn't get the XLSX output to give properly formatted data: the date formats are inconsistent between records/fields and non-standard, and some fields are oddly truncated. The summary PDF is obviously too high level to generate any insight, although useful in its own way. I took a look at the Tableau output, but my skills in that area are somewhat lacking, so whilst it usefully highlighted areas of concern in our scheduling profile I was still failing to understand why low resource jobs were being queued for so long. In the end I hacked the Tableau output option to export to a SQL database, as my background is SQL Server. It's possible that in doing that I've not understood the underlying data schema and what it is telling me?

Loic
Alteryx

@Paul_Holden - Great to hear that you were already using the Server Usage Report. Not a lot of customers know about it and it is a great tool.

If you decide to go with the Excel output, it will give you 2 outputs: "AppExecutions.xlsx" and "SessionActivity.xlsx". 2 snapshots below.

 

The dates are in the Alteryx ISO date format yyyy-mm-dd HH:MM:SS. They can easily be converted in Excel, or you can load these Excel files into Alteryx Designer: it will recognize them as dates automatically and you can convert them to anything you prefer for your Excel output.
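
If the values arrive as plain text, the yyyy-mm-dd HH:MM:SS format can also be parsed outside Excel. The following is a minimal Python sketch using only the standard library. The function name and the sample timestamps are invented for illustration.

```python
from datetime import datetime

# strptime pattern matching the yyyy-mm-dd HH:MM:SS layout described above.
ALTERYX_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_alteryx_datetime(text):
    """Convert a 'yyyy-mm-dd HH:MM:SS' string into a datetime object."""
    return datetime.strptime(text.strip(), ALTERYX_FORMAT)


# Made-up timestamps: when a job was queued and when it started.
queued = parse_alteryx_datetime("2020-07-16 02:45:00")
started = parse_alteryx_datetime("2020-07-16 09:10:30")

wait = started - queued
print(wait)  # 6:25:30
```

Once both columns are real datetimes, the time spent in the queue is a simple subtraction.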

Capture1.JPG
Capture2.JPG

 

* One of the links I provided has a video that explains how to use the Tableau Dashboard. Were you able to watch it?

raychase
11 - Bolide

Hey OP,

 

There is a little known defect with certain versions of Alteryx Server that causes the scheduler to handle the queue via last in - first out methodology. This is why your jobs are sitting in the queue forever. Basically, once they're queued, they won't run until they are the last item in the queue.

 

I discovered this during extensive troubleshooting in our sandbox environment and it was confirmed by an Alteryx Support CSE.

 

This issue was resolved in 2019 versions and beyond. If upgrading isn't an option for you, I can confirm that leveraging the priority functionality will override this glitch. For example, if you want your quick running jobs to run as soon as a processing engine becomes available, you should bump its priority up to something > low.
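
To see why last in, first out ordering leaves an early job stuck while newer jobs keep arriving, and why raising its priority gets it out, here is a toy simulation in Python. It is not the Alteryx scheduler: it assumes a single execution slot, one job finishing per tick, and that priority is compared before queue order. The job names are invented.

```python
def run_order(arrivals, policy):
    """Simulate one execution slot where one job runs per tick.

    arrivals[t] is a list of (name, priority) jobs submitted at tick t.
    policy is "fifo" or "lifo" and decides the order among equal priorities.
    Returns the job names in the order they start.
    """
    sign = 1 if policy == "fifo" else -1
    queue = []   # entries are (priority, sequence_number, name)
    order = []
    seq = 0
    tick = 0
    while tick < len(arrivals) or queue:
        if tick < len(arrivals):
            for name, priority in arrivals[tick]:
                queue.append((priority, seq, name))
                seq += 1
        if queue:
            # Highest priority first, then oldest (fifo) or newest (lifo).
            nxt = min(queue, key=lambda e: (-e[0], sign * e[1]))
            queue.remove(nxt)
            order.append(nxt[2])
        tick += 1
    return order


def make_arrivals(poller_priority):
    """A quick 'poller' job queued first, then a steady stream of later jobs."""
    return [
        [("poller", poller_priority), ("big1", 0), ("big2", 0)],
        [("job3", 0)],
        [("job4", 0)],
        [("job5", 0)],
    ]


print(run_order(make_arrivals(0), "fifo"))
# ['poller', 'big1', 'big2', 'job3', 'job4', 'job5']
print(run_order(make_arrivals(0), "lifo"))
# ['big2', 'job3', 'job4', 'job5', 'big1', 'poller']
print(run_order(make_arrivals(1), "lifo"))
# ['poller', 'job3', 'job4', 'job5', 'big2', 'big1']
```

Under the "lifo" policy the poller only runs once nothing newer is waiting, which matches the behaviour described in the original question. Giving it a priority above Low lets it start first in this model.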

 

It's laughable that this behavior existed for so long without people noticing. You are now officially the first person that I've seen mention it.

Paul_Holden
9 - Comet

"There is a little known defect with certain versions of Alteryx Server that causes the scheduler to handle the queue via last in - first out methodology. This is why your jobs are sitting in the queue forever. Basically, once they're queued, they won't run until they are the last item in the queue."

 

Thanks, I have had this confirmed by Alteryx support on a call this afternoon.

Paul_Holden
9 - Comet

Hi Loic,

 

When I open the resulting AppExecutions I get the following... (this on Office 365 under Win 10 but the same happens on Office 2010 under Win 7)

 

Alteryx_Server_Usage_Report_openingInExcel_001.PNG
Alteryx_Server_Usage_Report_openingInExcel_002.PNG

 

The XML is just a repeat of the window message...

 

<?xml version="1.0" encoding="UTF-8" standalone="true"?>
<recoveryLog xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <logFileName>error165480_01.xml</logFileName>
  <summary>Errors were detected in file '<Redacted Path>\AppExecutions.xlsx'</summary>
  <repairedRecords>
    <repairedRecord>Repaired Records: String properties from /xl/worksheets/sheet1.xml part</repairedRecord>
  </repairedRecords>
</recoveryLog>

 

The resulting Excel workbook looks like this... (App and Runner information redacted)

Note the odd date formatting and the truncated strings, e.g. "War" for "Warning", "Sche" for "Scheduled", etc.

 

Alteryx_Server_Usage_Report_openingInExcel_003.PNG

 

This is not even consistent across the entire report, further down the data formats change and the truncation is more severe...

 

Paul_Holden_0-1594916057819.png