Alteryx Server Discussions

Long running server jobs won't cancel

livsmith
5 - Atom

Hi all,

 

I'm trying to figure out the cause of an issue with our internal Alteryx server to avoid re-creating this problem.

 

We had a series of jobs running every two hours successfully for about a week.  Last weekend, two of them ran LONG (as in, for days).  This is strange because

 - There's no reason these should go for anything over 1.5 hours. 

 - My understanding is that our server has a 5 hour time limit set. 

 - When trying to cancel these jobs manually from the server, the task status would change to "cancelling" and then back to "running" on a refresh.

 - This happened on two different worker nodes, one with a Friday evening job, the other with a Saturday afternoon job.

 

The "solution" to this was for our sys admin to reboot those two worker nodes.  I haven't been able to find logs for the long running jobs.

 

If anyone has any thoughts about what might cause this and where to look to avoid recreating it, that would be marvellous.

6 REPLIES
DiganP
Alteryx Alumni (Retired)

@livsmith Very strange! You can probably send a ticket to Support to understand the behavior. Please include the logs from the worker node(s) and the controller node.

 

Logs:

Service Logs - C:\ProgramData\Alteryx\Service
Gallery Logs - C:\ProgramData\Alteryx\Gallery\Logs
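
If it helps, a rough Python sketch like the one below can pull the relevant lines out of the service logs before you attach them to the ticket. The folder is just the default service log path above, and the search term is a placeholder; swap in the job ID or keyword you actually care about, and adjust the glob if your log files use a different extension.

# Rough sketch: print service log lines containing a given term.
# LOG_DIR is the default path listed above; SEARCH_TERM is a placeholder.
import pathlib

LOG_DIR = pathlib.Path(r"C:\ProgramData\Alteryx\Service")
SEARCH_TERM = "Cancelling"  # hypothetical keyword or job ID

for log_file in sorted(LOG_DIR.glob("*.log")):
    with open(log_file, encoding="utf-8", errors="replace") as fh:
        for line_no, line in enumerate(fh, start=1):
            if SEARCH_TERM in line:
                print(f"{log_file.name}:{line_no}: {line.rstrip()}")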

Digan
Alteryx
JohnBell
8 - Asteroid
If you go to the worker node and open View Schedules, are you sure it wasn't stuck in the "Initializing" stage?  I've had that happen before, but you do need to go to the node and check View Schedules.  If it is stuck at "Initializing", then yes, I've had to restart the service, at which point it becomes "unstuck" and finishes running successfully.
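
If you end up restarting the service often, a small Python sketch like this can do the stop/start for you. I'm assuming the Windows service is named "AlteryxService" here; confirm the actual name in services.msc before relying on it, and run it from an elevated prompt.

# Minimal sketch: restart the Alteryx service on the worker node.
# "AlteryxService" is an assumed service name; verify it in services.msc.
import subprocess

SERVICE_NAME = "AlteryxService"

subprocess.run(["net", "stop", SERVICE_NAME], check=True)   # stop the service
subprocess.run(["net", "start", SERVICE_NAME], check=True)  # start it again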
KJennings
9 - Comet

Has a cause for this issue been determined?

 

Our location is running Alteryx Server v2018.4.5 on a single server, and at the moment we have three workflows simultaneously exhibiting this behavior. All three workflows were initiated as scheduled tasks.  One has been running for four days, the remaining two for two days.  We are not completely dead, as we still have a couple of slots available to run tasks, but we are concerned that we could lose use of the server at any moment.

 

We are going to attempt a service restart/reboot after business hours to minimize user impact.

 

Our concern is that we do not know what is causing this to happen, or where to look to investigate.  We need to identify the cause so that we can proactively address the issue and prevent future occurrences.

 

Thank you,

Kevin Jennings

raychase
11 - Bolide

From my experience, this behavior occurs when the worker nodes are configured to run too many simultaneous jobs. If your number of simultaneous jobs exceeds 0.5 × the number of CPU cores on the host machine, you run the risk of these sorts of perpetually running jobs.

 

Unfortunately, when this behavior occurs, the only mechanism we've identified for freeing up the processing engine is to restart the services on the worker node.
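
If you want to sanity-check your own worker against that guideline, here's a quick back-of-the-envelope sketch. There's nothing Alteryx-specific in it: the configured job count is a placeholder for whatever your worker's simultaneous-workflow setting actually is, and os.cpu_count() reports logical cores, so adjust if you size against physical cores.

# Back-of-the-envelope check of the 0.5-jobs-per-core guideline.
import os

cpu_cores = os.cpu_count() or 1   # logical cores reported by the OS
configured_jobs = 4               # hypothetical: your worker's simultaneous-workflow setting
guideline_max = cpu_cores * 0.5   # guideline discussed in this thread

print(f"cores={cpu_cores}, configured={configured_jobs}, guideline max={guideline_max:.0f}")
if configured_jobs > guideline_max:
    print("Configured above the guideline; this is where we saw hung jobs.")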

KJennings
9 - Comet

@raychase 

 

Thank you for your response.  Our location is running a single-server configuration, with no separate worker nodes, but definitely with a high job-to-CPU-core ratio.  How often do you find that this occurs in your configuration?

 

Kevin J

raychase
11 - Bolide

@KJennings - when we were running with a 1:1 processing engine to CPU cores ratio, I would estimate that we'd see the perpetually running job about once per week.  Keep in mind it would also be load dependent.  We never noticed any problems when we first scaled up, but the problem got progressively worse as more and more jobs were scheduled to run on the Server.

 

Since we've added additional hardware resources (to maintain the 0.5 engines per core ratio), these scenarios have completely disappeared.

 

From my experience, it's highly advisable to follow the vendor's recommendations when it comes to CPU/RAM configurations on your host machine.