Long running server jobs won't cancel

Question

Hi all,

I'm trying to figure out the cause of an issue with our internal Alteryx server to avoid re-creating this problem.

We had a series of jobs running every two hours successfully for about a week. Last weekend, two of them ran LONG (as in, for days). This is strange for a few reasons:

- There's no reason these should go for anything over 1.5 hours.

- My understanding is that our server has a 5 hour time limit set.

- When trying to cancel these jobs manually from the server, the task status would change to "cancelling" and then back to "running" on a refresh.

- This happened on two different worker nodes, one with a Friday evening job, the other with a Saturday afternoon job.

The "solution" to this was for our sys admin to reboot those two worker nodes.  I haven't been able to find logs for the long running jobs.

If anyone has any thoughts about what might cause this and where to look to avoid recreating it, that would be marvellous.

raychase · Answer

@KJennings - when we were running with a 1:1 ratio of processing engines to CPU cores, I would estimate that we saw a perpetually running job about once per week. Keep in mind that it was also load dependent. We never noticed any problems when we first scaled up, but the problem got progressively worse as more and more jobs were scheduled to run on the Server.

Since we've added additional hardware resources (to maintain the 0.5 engines per core ratio), these scenarios have completely disappeared.

From my experience, it's highly advisable to follow the vendor's recommendations when it comes to CPU/RAM configurations on your host machine.

KJennings · Answer

@raychase

Thank you for your response. Our location is running a single-server configuration, no nodes, but definitely with a high job-to-CPU ratio. How often do you find that this occurs in your configuration?

Kevin J

raychase · Answer

In my experience, this behavior occurs when the worker nodes are configured to run too many simultaneous jobs. If the number of simultaneous jobs exceeds 0.5 times the number of CPU cores on the host machine, you run the risk of these sorts of perpetually running jobs.
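As a rough illustration of that rule of thumb, here is a minimal Python sketch that checks a configured job count against the 0.5-per-core ratio mentioned above. The function names are hypothetical and nothing here calls an Alteryx API. It also assumes `os.cpu_count()` is an acceptable stand-in for the core count, although it reports logical cores, which may not be what the vendor guidance counts.

```python
import os

# Ratio quoted in this thread; confirm against the vendor's own sizing guidance.
ENGINES_PER_CORE = 0.5


def max_simultaneous_jobs(cpu_cores: int, ratio: float = ENGINES_PER_CORE) -> int:
    """Largest whole job count at or below ratio * cores, with a floor of 1."""
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    return max(1, int(cpu_cores * ratio))


def is_oversubscribed(configured_jobs: int, cpu_cores: int,
                      ratio: float = ENGINES_PER_CORE) -> bool:
    """True when the configured job count exceeds ratio * cores."""
    return configured_jobs > cpu_cores * ratio


if __name__ == "__main__":
    # os.cpu_count() returns logical cores and can return None on some platforms.
    cores = os.cpu_count() or 1
    configured = 8  # hypothetical value; substitute your worker's actual setting
    print(f"cores={cores}, suggested max jobs={max_simultaneous_jobs(cores)}")
    print(f"configured={configured}, oversubscribed={is_oversubscribed(configured, cores)}")
```

Running a check like this on each worker before raising the simultaneous job setting gives a quick sanity test, but it only encodes the ratio discussed here, not any guarantee that hung jobs will not occur.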

Unfortunately, when this behavior occurs, the only mechanism we have identified for freeing up the processing engine is to restart the services on the worker node.