Fix Gallery crash on high CPU load / HTTP 404

Question

During our "busy season", the web frontend of the Alteryx Gallery sometimes crashes and returns HTTP 404 until the Alteryx Service is manually restarted.

The Alteryx Engine is still running in the background, but users are complaining because they cannot track the progress of their workflow in the Web UI.

Restarting the Alteryx Service requires forcibly closing all Alteryx Engine processes.

I found someone with the same problem in the community, but there was no solution yet.

https://community.alteryx.com/t5/Alteryx-Server-Discussions/Gallery-goes-inaccessible-frequently-404-File-or-directory-not/m-p/465984/highlight/false#M4727

An investigation with one of the Alteryx Support Specialists revealed that this crash is caused by a lack of CPU resources on the server.

(The CPU is exhausted by a third-party app, which is used by our Alteryx workflows, but this can also happen due to R or Python scripts)

From an enterprise-level server application, I expect robustness even when hardware capacity is low.

I am OK with the Gallery being unavailable or slowing down while the CPU is completely occupied.

But I expect an enterprise application to recover once resources are available again without losing any information or progress.

I wish that the Alteryx Gallery would no longer crash when the CPU runs at 100% for a longer time, or that it would at least restart automatically after a crash.

asmith · Answer

Hi @leonhast ,

My name is Austin from the Server support team here at Alteryx, and I want to thank you for your input on our Server product. I want to clarify a few of the points you raised and give our recommendations for this issue. We consider this to be a configuration issue rather than a resource exhaustion issue: we do not expect CPU to be unavailable, or at least not for an extended period of time, and we have a few performance guidelines that put you in the best position for resource allocation.

Specifically, Gallery has a 5-minute timeout period. If our service receives no ping responses from the Gallery node for 5 minutes, it asks for a restart by sending a shutdown request to Gallery. When an HTTP 200 response is received, affirming that Gallery will shut down, we wait for the Gallery process to exit and then request that a new Gallery process be spawned. If we see a response code other than 200 to the shutdown request, which we send 3 times, we terminate the Gallery process and do not spawn another one, because we do not believe Gallery will spawn properly again; at that point user intervention is required.
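The decision logic described above can be sketched roughly as follows. This is only an illustration of the behaviour as described in this post, not Alteryx's actual implementation: the function name `decide_action`, the `send_shutdown` callback, and the returned labels are hypothetical, and treating "3 times" as "up to 3 attempts, stopping at the first 200" is an assumption.

```python
from typing import Callable

PING_TIMEOUT_SECONDS = 5 * 60  # the 5-minute Gallery timeout described above
MAX_SHUTDOWN_ATTEMPTS = 3      # the shutdown request is sent up to 3 times (assumed reading)


def decide_action(seconds_since_last_ping: float,
                  send_shutdown: Callable[[], int]) -> str:
    """Return what the service would do with the Gallery process.

    send_shutdown is a hypothetical callback that sends one shutdown
    request to Gallery and returns the HTTP status code of the response.
    """
    # Gallery answered a ping recently enough: nothing to do.
    if seconds_since_last_ping < PING_TIMEOUT_SECONDS:
        return "keep_running"

    # No ping response for 5 minutes: ask Gallery to shut down.
    for _ in range(MAX_SHUTDOWN_ATTEMPTS):
        if send_shutdown() == 200:
            # Gallery confirmed the shutdown: wait for it to exit,
            # then spawn a new Gallery process.
            return "restart"

    # No attempt returned 200: terminate Gallery and do not respawn it.
    # This is the state in which a manual restart is needed.
    return "terminate_no_respawn"
```

Under this reading, a Gallery that is too starved of CPU to answer the shutdown request with a 200 ends up in the last branch, which would match a frontend that stays down until someone intervenes.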

We specifically do not anticipate that the full CPU will be used; we expect at least some overhead to be left for all of the processes running in a single-node Server environment. When all resources are used, not only are our services fighting for resources, the OS is also struggling as it tries to hand out resources to all of the processes requesting them. If you are configuring your server to use all resources for running workflows and leaving no overhead for other processes, that is what is causing your issue. We can handle a high CPU load as long as we are able to get those resources back within a few minutes. If the server is capped at 100% CPU usage, it is highly unlikely that the Controller/Gallery will receive the resources necessary to continue running effectively.
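To check whether a server is staying saturated longer than "a few minutes", one option is to sample CPU utilisation at a fixed interval and look for the longest unbroken stretch at or near 100%. The sketch below is a generic illustration, not an Alteryx tool: the function names, the 99% threshold, and the 10-second sampling interval are assumptions, and the 300-second limit simply mirrors the 5-minute Gallery timeout mentioned earlier.

```python
from typing import Sequence


def longest_saturated_run(samples: Sequence[float],
                          threshold: float = 99.0) -> int:
    """Length of the longest consecutive run of samples >= threshold."""
    longest = 0
    current = 0
    for value in samples:
        # Extend the current run while the CPU stays saturated,
        # otherwise start counting again from zero.
        current = current + 1 if value >= threshold else 0
        longest = max(longest, current)
    return longest


def is_at_risk(samples: Sequence[float],
               interval_seconds: float = 10.0,   # assumed sampling interval
               limit_seconds: float = 300.0,     # assumed: mirrors the 5-minute timeout
               threshold: float = 99.0) -> bool:
    """True if the CPU stayed saturated for at least limit_seconds."""
    run = longest_saturated_run(samples, threshold)
    return run * interval_seconds >= limit_seconds
```

The samples could come from any monitoring source, for example a performance counter log exported from the server. Short spikes to 100% do not trigger the check; only a sustained plateau does.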

To put it another way: when you max out all of the resources in your operating system, the less important processes are given lower priority than those needed to keep the operating system running. After those come the high-priority processes, then medium priority, then low priority. The same is true for our processes. The Alteryx Gallery is more of a front-end GUI for operating Alteryx, and the Controller and the Engine commands take precedence over Gallery because Gallery is not critical to the functioning of Alteryx. The information that Alteryx Gallery accesses, however, is persisted in the MongoDB database and is updated properly even while the Alteryx Gallery is down.

We can almost completely avoid this issue by changing the configuration of your server slightly, with minimal impact on workflow execution times. If you would, can you open a new case with Alteryx Support? We can then look through your configuration and make recommendations to ensure that our processes have adequate resources available and to give you the most uptime in your server environment.

Thank you,

Austin