I have two copies of the same job stuck in the Initializing status in Alteryx Server. All scheduled jobs are now getting queued and not executing. Multiple service restarts and server reboots have not resolved the issue.
I found this in the log file repeating many times:
"S:\Alteryx\Service\AlteryxService_Client\Persistence_MongoDB.cpp: 1558. PersistenceContainer_MongoDBImpl_Get_Error: Record identifier is invalid <ID_REMOVED> collection <AS_Schedules>" "PersistenceHandler_ReadBody_UnknownError: Resetting persistence containers and rethrowing."
While the server thinks these two jobs are initializing, there's no running AlteryxEngineCmd task.
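In case it helps anyone else triaging this, both observations above (the repeating persistence error and the missing engine process) can be checked from a script. This is only a rough Python sketch, not anything from Alteryx: the function names are made up for illustration, it assumes the log is plain text containing the message exactly as quoted above, and it assumes a Windows host where `tasklist` is available.

```python
import csv
import io
import re
import subprocess
from collections import Counter

# Matches the persistence error quoted above. The layout of the rest of the
# log line is an assumption, so only the message itself is matched.
INVALID_RECORD_RE = re.compile(
    r"Record identifier is invalid (?P<record_id>\S+) "
    r"collection <(?P<collection>[^>]+)>"
)


def count_invalid_records(lines):
    """Count how often the error appears per (collection, record id)."""
    counts = Counter()
    for line in lines:
        for match in INVALID_RECORD_RE.finditer(line):
            counts[(match.group("collection"), match.group("record_id"))] += 1
    return counts


def find_processes(tasklist_csv, image_prefix="AlteryxEngineCmd"):
    """Return (image name, PID) pairs from `tasklist /FO CSV /NH` output."""
    rows = csv.reader(io.StringIO(tasklist_csv, newline=""))
    prefix = image_prefix.lower()
    return [
        (row[0], int(row[1]))
        for row in rows
        if len(row) >= 2 and row[1].isdigit() and row[0].lower().startswith(prefix)
    ]


def running_engine_processes():
    """Windows only: list all processes via tasklist and keep the engine ones."""
    result = subprocess.run(
        ["tasklist", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=True,
    )
    return find_processes(result.stdout)


# Hypothetical usage; substitute the real path of your service log:
#   with open(log_path, errors="replace") as fh:
#       print(count_invalid_records(fh).most_common(5))
#   print(running_engine_processes())  # an empty list means no engine task
```

A high count for a single record in AS_Schedules together with an empty process list would match what I'm describing here, but it only confirms the symptom. It doesn't fix anything.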
I can see the two jobs in MongoDB in AS_Queue.
I tried to stop the job using Gallery Admin > Jobs > Status, but the jobs remain. I also tried using Options > View Schedules in Alteryx Designer on the server, but the "Delete Queue Entry" button is disabled for the two jobs in the Initializing status.
"AlteryxService version 2021.1.2.20534 (c) Alteryx, Inc. - All Rights Reserved."
Any suggestions on how to get the server up and running again during this holiday weekend? I opened a case with Alteryx Support on Friday morning, but haven't heard anything back.
We upgraded to 2021.1 on Friday and have the exact same problem. Two jobs are stuck in the queue as initializing and server reboots and service restarts (though we're forced to kill the process to get it to restart) have failed to fix the problem.
I'm still trying to get to the root cause. After a suggestion from support and a bit of my own experimentation, I was able to get rid of the jobs. However, I don't recommend following the same path until we find the actual cause.
In my testing, I seem to be having problems with workflows that use Gallery-defined Data Connections to SQL Server, while workflows that contain their own connection strings seem to be okay. If you do your own testing, I suggest unchecking the workflow validation box when saving to the server. You can then use the Run button in Gallery, which will time out after 30 seconds rather than getting totally stuck.
Again, I don't suggest trying this yourself, but I'll give you the details anyway. Note that I have a single-server configuration, with one machine acting as both controller and worker. To remove the stuck jobs, I did this:
- Changed the server config to NOT run unassigned jobs and added a Job tag that isn't currently used by any workflows
- Restarted the Alteryx Service (had to manually kill the task to complete the restart)
- Went to Gallery Admin > Jobs
- I was now able to use the Minus button to delete the jobs
I have narrowed down the issue to the Gallery database connections that are mentioned in @hody's post. To reiterate, if I change the Input tool connection so that it does not use the Gallery connection, the process works fine. So there seems to be an issue with the way Alteryx Server initializes a workflow that uses a Gallery connection.
This issue only appeared AFTER upgrading to Server 2021.1.