We have approximately 80 workflows scheduled to run every morning between 5am and 9am. (Our source databases finish refreshing around 4am, so we can't begin any earlier.) It has become such a log-jam that we now consistently see 60+ minute delays in our scheduler. This also seems to be affecting our Gallery apps, causing a similar amount of latency when running any of them during this window.
Are there any best practices to help alleviate these delays?
Are there specific options we could select within these workflows that would help?
Is there a way to increase queue size?
Should apps have their own server to avoid these delays?
Have we reached the limit of what a single server can handle in this situation, and should we think about upgrading or adding on?
How have others handled this situation as their user base has grown?
While the expansion of our user base has been awesome and adoption has skyrocketed, some of these growing pains are becoming painful.
Thoughts appreciated
A good way to think about bandwidth on Alteryx Server is that it is essentially an in-flow/out-flow problem. The number of workflows matters less than the total time it takes to run them.
In general, the number of concurrent processes that can run is ([CPU Cores]/2). Increasing this number doesn't necessarily help, because the constraint is usually the available CPU: you can run four things at once, but each will get half the CPU it would otherwise have, so throughput doesn't increase.
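To make that concrete, here's a rough sketch (plain Python, nothing Alteryx-specific) of why raising the concurrency setting past [CPU Cores]/2 doesn't buy throughput. It assumes jobs are purely CPU-bound and that each job can fully use about two cores.

```python
def jobs_per_hour(cpu_cores: int, concurrent_slots: int, minutes_per_job: float) -> float:
    """Approximate completed jobs per hour once the CPU is saturated."""
    # Fraction of "full speed" each slot gets when slots exceed cores/2.
    speed_per_slot = min(1.0, cpu_cores / (2 * concurrent_slots))
    effective_minutes = minutes_per_job / speed_per_slot
    return concurrent_slots * (60 / effective_minutes)

# 8 cores, 10-minute jobs: 4 slots and 8 slots both finish ~24 jobs per hour.
print(jobs_per_hour(8, 4, 10))  # 24.0
print(jobs_per_hour(8, 8, 10))  # 24.0
```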
A trick might be to prioritize these based on their size and push the smaller ones through first. You definitely don't want 40 small processes (1-2 minutes each) stuck behind several that take 30+ minutes to execute.
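For illustration, a tiny sketch of what ordering the morning queue shortest-first could look like (workflow names and runtimes below are made up):

```python
# Hypothetical morning queue: (workflow name, expected minutes).
jobs = [
    ("daily_sales_refresh", 35),
    ("inventory_snapshot", 30),
    ("region_summary", 2),
    ("exec_dashboard_feed", 1),
]

# Shortest-job-first: small workflows clear the queue quickly, which keeps
# average wait time (and Gallery latency) down even though total work is unchanged.
for name, minutes in sorted(jobs, key=lambda job: job[1]):
    print(f"schedule next: {name} (~{minutes} min)")
```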
I would also look for opportunities to optimize the workflows themselves - it's very common for users to build something as a 'full refresh' when the workflow could pull data incrementally instead.
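As a hedged illustration of the incremental idea, assuming the source has a reliable last-modified column (the table and column names here are hypothetical):

```python
import sqlite3

# In-memory stand-in for the source database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_table (id INTEGER, last_modified TEXT)")
conn.executemany("INSERT INTO source_table VALUES (?, ?)",
                 [(1, "2024-01-01"), (2, "2024-01-02"), (3, "2024-01-03")])

last_watermark = "2024-01-01"  # persisted from the previous successful run

# Pull only what changed since the last run instead of re-reading everything.
delta = conn.execute(
    "SELECT * FROM source_table WHERE last_modified > ?", (last_watermark,)
).fetchall()

print(delta)  # [(2, '2024-01-02'), (3, '2024-01-03')]
```

After a successful run you would persist the new maximum last-modified value as the watermark for the next morning's job.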
If all of the above fail to yield anything helpful, then it might indeed be time to consider increasing the size of your server. Out of curiosity - how many CPU cores do you have licensed for your server?
I was thinking 4, but based on your formula ([CPU Cores]/2) and seeing that we usually have 4 workflows running at once, we must have 8.
Most of the workflows depend on an earlier workflow having run. So we front-load the data-source-creation workflows to run early (30-45 minute run times), and everything that uses those data sources is scheduled to run after. We've tried to sprinkle the remaining longer workflows later in the time frame to spread out the run times, but we are still experiencing the jam.
As you stated, we recognize that our next step will be to review each workflow to determine if there are any opportunities for efficiency gains that may have been missed by a 'newer' user.
Not really sure if it's out there or not, but it would be great to get some quantitative feedback, such as "when we hit 200 users / 200 scheduled workflows / 200 apps on the Gallery, we added another server to our stack and immediately saw benefits." We'd really like to plan accordingly for the future: plan to add capacity vs. throttle use down...
How coordinated are these data pulls? One possible solution would be to create an input list of jobs and use a CReW runner macro to run through the jobs in sequence (or split x wide). By doing this you only tie up that number of job slots and leave the remaining slots open for other work to process.
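This isn't the CReW macro itself, just a rough sketch of the pattern it enables: run a fixed list of jobs only N wide so the other engine slots stay free for Gallery traffic. The run_job function below is a hypothetical stand-in for however you actually trigger a workflow.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_job(workflow_name: str) -> str:
    # Hypothetical placeholder for triggering a workflow (engine command line,
    # Gallery, etc.); here it just simulates a run.
    time.sleep(0.1)
    return f"{workflow_name} finished"

job_list = ["build_sources.yxmd", "sales_report.yxmd", "ops_report.yxmd"]

# "Split 2 wide": at most two jobs run at once, regardless of how long the list is.
with ThreadPoolExecutor(max_workers=2) as pool:
    for result in pool.map(run_job, job_list):
        print(result)
```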
Jobs that are just selecting data from a database use virtually no CPU or memory, so the rule of thumb for concurrent processes can almost be ignored for them. Unfortunately, there isn't a way to dynamically allow a different maximum number of jobs during the 5:00-9:00 window.

I also wonder whether your data is updated or appended each day. In other words, do existing records get deleted or have their values changed? Do new records get inserted? If the only changes are new records, you can maintain a shadow copy of the data and pull only the new records to add to your copy. If data does get deleted or changed, you might be able to use "update dates" to find the changed records. All of this requires some design consideration.
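A small sketch of the shadow-copy idea for the append-only case, using a hypothetical auto-incrementing id column as the high-water mark (table names are made up):

```python
import sqlite3

# In-memory demo tables standing in for the source and the local shadow copy.
source = sqlite3.connect(":memory:")
shadow = sqlite3.connect(":memory:")
source.execute("CREATE TABLE source_table (id INTEGER PRIMARY KEY, payload TEXT)")
shadow.execute("CREATE TABLE shadow_table (id INTEGER PRIMARY KEY, payload TEXT)")
source.executemany("INSERT INTO source_table VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])

def load_new_records() -> int:
    """Append only the records the shadow copy hasn't seen yet."""
    last_id = shadow.execute("SELECT COALESCE(MAX(id), 0) FROM shadow_table").fetchone()[0]
    new_rows = source.execute(
        "SELECT id, payload FROM source_table WHERE id > ?", (last_id,)
    ).fetchall()
    shadow.executemany("INSERT INTO shadow_table VALUES (?, ?)", new_rows)
    shadow.commit()
    return len(new_rows)

print(load_new_records())  # 3 on the first run, 0 if nothing new has arrived
```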
Just a few random thoughts.
Cheers,
Mark
@No-Sass - The way you would measure this is the in-flow/out-flow equation.
If you can do 4 simultaneous processes, that means that in a given 60-minute period you have 240 working minutes of Alteryx Engine time. You would therefore add up the execution time of every process that needs to occur in that 60-minute window.
For example, with 240 minutes of processing time, you could do 240 one-minute workflows or 120 two-minute workflows, etc.
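A quick back-of-the-envelope version of that check, with made-up runtimes, to see whether the scheduled work even fits in the window:

```python
concurrent_slots = 4
window_minutes = 60
capacity = concurrent_slots * window_minutes  # 240 engine-minutes available

# Hypothetical schedule: two long builds, 40 two-minute jobs, 20 one-minute jobs.
scheduled_runtimes = [35, 30] + [2] * 40 + [1] * 20
demand = sum(scheduled_runtimes)

print(f"capacity: {capacity} min, demand: {demand} min")
print("backlog expected" if demand > capacity else "fits in the window")
```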
Users don't necessarily drive this; it's all about which workflows are being hit. It's also important to factor in the workflows that get added to the queue from the Gallery (not scheduled, but executed from the Gallery).
Let me know if this helps!
Zak
PS - @MarqueeCrew's post is also spot on and provides other factors to consider.
All of the above suggestions are good ones. Fine-tuning a server running close to capacity can be a very bespoke exercise.
Some other ideas to look at:
In terms of the Gallery apps slowing down, there is an option in System Settings under Engine > General that should help a little. It may not help as much if a couple of really long jobs are executing concurrently and holding things up.
We are going to dissect all the flows (can't wait for Connect), identify opportunities to enhance and 'daisy-chain' them (as Mark suggested), and shift those to a production server.
Test the "Run at lower priority" suggestion and see if that provides any relief.
We anticipate both of these will provide some relief, and we can begin to gauge when we will need to add to our quiver and plan appropriately.