Community Spring Cleaning week is here! Join your fellow Maveryx in digging through your old posts and marking comments on them as solved. Learn more here!

Engine Works

Under the hood of Alteryx: tips, tricks and how-tos.
carlosteixeira
15 - Aurora
15 - Aurora

Scenario

 

Alteryx Server has functionality that we can configure in Alteryx System Settings that cancels jobs that have been running for a certain time limit.

 

carlosteixeira_0-1644606965770.png

 

Above, we can define, in seconds, the maximum amount of time allowed for a workflow to be executed. For the environment we were working on, this limit was 3600 seconds (1 hour).

 

Problem

 

In this scenario, any workflow that was scheduled and executed for a time longer than the configured 3600 seconds would be automatically canceled by the system.

 

It turns out that this setting only works for workflows that are scheduled in the Gallery. If the user runs the workflow manually through the Gallery, this configuration does not work.

 

The user put the workflow to run manually and left it running for hours.

 

In this scenario, in large environments, with many users doing this we have the problem of the execution queue. Many workflows running at the same time in the Gallery, manually generating a queue of more than 20 workflows waiting to be executed, even though they are scheduled.

 

For us environment administrators this is almost impossible to manage because we don't know which workflows can be canceled or not.

 

carlosteixeira_1-1644607055377.png

Solution

 

When we realized that this was impacting our environment, the client asked us to study a possible solution. With the help of my friends @Thableaus and @marcusblackhill we managed to arrive at the solution below.

 

We created a workflow that monitors what is running on the Alteryx Server and if it identifies a job that has been running for more than a certain time, regardless of whether it is scheduled or manual, it cancels the job via a Download tool call.

 

carlosteixeira_2-1644607087141.png

 

Below I will explain a little bit of the flow that is divided into 4 parts.

 

Part 1

 

In this first part, we went to MongoDB to identify which sessions we had active within the database, and using Filters/Summarize/Joins we got only 1 active session to be used as a credential at the end of the 4 processes.

 

Along with this section, I also include the Gallery URL.

 

carlosteixeira_3-1644607169641.png

 

Part 2

 

In the second part of the flow, we also went directly to the base of MongoDB to identify which workflows are running at that moment.

 

In the example below we can identify 1 workflow being executed.

 

Using filters and formulas, we were able to calculate how long this workflow has been running. The first filter identifies if the workflow is in “running” status, goes through the formula tool, and calculates the execution time of this workflow. In the second filter, we check if the workflow is running longer than the stipulated time.

 

In the third filter, and here it was also a request made by the client, we included the workflows that can evade the cancellation rule. For that, in this filter, we include the names of the workflows that enter in the exception cases.

 

carlosteixeira_4-1644607216833.png

 

Part 3

 

At this stage, also going to MongoDB, we were able to identify who is the user responsible for that flow (owner) and create alerts that are sent to administrators and to the user himself/herself that the workflow executed was canceled because it was running longer than the stipulated time, per company guidelines.

 

carlosteixeira_5-1644607239522.png

Part 4

 

In this last step, we create the variables necessary for the system so that we can create an HTTP call and send the necessary credentials for the command to be executed by the system as if we were canceling the workflow manually.

 

So when a workflow is identified with a longer execution time than that stipulated by the company, it will be canceled.

 

carlosteixeira_6-1644607558834.png

 

Implementation

 

For this workflow to work, we implemented it as follows.

 

Inside the Server (Alteryx Server) we include the workflow in a way scheduled by the Alteryx Designer to run every 5 minutes. Note that the execution time of the complete workflow is very fast, about less than a minute.

 

carlosteixeira_7-1644607598470.png

 

carlosteixeira_8-1644607611116.png

 

In this way, the workflow does not consume one of the execution slots that is configured in Alteryx System Settings with the number of workflows that can be executed simultaneously.

 

I'll leave the workflow attached in case any of you want to implement or study and even improve the process in your environment.

 

Tip: in the environment where this was implemented, I even included some alerts to anticipate the information for the administrators, for example, receiving an email if the workflow reaches 90% of the pre-set time. This way, the administrator can have an early idea of which workflow(s) will be canceled.

 

Another tip, which I used in the environment in question is that we disable the configuration in Alteryx System Settings to cancel the jobs there.

 

Until this feature is implemented in new versions of the Alteryx Server so that regardless of how the workflow is started it falls into this cancellation rule, this is a solution that has served the customer well and can be implemented quickly.

 

Some processes were implemented by the client's team in order to have better governance of the environment, such as requesting the inclusion of workflows in exceptions so that they can run at any time regardless of the stipulated time.

 

With this workflow process of great relevance to the business, they were included in this list of exceptions, and we also evaluated together with the user the best time for it to be scheduled or executed.

 

That's it. Any questions or curiosity about this workflow and how to implement it feel free to contact me, I'll be happy to help.

 

Cheers

 

Carlos A Teixeira - Alteryx ACE

Comments
MattBenj
9 - Comet

Thanks for sharing @carlosteixeira! I like the added touch of sending the user an email alerting them of the cancelation. You mentioned:

Until this feature is implemented in new versions of the Alteryx Server so that regardless of how the workflow is started it falls into this cancellation rule, this is a solution that has served the customer well and can be implemented quickly.


Do you know if this enhancement is on the roadmap for future versions?

 

Thanks!

carlosteixeira
15 - Aurora
15 - Aurora

Hey @MattBenj  how are you?
Look, what I know is that there are some requests about including this cancellation for both scheduled and manual, but I don't know when and if it will be implemented in the next versions.


We hope so.


Here are some links with this idea requested by some users:

 

https://community.alteryx.com/t5/Alteryx-Server-Ideas/Configuration-to-set-the-Timeout-of-a-manual-r...

 

Thanks

fmvizcaino
17 - Castor
17 - Castor

Great content @carlosteixeira ! Very helpful

NicolasSz
11 - Bolide

Thanks a lot for your solution. This will I guess the trick until Alteryx supports it.

This is quite a basic feature for a server product ! 

Ariharan
11 - Bolide

Hi Carlo,

 

Thanks a lot for your idea. In our case completion time is not updated for running/queued jobs. We tried to use the Alteryx server usage report data parser (As Result & As queue) and we found that it does not contain execution start time and queue time values for running/queued jobs. When jobs are running/queued, it is captured as null. 

In the example given, there are four scheduled jobs running in the server at the moment and one manual job is being triggered by the user and is in a queued state for more than 30 minutes to start running. However, in MongoDB, when a job is triggered that time is captured as the start time (not considering the queue time). If we consider the start time of As_Queue in our workflow, we ended up with a mess. Since that manual job is queued for 30 minutes and running for only 3.30 hours, it is being killed by our workflow. It should only be killed after 4 hours.  

 

  • How did we determine the total queue time & the execution start time for the running/queued jobs? 
  • How do I kill this job automatically after four hours while taking queue time into consideration?

Thanks & Regards, 

Ariharan.R

Ross_K
7 - Meteor

Hi @carlosteixeira,

 

Love the solution. One issue we encountered is that if all of the nodes are bogged down, this workflow cannot run :P

So instead we are using Python to monitor the current active workflows by querying the queue on Mongo. We then need to shut down any that exceed the time and send the appropriate notification emails.

Querying Mongo is easy, constructing the url is easy, but we don't seem to be able to authenticate to cancel it.

For the normal API endpoints we use the api keys, generate an OAuth token and use that. From your workflow it looks like you are using basic auth with a username and password?

We have tried this basic auth api keys and with username/password login credentials both for the user who owns the workflow and for admin, the result is always the same: 401 error, {"data":null,"exceptionName":"UnauthorizedException","innerExceptionMessage":"","message":"User is not authorized to perform operation."}.

Any suggestions on which credentials we should be using here or any specific permissions they need?

 

Thanks :)

Ariharan
11 - Bolide
Hi @Ross-K,
 
When all nodes are down, manual jobs do not run in the Gallery right.? 
 
One more suggestion, instead of using a Gallery schedule, you can use a Legacy scheduler (Designer). 
 
Regards, 
Ariharan.R
 
 
 
Ross_K
7 - Meteor

@Ariharan,

It is not that the nodes are "down", they are simply occupied running workflows. If another workflow is executed (either manually or on schedule), it will be queued. this would include any workflow intended to shut down any long-running workflows :P

This is not an issue with the Gallery scheduler, that works well. There is not need to return to the legacy scheduler.

AndrewSu
Alteryx
Alteryx

@carlosteixeira , i have a partner asking the following after I highlighted the portion of your solution where you are scheduling via Designer and not server.

 

"This would then require a PC with a designer license to be online, connected to VPN, 24x7, correct? This isn’t a feasible requirement for us given our security + VPN settings. We’d need to leverage our jump box for this but there were licensing concerns about putting a designer license on a shared box."

 

I believe their understanding is wrong because this workflow has to be run on Designer on the machine that's hosting the Server correct?

carlosteixeira
15 - Aurora
15 - Aurora

Hi @AndrewSu 

 

Exactly, the understanding is wrong. They just need to configure the workflow directly on the designer installed on the Alteryx server machine. Isn't necessary another machine with the Alteryx designer installed. I was running this workflow directly on the Alteryx Server.

 

 

Best Regards

 

 

MattBenj
9 - Comet

Is there a specific version of server you must have for this to work?

carlosteixeira
15 - Aurora
15 - Aurora

Hi @MattBenj 

 

Not at first, maybe you need to tweak some things but it should work in any version

 

Thanks

adarsh2707
8 - Asteroid

Hello @carlosteixeira 

 

Whenever I run this from my server, It keeps throwing the error - authentication error: generic server error. Do you know what that indicates

adarsh2707
8 - Asteroid

Hi All,

 

I have a doubt,

my workflow is running till the last step but at the download step it reads the below error

adarsh2707_0-1669207161473.png

Can someone please help me in resolving this error

 

arun0721
5 - Atom

@carlosteixeira Thanks for sharing this workflow. However I always get a 401 during executing the api to cancel jobs, I am running it on the server directly. I see that @Ross_K also encountered this. Any idea or suggestion for resolution would be great.

Thanks

Ross_K
7 - Meteor

Hi @arun0721 ,

 

We ended up implementing a custom solution for workflow timeouts that did not involve having to run a workflow to cancel other workflows. Instead it uses MongoDB queries to identify how long workflows have been executing and a front-end http request to cancel any that have been identified as exceeding specified timeout limits.

Bargeton
7 - Meteor

Hello,

 

On my side, this workflow retrieve my jobs running more than 1 hour, but it killed nothing...

 

When checking the worflow, it does a call to this url:

[base_url_login]+'api/jobs/'+[JobId]+'/cancel/'

 

But:

1) I see nothing to authenticate to the webapi with API key & secret into this workflow.

2) From my server, checking the webapi documentation

https://my_alteryx_server/webapi/swagger/ui/index

And I have to switch to doc/1 for "job" api... and there is no "cancel"

I see only 2 points:

 

- get /v1/jobs/{id}/output/{outputId} 

Get output for a given job

 

- get /v1/jobs/{id} 

Retrieves the job and its current state

 

So I'm wondering if this workflow can still work in 2023 with last version(s) of the Alteryx Gallery.

 

Regards,

 

Alex

carlosteixeira
15 - Aurora
15 - Aurora

Hi @Bargeton how are you doing?

 

So, I dont test it yet in the 2023 version.

Probably the applications should be managed to work.

I'm planning tod o it in a few days forward. 

If I found the solution I update this post.

SeanAdams
17 - Castor
17 - Castor

thank you @carlosteixeira - this is very useful and necessary for large server implementations! 

 

cc: @moinuddin 

carlosteixeira
15 - Aurora
15 - Aurora

Thanks @SeanAdams.

 

It's been a while since I implemented this and it's still working for my client.
But unfortunately, I haven't tested it for the new versions yet.
It's on my to-do list for this year. hahahahaha

 

Thanks again

Arti2893
6 - Meteoroid

Can you please explain what parameters are being passed to download tool in detail

 
 
 

cancel.PNG