
Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Running batch macro in parallel

rohanonline
10 - Fireball

Hi,

 

I have a batch macro that takes 30 seconds to generate the required output. I have to run it on 50,000 input data sets. This would mean that I would require 25,000 minutes to complete the process.

 

I have split the input data into 50 files, each with 1,000 rows, and then used 50 instances of the same workflow from which the macro is called.

 

This brought down the total processing time to 500 minutes. Is there a better way to achieve this? Running 50 instances is possible, but definitely a pain.

 

16 REPLIES
fharper
12 - Quasar

Not sure if you found a solution you like, but here are some thoughts. First, the main overhead is the file open and close, not so much reading the data. Given that there is little way around the cost of opening and closing files, the best approach is probably to scale the process through parallelism. One caveat: my approach depends on having the Scheduler license, which is available for Designer as an add-on and is also part of the Server product. The Scheduler/API feature allows you to run workflows from the command line.

 

You can build a flow that uses a Directory tool to get a full list of input files, divides the 50k file list into N groups, and then dynamically builds a series of batch files, one for each group of files. The same flow also builds a master batch file that launches all the group batch files. Each group-level batch file runs the actual workflow that reads and processes the files; that workflow uses a Dynamic Input tool to read the files from its group's list and process them serially.
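To make the mechanics concrete, here is a minimal sketch of the two layers of generated batch files, assuming the Scheduler/API (automation) or Server license so AlteryxEngineCmd.exe can run a workflow from the command line; all paths, file names, and the group count are invented placeholders, and the default install path may differ on your machine.

    rem master_launch.bat (generated by the flow): kick off one batch file per group
    start "group01" cmd /c C:\jobs\groups\run_group_01.bat
    start "group02" cmd /c C:\jobs\groups\run_group_02.bat
    rem ...one start line per group...

    rem run_group_01.bat (also generated): run the workflow that reads group 01's file list
    "C:\Program Files\Alteryx\bin\AlteryxEngineCmd.exe" "C:\jobs\ProcessGroup01.yxmd"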

 

In this way you have taken a list of 50k files and divided it into N groups, say 100 groups of 500 files each. Each group list has its own job to process it, and those jobs can run simultaneously, depending on the resources on your system and how sophisticated you want to get.

 

The number of workflows you can run on a machine varies according to the resources available, memory perhaps most of all. If you have several machines you can remote into from each other, you can build the batch to run some of the group jobs on different machines; both the command line and PowerShell support this. In a single-machine implementation, given the potential number of jobs, you might want to throttle the batch submissions so that no more than 15 or so fire at a time. This can be done several ways, for example by using the "wait" option on every 15th or 20th submission of a group batch; other, more complicated ways are available but too lengthy to explain here.
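The simplest version of that throttle is a loop in the master batch that uses "start /wait" on every 15th submission, so jobs launch in waves of roughly 15 rather than keeping exactly 15 running; this is only a rough sketch, and the group folder and wave size are placeholders.

    @echo off
    setlocal enabledelayedexpansion
    set COUNT=0
    for %%F in (C:\jobs\groups\run_group_*.bat) do (
        set /a COUNT+=1
        set /a MODV=COUNT %% 15
        if !MODV! equ 0 (
            rem every 15th submission blocks until that group finishes
            start "group !COUNT!" /wait cmd /c "%%F"
        ) else (
            start "group !COUNT!" cmd /c "%%F"
        )
    )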

 

If you don't have the Scheduler/API option on your license, or if you don't have the Server product, you could make your workflow an analytic app, dynamically generate the XML of app values to feed it, and then basically do the above, submitting the app instead of the workflow en masse.
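As a hedged sketch of the app route (assuming you do still have some way to run from the command line): AlteryxEngineCmd.exe accepts an analytic app plus an XML file of its question values, so each generated values file can point a run at a different slice of the input. The app name, paths, and values files below are invented, and the exact XML content depends on your app's interface questions.

    set ENGINE="C:\Program Files\Alteryx\bin\AlteryxEngineCmd.exe"
    rem one run per generated app-values file, each naming a different group of files
    start "grp01" %ENGINE% "C:\jobs\ProcessFiles.yxwz" "C:\jobs\values\group01.xml"
    start "grp02" %ENGINE% "C:\jobs\ProcessFiles.yxwz" "C:\jobs\values\group02.xml"
    rem ...and so on, throttled the same way as above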

 

In essence you can fairly easily set up a means to submit multiple parallel instances of a job, each processing its own subset of the files.


 

 

fharper
12 - Quasar

Forgot to cover something: if you divide the list into groups, whether within a single file or in separate files per group, you need a way for multiple concurrent instances of the same workflow to read the list without conflict or duplication. That is doable with flat files but takes some effort, mainly in timing the job launches and probably using a Block Until Done on the read.

 

I would probably read the file list, flag the first N rows as processed, write the list back out (using a Block Until Done to ensure that happens first), then filter off the unselected rows and run the rest of the flow to read and process the selected files. Stagger the submission of the batch jobs running the instances, with a pause of 30 seconds or so, to allow each instance to grab its portion of the list and write the updated file list before the next instance reads it. Within 5 or 10 minutes you have 10 instances running their groups of files.
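A sketch of that staggered launch in a single batch file, again assuming the command-line engine is available; the instance count, pause length, and workflow path are placeholders.

    @echo off
    set ENGINE="C:\Program Files\Alteryx\bin\AlteryxEngineCmd.exe"
    for /l %%I in (1,1,10) do (
        start "instance %%I" %ENGINE% "C:\jobs\ProcessNextGroup.yxmd"
        rem give this instance time to claim its rows and rewrite the shared list
        timeout /t 30 /nobreak >nul
    )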

 

That said, a preferable approach might be to write the file list to a table: build yourself a Postgres (or other free database) with a file-list table that has added columns for a processed flag, a timestamp, and any other tracking info you may want, and take advantage of record locking. In this scenario grouping is not necessary. The workflow that processes the files can be launched N times, and each instance will read through the same table, selecting the next record that is not locked and is flagged as not processed.
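For what the record-locking piece usually looks like in Postgres, here is a hedged, untested sketch (table and column names are invented, and you would likely issue this from a pre-SQL statement or a small script rather than a plain Input tool): the standard work-queue pattern is FOR UPDATE SKIP LOCKED, which lets parallel instances each claim a different unprocessed row.

    rem claim_next_file.bat: ask Postgres for the next unprocessed file and mark it claimed
    rem (FOR UPDATE SKIP LOCKED keeps concurrent instances from grabbing the same row)
    psql -d filequeue -t -A -c "UPDATE file_list SET processed = true, claimed_at = now() WHERE path = (SELECT path FROM file_list WHERE processed = false ORDER BY path LIMIT 1 FOR UPDATE SKIP LOCKED) RETURNING path" > C:\jobs\next_file.txt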

 

Whether you have 10k or 50k files doesn't matter; throughput depends on how many parallel instances are running. If you run 10 instances, each will chew its way down the list, effectively dividing the 50k into 5k files per instance. In reality, since files vary in size, the distribution per instance will vary somewhat: instances working through smaller files will dip into the list faster, while those that hit larger files will dip in at a slower rate.

 

I have not tried the DB method and I am not entirely sure it would work as I describe, because I am not sure the input tools that read the table allow opening for update with record locking. So it may be something where you load a table instead of a flat file but process essentially the same way as the flat-file method I describe, staggering the launches to allow each instance to grab its group of files.

 

Ozzy_Campos
8 - Asteroid

@rohanonline 

 

Did you get an answer to this? 

 

I'm in the exact same situation: I need to run a process 33k times, several times per week, and it takes a few seconds to run each one.

 

Did you switch to a computing cluster to run it? I borrowed a high-speed computer to do mine, but it wasn't much faster than my run-of-the-mill computer, so I wasn't sure a computing cluster would do much better if it's still subject to the same bottlenecks.

fharper
12 - Quasar

I still do a flat-file divide-and-conquer approach; I haven't reached a point that forced me to do otherwise. As for getting more resources (a faster machine, more memory, or a cluster), it is not a straightforward equation: more horsepower does not necessarily increase throughput, and usually not in a linear fashion.

 

The activities of "file open" and "file close" are the largest overhead in flat-file processing when the files are small (low record count), so opening and closing is where most of the overhead sits when processing 33k files of a few hundred or thousand rows each. You can typically read the same number of rows from a single file far faster than the same rows distributed across 33k files.

 

More memory or CPU does almost nothing for the actual speed of opening or closing a file. Memory can help once you are reading, especially if you are using multi-row tools or other memory-hungry logic (heavy joins and such). Alteryx doesn't really eat CPU unless you are doing a few specific things like data modeling.

 

Using the flat-file approach, I have a job, Job A, that when first set up took 1:39 (an hour and 39 minutes). I created another job to break up its file list of ~5k files into n groups of 1,000 using the Output tool's maximum-records-per-file configuration. That new job then read in Job A itself (there is a special way to configure that and to save the result), modified the input file name in the XML, and wrote the XML back out as Job A1 through Job A6, or however many are needed. Finally, it wrote a batch file to execute each of the jobs just built in parallel; I have Server, so the API or command line can be used to execute things.
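The generated launcher might look something like this sketch; the paths, clone count, and done-flag convention are placeholders rather than the actual generated files. Each clone drops a flag file when it finishes so a tracking or cleanup flow can find it.

    set ENGINE=C:\Program Files\Alteryx\bin\AlteryxEngineCmd.exe
    rem run the clones in parallel; each writes a done flag for the tracking flow
    start "JobA1" cmd /c ""%ENGINE%" "C:\jobs\JobA1.yxmd" && echo done > C:\jobs\track\JobA1.done"
    start "JobA2" cmd /c ""%ENGINE%" "C:\jobs\JobA2.yxmd" && echo done > C:\jobs\track\JobA2.done"
    rem ...JobA3 through JobA6...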

 

I also added a tracking file for Job A so I know when it is done and can see stats.

 

I have another job that runs every 10 minutes and looks for files to clean up (I do a fair amount that is dynamically generated, so this handles my cleanup). I added logic to look for and read the tracking files, so when the jobs finish I get an alert, and the dynamically generated Job A1 through An are deleted, as are the tracking files. It is then ready to run again the next day.

 

Run time went down to about 20 to 25 minutes depending on infrastructure load; our network can be slow at times.

 

Options increase if you have Server or the Scheduler/API add-on license on a desktop, because they give you ways to automate running a yxmd by API or command line. If you have neither, you can still build the job that breaks up the list and creates the n job clones, each preset to its own portion of the list, then manually start each one and let them run, and delete or reuse them depending on your situation. There are many ways to skin the cat. The CReW macros include several "runner" macros that let you fire a basic yxmd from another flow easily, but they are all limited to running serially rather than firing a number of jobs in parallel. Still, you might find a way to make one of them work for you if you dissect the macro and tweak it.

Ozzy_Campos
8 - Asteroid

@fharper Thanks for the response. This is the third time I've built a big project in Alteryx only to end up shifting it to R. In this case it's easier to break up the data and dedicate separate computing clusters to subsets of the 33k through script. I don't have Alteryx Server, so a bunch of workflows that someone else will end up having to manage probably isn't going to work.

 

Been a good learning experience though.  

Dkundu
6 - Meteoroid

Hello - I have a use case for executing an Alteryx workflow in parallel. The workflow is parameterized by a single element; it pulls data from Postgres and Teradata, performs a comparison, and creates an output file. The workflow is highly optimized, but the comparison still takes 1 hour because the average data volume is 50m rows. We now want to execute the workflow in parallel, say 5 parallel processes each passing a different parameter, so that we can perform 5 comparisons in 1 hour rather than spending 5 hours. Could you please let me know how to do this in Alteryx?

 

fharper
12 - Quasar

@Dkundu this should probably be a new thread, but I see the similarity. Sorry for the late response; I got busy and did not look at this until now. If you have not solved this yet, I have some questions; if you did, it would be nice if you shared your solution so others can benefit.

My questions

  1. Do you have a Server to work with, or Designer with the Scheduler/automation add-on license?
  2. Will the 50m rows change based on the parameter you are passing for comparison?
    1. Is the parameter used as selection criteria in a dynamic query (Dynamic Input tool, maybe)?
      1. Or is it a value to be compared to values within the 50m rows of data extracted by the query?
    2. If you would be reading the same 50m rows in each iteration/instance of the job and simply comparing a different parameter value against that same data, then you may not need parallelism.
  3. Did you do performance profiling to see where your time and resources are spent?
    1. Extending a bit on Q1, it may be more beneficial to split the job into two parts, mainly if the parameter does not change the query.
      1. Part One is the query where you get the 50m rows and write them out, maybe to a YXDB; YXDBs are very efficient flat files and Alteryx processes flat files very fast.
        1. Especially if each parameter would be applied to the same 50m rows initially extracted by the query.
      2. Part Two would read in the YXDB copy of the 50m rows and process the parameter against it.
        1. This Part Two might be replicated to effectively provide parallelism (see the sketch after this list).
          1. YXDBs being flat files that do not lock on read, you can have any number of duplicates of the job running in parallel, each reading the data in and processing its own parameter value.
      3. This reduces the workload on the data server being queried, since the query (Part One) happens only once instead of N times concurrently.
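Here is the sketch referenced above: a hedged outline of the two-part split, assuming the command-line engine is available and using invented workflow, app, and XML names. Part One stages the 50m rows to a YXDB once; Part Two is then launched several times in parallel, each run fed a different parameter through its own app-values XML.

    set ENGINE="C:\Program Files\Alteryx\bin\AlteryxEngineCmd.exe"
    rem Part One: run the heavy query once and stage the rows to a shared YXDB
    %ENGINE% "C:\jobs\PartOne_ExtractToYxdb.yxmd"
    rem Part Two: five parallel comparisons against the staged YXDB,
    rem each with its own parameter in a separate app-values XML
    start "cmp1" %ENGINE% "C:\jobs\PartTwo_Compare.yxwz" "C:\jobs\values\param1.xml"
    start "cmp2" %ENGINE% "C:\jobs\PartTwo_Compare.yxwz" "C:\jobs\values\param2.xml"
    rem ...param3.xml through param5.xml...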

Since I am short on time, my point here is that a rethink of the design might bring greater gains. Parallelism is valuable, but if your query is a heavy load on the data server, or that server is frequently burdened by its aggregate workload, then multiplying the number of queries, especially long-running duplicate queries, can create or exacerbate problems where a simple rethink may offer more efficiency.

 

But since I do not know your query or the variance in parameters for each instance, I can't really say definitively what the best option is. Feel free to share the workflow if it doesn't reveal privileged info, or contact me directly.
