Hi,
I have a batch macro that takes 30 seconds to generate the required output. I have to run it on 50,000 input data sets. This would mean that I would require 25,000 minutes to complete the process.
I have split the input data into 50 files, each with 1,000 rows, and then ran 50 instances of the same workflow that calls the macro.
This brought down the total processing time to 500 minutes. Is there a better way to achieve this? Running 50 instances is possible, but definitely a pain.
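For reference, the arithmetic behind these figures can be sketched as below. This is only a rough estimate: it assumes the work splits evenly across instances and that the instances do not slow each other down. The function and constant names are made up for illustration.

```python
# Back-of-the-envelope runtime estimate using the figures from the post.
SECONDS_PER_RUN = 30   # one batch macro iteration
TOTAL_RUNS = 50_000    # input data sets
INSTANCES = 50         # workflow instances running side by side


def total_minutes(runs: int, seconds_per_run: float, workers: int = 1) -> float:
    """Wall-clock minutes, assuming an even split and no contention between workers."""
    runs_per_worker = -(-runs // workers)  # ceiling division
    return runs_per_worker * seconds_per_run / 60


serial = total_minutes(TOTAL_RUNS, SECONDS_PER_RUN)
parallel = total_minutes(TOTAL_RUNS, SECONDS_PER_RUN, INSTANCES)
print(f"serial: {serial:.0f} min, {INSTANCES} instances: {parallel:.0f} min")
# prints "serial: 25000 min, 50 instances: 500 min"
```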
My first thought is maybe it can be approached differently. Is it possible to share the macro, or to make some sample data (not real) that represents your situation, before and after?
Please see the sample workflow. It is fairly simple: I want to run a regression on different sets of data. In reality, the data file is about 38 million rows and the product list (the text input in the sample) is about 55,000 rows.
Thanks for looking into it.
Hey @rohanonline
I'll openly admit that your question is beyond my depth - but my first attempt would be to look for options for how to farm this out to a process that can run in parallel (i.e. get it off your desktop / server, and into a cluster).
Not sure if this is an option for you, but it may be worth looking at something like this (credit @DanG_dup_78):
https://community.alteryx.com/t5/Analytics-Blog/Alteryx-and-Microsoft-R-Server-Demo/ba-p/57513
... or if you're on an open-source stack, look for a similar cluster-ready R Server
Sorry I can't answer more specifically, I hope this gives you a line of attack!
I would consider rethinking the problem. Is it possible that you could engineer the macro to run without being batched? In some cases there are multiple ways around a batch macro, and that in itself would speed up your process. Running in parallel will only speed up part of your process, as other parts will remain serial (though still fast). Your CPU and memory limits will also prevent you from running massively parallel.
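The point about only part of the process speeding up is usually described by Amdahl's law. A minimal sketch, assuming a hypothetical split in which 95% of the run time can be parallelized (that fraction is invented for illustration, not measured from this workflow):

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Theoretical speedup when only part of the work can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical: 95% of the run time is parallelizable.
for workers in (1, 10, 50, 500):
    print(workers, round(amdahl_speedup(0.95, workers), 1))
```

Under that assumption the speedup can never exceed 20x no matter how many instances run, because the serial 5% is always left over. This ignores CPU and memory contention, which would lower the real figure further.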
If you would like to review your macro with me, you can PM me with your availability and email and I'll see where I might be able to help. Otherwise, you might want to post your macro or a macro with similar function so that others may comment.
Cheers,
Mark
Hello,
I've merged two duplicate posts together here. One post on this topic appears to have been inadvertently posted in 'Welcome to Community'.
Thanks!
Thanks Sean. I will look for the feasibility for a cluster.
Thanks @MarqueeCrew. I have uploaded a sample in the 4th post in the discussion thread. I was unable to edit my original post to include the example.
Thanks for your offer to help.
That's a lot of model building you've got there!
I suppose that my work-around could apply if you were an R programmer. If the primary UPC were read into (custom) R code, then instead of running this in batch, the regressions could be done within the R tool. Whether or not this speeds things up remains to be seen. I would think that it would run faster, but I wouldn't bet too heavily on it.
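To illustrate the idea of fitting every product's model in one pass instead of one batch iteration per product, here is a minimal sketch. It is written in Python rather than R purely for illustration, it uses a one-predictor least-squares fit as a stand-in for whatever regression the real macro runs, and the row layout (product, x, y) and function names are assumptions, not taken from the sample workflow.

```python
from collections import defaultdict


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x on (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("x has no variation; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_per_product(rows):
    """Group rows by product key once, then fit one model per group."""
    groups = defaultdict(list)
    for product, x, y in rows:
        groups[product].append((x, y))
    return {product: fit_line(points) for product, points in groups.items()}


# Hypothetical toy data: (product, predictor, response).
rows = [
    ("A", 1.0, 3.0), ("A", 2.0, 5.0), ("A", 3.0, 7.0),
    ("B", 0.0, 1.0), ("B", 2.0, 0.0), ("B", 4.0, -1.0),
]
models = fit_per_product(rows)  # maps product -> (slope, intercept)
```

The data is read and grouped once, so the per-product cost is just the fit itself rather than a full batch macro iteration. Whether that carries over to the real 38-million-row case would need to be measured.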
Sorry that I couldn't be of more assistance. Maybe this is an idea for Alteryx?
Cheers,
Mark
Hey @rohanonline,
Not sure whether you got to a solution here, either under your own power or with help from some of the other folks. I ask because I'm wondering whether it would be better to close this thread out with a one-line note about how you worked around or solved this, or to close it out (just give yourself the solution credit) and move it to the Ideas section as an idea for scalable parallel processing on a cluster.