
SOLVED

workflow runtime bottleneck

talc_87
6 - Meteoroid

Hi

We have a problem with Alteryx workflow runtime.

My team performs ad hoc data analysis: a user asks for a specific analysis one time only, or at a very low frequency (say, once a year).

Usually we receive the raw data as txt or csv files.

 

Our current solution is Alteryx Designer, which is installed on the team's laptops.

The problem is that the data keeps growing, and 5 million records is no longer rare.

Since the dataset volume is big, the workflow runtime is long. In some cases we have to let the workflow run overnight.

 

We convert the csv/txt raw data to the Alteryx Database (.yxdb) file format. It helps, but the workflow still needs to run for a few hours or overnight.

 

We are in the process of onboarding Alteryx Server, but I don't think it will solve the runtime problem. We will have to create the workflow, test it, run the entire flow, search for bugs, etc.

All of that is done locally, and only then is it migrated to prod. All for a one-time analysis, and we will still need to run the workflow locally several times while testing it.

Is this correct, or have I misunderstood the Alteryx Server concept?

 

I thought about uploading the raw data into an Azure DB or another cloud DB, and importing the data into the workflow from the DB instead of from the csv/txt/Alteryx DB file.

Would that decrease the running time?

 

I am looking for an Alteryx-based solution where the team can create the workflow, click the Run button, and see the results within minutes.

 

Any other ideas to solve the runtime problem?

 

Thank you, Tal

mceleavey
17 - Castor

Hi @talc_87 ,

 

5 million rows is small data and Alteryx should have no problem churning that. It sounds like you need to check the specs of your machines and the allocated RAM to ensure you're running with enough resources.

You also need to check the actual design of your workflow to ensure you're not doing something which will greatly slow down the processing speed.

Another thing to consider is pulling the data in and compiling it every month (for example) into something like a Calgary database. With that you can query billions of rows in seconds.
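Calgary itself is Alteryx's own indexed format, built and queried with the Calgary tools in Designer, so there's no code involved there. Purely to illustrate the "compile once, query fast" pattern outside of Designer, here is a rough pandas/Parquet sketch; the paths and column names are made up:

```python
# Illustration only: Calgary is an Alteryx-proprietary indexed format, so this
# uses pandas + Parquet to show the same "compile once, query fast" pattern.
# Paths and column names are hypothetical.
import glob
import pandas as pd

# Monthly compile step: fold all raw csv drops into one compressed columnar file.
raw_files = glob.glob(r"C:\data\raw\sales_*.csv")
compiled = pd.concat((pd.read_csv(f) for f in raw_files), ignore_index=True)
compiled.to_parquet(r"C:\data\compiled\sales.parquet")  # requires pyarrow

# Ad hoc analysis step: re-reading the compiled file is fast, and you can pull
# only the columns the analysis actually needs.
df = pd.read_parquet(r"C:\data\compiled\sales.parquet",
                     columns=["customer_id", "order_date", "amount"])
print(df.groupby("customer_id")["amount"].sum().head())
```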

 

There could be many things at play here.

 

If you could share your workflow and some mock data I can have a look for you.

 

M.




talc_87
6 - Meteoroid

Thanks!

So uploading the data to a cloud DB won't solve the problem?

 

My machine specs are:

Intel i5-8350 1.7GHz

16GB RAM

 

I input an Alteryx DB file which has 34,800,000 records and 16 fields. The file weighs 1.17 GB.

It takes 15-20 seconds on average.

How can I know if that's reasonable?

Are there any recommended benchmarks that I can compare my machine to?

 

Thanks, Tal

 

danilang
19 - Altair

Hi @talc_87 

 

Having the results in minutes just might not be possible, depending on how complex the workflow is and how much data is being manipulated.  However, there are some steps that you can take to make it as fast as possible.

 

If you need to read all the data, accessing it from a .yxdb file that is stored locally on your machine is the fastest way. Any other format (csv, Excel) will take longer, and storing the data in the cloud will increase the input time even further.

 

You mention that reading the 34M records from the .yxdb takes about 15-20 seconds. That works out to roughly 2 million records (about 65-70 MB) per second, which is a reasonable rate for that much data.

 

You also state that the workflow runs for hours. If this is the case, the bottleneck isn't the input but the operations that you're doing on the data once it's loaded. There are some questions you can ask to get the run time as low as possible:

- Do you need to process all the data every run, or can you get by with only processing the latest week/month/year?

- Can you store the summarized output from each run and use it in conjunction with the new data? For instance, if you're interested in monthly sales statistics (averages, totals, etc.), you can store these so that they don't have to be recalculated from the raw data in each run. During the current run, you load only the most recent raw data plus the summarized older data, and then store the new summary (see the sketch after this list).

- Maybe the algorithm that you're using can be optimized. Look for any cross joins that may be occurring in your data, etc.
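In Designer the "store the summary" idea is just a Summarize tool whose output is written to a .yxdb that the next run reads back in. Here is a minimal pandas sketch of the same pattern, assuming additive statistics (totals and counts) and made-up file/column names:

```python
# Minimal sketch of "summarize once, reuse the summary": only the newest raw
# extract is processed each run. File and column names are assumptions.
import os
import pandas as pd

SUMMARY_PATH = "monthly_sales_summary.csv"   # hypothetical rolling summary file

# 1. Summarize only the latest raw extract.
new = pd.read_csv("sales_latest.csv", parse_dates=["order_date"])
new_summary = (new.assign(month=new["order_date"].dt.to_period("M").astype(str))
                  .groupby("month")["amount"]
                  .agg(total="sum", orders="count")
                  .reset_index())

# 2. Combine with the stored summaries instead of re-reading every historical
#    raw file. Totals and counts are additive, so duplicate months simply sum up.
if os.path.exists(SUMMARY_PATH):
    history = pd.read_csv(SUMMARY_PATH)
    combined = pd.concat([history, new_summary]).groupby("month", as_index=False).sum()
else:
    combined = new_summary

combined.to_csv(SUMMARY_PATH, index=False)
```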

 

Since your requests are ad hoc, having an Alteryx Server won't help in the process. One thing that might help is putting your data into a powerful database server. In that case you can use the In-DB tools to spread the processing load between the remote database server and your local machine.
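The rough idea behind the In-DB tools, expressed in plain SQL terms (connection string, table and column names below are all hypothetical), is that the heavy grouping runs on the database server and only the small result set crosses the network:

```python
# Rough illustration of the In-DB idea: push the heavy work to the database
# server and pull back only the aggregated result. Connection string, table
# and column names are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "mssql+pyodbc://user:password@myserver/analytics"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

query = """
    SELECT customer_id,
           SUM(amount) AS total_amount,
           COUNT(*)    AS order_count
    FROM   dbo.sales              -- the 34M rows never leave the server
    GROUP  BY customer_id         -- only the aggregate comes back to the laptop
"""
summary = pd.read_sql(query, engine)
print(summary.head())
```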

 

As far as machine specs go, yours fall on the low side. With large datasets, you want the data to be in memory as much as possible. If you don't have enough RAM, the overflow is written to your hard disk, and disk reads/writes are much slower than memory reads/writes. With large datasets you should have at least 32GB of RAM, if not more. Processing speed is also a factor, so you want at least an i7-class processor running as fast as possible. My work machine is an i7 3.6GHz with 32GB RAM.
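If a dataset genuinely won't fit in RAM, another way to keep the working set small (shown here outside Designer, purely as an illustration, with made-up file and column names) is to process it in chunks rather than letting the machine page to disk:

```python
# Chunked aggregation keeps the working set small when the raw file is bigger
# than available RAM. File and column names are hypothetical.
import pandas as pd

totals = {}
for chunk in pd.read_csv("big_extract.csv", chunksize=1_000_000):
    # Each 1M-row chunk is aggregated and released before the next one is read.
    grouped = chunk.groupby("customer_id")["amount"].sum()
    for key, value in grouped.items():
        totals[key] = totals.get(key, 0) + value

summary = pd.Series(totals, name="amount").sort_values(ascending=False)
print(summary.head())
```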

 

Dan

 

talc_87
6 - Meteoroid

Hi

Thank you!
