
Alteryx Designer Desktop Ideas


Ability to execute tools in parallel within the same workflow

Tools within a workflow need to be able to run in parallel wherever applicable.

 

For example: extracting 10 million rows from one source and 12 million rows from a different source to perform blending.

Currently the order of execution is the order in which tools are dragged onto the canvas: Source1 first, Source2 second, and then the Join.

 

Here Source1 and Source2 are completely independent, so they can be run in parallel, saving workflow execution time.
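The requested behavior can be sketched in Python with `concurrent.futures`: two independent "input" steps run at the same time, and a join consumes both results once they are ready. The `read_source1`/`read_source2`/`join` functions are stand-ins for the Input Data and Join tools, not Alteryx APIs.

```python
# Sketch: run two independent reads in parallel, then join them.
from concurrent.futures import ThreadPoolExecutor

def read_source1():
    # placeholder for "extract 10 million rows from source 1"
    return [("k1", "a"), ("k2", "b")]

def read_source2():
    # placeholder for "extract 12 million rows from source 2"
    return [("k1", "x"), ("k3", "y")]

def join(left, right):
    # inner join on the first field, like the Join tool's "J" output
    rindex = dict(right)
    return [(k, v, rindex[k]) for k, v in left if k in rindex]

with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(read_source1)   # both reads start immediately...
    f2 = pool.submit(read_source2)   # ...instead of one after the other
    result = join(f1.result(), f2.result())  # join waits for both inputs

print(result)  # [('k1', 'a', 'x')]
```

The key point is that the join only blocks on `result()`, so total read time is the longer of the two reads rather than their sum.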

 

Execution time is quite crucial when you have a tight data-loading window.

 

Hopefully Alteryx considers this in the next release!

73 Comments
fharper
12 - Quasar

Unfortunately I don't know if they are looking at this seriously.  I raised it in the forum, in direct conversation with some Alteryx reps, and in one of the Tampa UG meetings.  I think it caught some interest but have not received any feedback.  The post says it is "Under Review", but that could mean almost anything.

 

Maybe they will see this exchange and give us a status.  One thing I know is that they consider popularity a factor in prioritization, so if you talk this up with other users and they add their interest on this post, it will get more visibility and possibly gain priority.

 

If you find other posts of the same basic nature, post to them as well and mention this post in them.

BenG
Alteryx Alumni (Retired)
Status changed to: Under Review

Hi Everyone,

 

Thanks for the great discussion on this topic!

 

There are two different topics being covered here that are very important to us.  The first is the idea of bringing in multiple data sources in parallel.  This is becoming more important as data sources grow and many of them are cloud-based or remote.  Reading from files on disk often cannot be made faster by reading multiple files at once, but it is becoming more feasible as more environments move to SSDs.

 

The second topic is that of running a workflow in parallel.  We are conducting some research on this topic to see what is feasible.  One interesting question that has come up as part of this is whether record order is important when the data is not being explicitly sorted.  If I pull records from a DB, then filter, then write to a file, am I expecting the records to be written in a certain order?  What if those records come out in a different order every time the workflow runs?  We will make decisions on these topics, but it would be good to get an idea from the community of how important these things are.  Also, some tools have an inherent sort, such as the Sample tool when using the Group By option.  In this scenario, we might have to add a new setting for the "sequencing" field so that we combine records correctly for the sample.  Does this make sense?  Are there other ways you would describe that kind of setting?
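Ben's record-order question can be illustrated with a small Python sketch: when records are processed in parallel, results can arrive in completion order rather than input order. The timings below are artificial, purely to force the reordering.

```python
# Demonstrate that parallel processing can change record order.
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(record):
    rec_id, delay = record
    time.sleep(delay)          # simulate uneven per-record latency
    return rec_id

records = [(1, 0.2), (2, 0.0), (3, 0.1)]  # (id, simulated latency)

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(fetch, r) for r in records]
    # as_completed yields results as they finish, not as submitted
    arrival_order = [f.result() for f in as_completed(futures)]

print(arrival_order)  # likely [2, 3, 1] -- not the input order

# A "sequencing" field, as Ben suggests, would let downstream tools
# restore a deterministic order whenever it matters:
restored = sorted(arrival_order)
```

Whether `arrival_order` is acceptable as-is, or must always be `restored`, is exactly the trade-off the engine would have to expose as a setting.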

 

Thanks,
Ben


fharper
12 - Quasar

BenG

In a way there are two topics, but from our point of view it is really one: the ability to run two or more tool paths within a given workflow simultaneously.  The two separate inputs were an example of the most common application of this ability, and the one that lends itself to breaking a workflow into multiple workflows to gain parallel-processing benefits externally.

 

But the real request in my view is the basic ability to multi-thread, aka parallel process, down 2 or more paths within a workflow.

 

Example with inputs takes 45 minutes within a single workflow:

  • I read data from a database with SQL, or it could be a flat file read sequentially, it doesn't matter; this read takes 10 minutes
    • I then do some transforms and cleansing, which takes 8 minutes
  • I read data from a different source, and this read takes 12 minutes
  • I then join these two paths, do some final processing, and write output, which takes 15 minutes

I can ensure the output is in a given order within the SQL with ORDER BY, or I can put a Sort tool in the path after the read, so I don't see the order of data, which you mentioned, as an issue.

In the above example we would like Alteryx to support the two input paths running concurrently, so the net wall-clock time to process those two steps is 18 minutes (the longer of the two paths before the join) instead of the 30 minutes required to do them serially.

 

I currently accomplish this by splitting the one workflow into 3 workflows: I run the first two, which read and initially process the data, concurrently, and once they finish I start the 3rd workflow, which does the join and final processing.  While this is effective on long-running flows, it is not as efficient as parallel processing within a single workflow would be, because the time to write output from the first 2 flows and read it back in with the 3rd eats up some of the time savings of breaking them up.  Plus, if you don't have a good automation system and start the 3rd flow manually, you will often miss starting it in a timely manner.
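The wall-clock saving of this split can be sketched in Python: the two "read" workflows run concurrently and hand their results straight to the "join" step, with no intermediate files. The sleeps are scaled-down stand-ins for the 10- and 12-minute reads; none of the function names refer to real Alteryx APIs.

```python
# Sketch: the 3-workflow split, but inside one process with no
# intermediate file writes between the reads and the join.
import time
from concurrent.futures import ThreadPoolExecutor

def workflow_read_a():
    time.sleep(0.2)               # stand-in for the first read + prep
    return {"rows_a": 10_000_000}

def workflow_read_b():
    time.sleep(0.3)               # stand-in for the second, longer read
    return {"rows_b": 12_000_000}

def workflow_join(a, b):
    return {**a, **b}             # stand-in for the join + final output

start = time.monotonic()
with ThreadPoolExecutor(max_workers=2) as pool:
    fa = pool.submit(workflow_read_a)   # both "workflows" start at once
    fb = pool.submit(workflow_read_b)
    out = workflow_join(fa.result(), fb.result())
elapsed = time.monotonic() - start

# elapsed is roughly max(0.2, 0.3), not 0.2 + 0.3 -- the same saving
# fharper gets from splitting the workflow, without the cost of writing
# and re-reading intermediate files between flows.
```

This is the in-engine equivalent of running the first two workflows concurrently and starting the third the instant both finish, with no scheduler gap.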

 

I built a robust scheduler for this and other needs that eliminates the delay in starting a process after another finishes, but the ability to run in parallel within a single flow is still better.  Also, the next example illustrates a case that is not easy to break up into multiple flows while keeping good efficiency.

 

Example with single input takes 54 minutes within a single workflow:

  • I read data from a source and this read takes 12 minutes
  • I then do transforms and cleansing taking 5 minutes
  • I then connect to 2 separate process paths
    • Path one gets the full stream of data, sorts it one way, and does some multi-row tool processing to accomplish a specific result, leveraging the sequential nature of the processing.  This takes 10 minutes
    • Path two does similar work in a different sort order to accomplish a different result.  This takes 12 minutes
  • I then join these two paths and do some final processing and write output which takes 15 minutes

In the above example, if I could process the 2 paths in the middle at the same time, I could shrink my processing time by 10 minutes.

 

There are a number of other examples in our arsenal of flows, but these 2 are enough to show the focus is parallel processing.

 

I know that the number of parallel processes one could support, like multi-threading on CPUs, is limited by resources.  Based on what we do, I see the greatest value in being able to run 2 or 3 paths in parallel; we have conceived of only a few flows that would benefit greatly from more threads than that.  With that in mind it may be an easier nut to crack, since once you figure it out the real limit is not the number of paths but the available resources.  That said, understanding the resources available in such a dynamic workload may be a challenge, so setting a fixed limit may have great value in delivering a solution quickly versus a fully dynamic one.

 

After such a long response I ask your indulgence for a little longer.  Another enhancement I have asked for in conversations and posts is the ability to connect 1) workflows and 2) tools with "pipes".  As I mentioned in the first example, some of the benefit of breaking up a workflow to parallel process some portion is lost in the writing of output by one flow and the subsequent read by the next workflow.  On mainframes there is a product or feature called "batch pipes" that allows two separate programs to logically connect and process nearly simultaneously.  The output of the writing job is not physically written to a hard drive but passes from its output buffer, via the pipe connection, to the other program's input buffer.  They share data in real time, with the reader processing data the writer has just sent out of its buffer downstream.

The use of batch pipes in batch systems dominated by sequential processes has saved lots of money in processing costs and huge amounts of wall-clock time.  This would be a huge value for Alteryx users, as so many of us use Alteryx to process much of our work in flat-file form; even when we connect to a database it is often for extraction, and we subsequently manipulate, transform, and cleanse, all of which is sequential processing.

 

I hope you consider this as well.

chadanaber
7 - Meteor

I am currently more interested in the parallel input enhancement to Alteryx.  However, with regard to running a workflow in parallel, I think you will need to add a configuration for multi-input tools that states whether all inputs must complete before the multi-input tool executes.  Something along the lines of "Execute as a batch" vs. "Execute as streaming".  There is an inherent cost to sorts/uniques, or anything that requires a "batch" view of the data, and that should be taken into account by the developer.
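The batch-vs-streaming distinction chadanaber describes can be sketched with Python iterators: a streaming tool (like a filter) can emit records as they arrive, while a batch tool (like a sort) must hold every record before it can emit the first one. The function names are illustrative, not Alteryx terminology.

```python
# Streaming vs batch semantics for a multi-input/downstream tool.

def streaming_filter(records, predicate):
    # emits each passing record immediately -- constant memory,
    # no need to wait for the upstream tool to finish
    for rec in records:
        if predicate(rec):
            yield rec

def batch_sort(records, key):
    # must materialize the full input before producing any output --
    # this is the "inherent cost" of a batch view of the data
    return sorted(records, key=key)

data = iter([3, 1, 4, 1, 5])
filtered = streaming_filter(data, lambda x: x > 1)   # lazy, no work yet
result = batch_sort(filtered, key=lambda x: x)       # forces the stream
print(result)  # [3, 4, 5]
```

A per-tool "Execute as streaming" flag would essentially choose between these two evaluation strategies, letting the developer pay the batch cost only where the tool's semantics require it.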

Inactive User
Not applicable

I'm having an issue with Alteryx Designer maxing out a single CPU core while the remaining cores are all idle.  I'm not sure why it doesn't make use of all the available cores; this is on a quad-core i7 processor.  Basically it appears to be using only 25% of the CPU.  Would this be related to this discussion or a separate issue?

fharper
12 - Quasar

I would say this is a different issue and perhaps not an issue at all but I leave that to more knowledgeable people at Alteryx. 

 

In my experience running on an i5 quad core I have peaked above 25%, but only on data-modeling workflows.  Alteryx is not normally CPU-intensive unless modeling.  Most non-data-model workflows bounce between 1 and 15% CPU, generally under 10%, but will often consume large amounts of memory depending on what you are doing.  You would need to be doing lots and lots of calculations to go over 25% CPU, and that generally only happens when data modeling.

travis
6 - Meteoroid

This feature would be highly desirable for me as well, both being able to input from the cloud in parallel and to output in parallel.

The_Dev_Kev_Env
9 - Comet

In my opinion, this is an extremely important feature to put on the roadmap. I have recommended Alteryx more and more for "Enterprise" ETL processes, and this is always a bottleneck in those solutions. We create a generic template that utilizes configuration tables, but it ultimately runs the macros in batch rather than in parallel.

 

Closest solution we created was to wrap this in an Alteryx workflow that reads config values, writes X different XML files, writes a .bat file that calls the alteryx.exe X times with the different XML files supplying the specific parameter values (NOTE: you need to use the "start" command to ensure parallel runs), and then cleans up excess files no longer needed. Not very elegant, and it requires extra custom work to maintain/log.
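The .bat + "start" pattern dK describes can be sketched in Python with `subprocess`: launch X engine runs without waiting on each, then wait for all of them. The XML file names are hypothetical, and a trivial Python child process stands in for the real engine command so the sketch is runnable anywhere.

```python
# Launch several runs in parallel, then collect their results.
import subprocess
import sys

configs = ["params_1.xml", "params_2.xml", "params_3.xml"]  # hypothetical

procs = []
for cfg in configs:
    # stand-in for launching the engine with a per-run app-values file;
    # Popen (like "start" in a .bat) returns immediately, so all three
    # runs are in flight at once
    cmd = [sys.executable, "-c", f"print('ran with {cfg}')"]
    procs.append(subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True))

# communicate() waits for each child and gathers its output
outputs = [p.communicate()[0].strip() for p in procs]
exit_codes = [p.returncode for p in procs]
```

The same loop structure covers dK's cleanup step: once `exit_codes` confirms all runs succeeded, the per-run XML files can be deleted.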

 

I believe there is a lot of potential for Alteryx in these projects, as well as a lot of room for the product to grow here. I plan on trying to get a few more ideas out on the forums this month about this!

 

Best,

dK

badun
6 - Meteoroid

Hi Alteryx support!

 

Are there any updates from that perspective? It's really annoying to wait 2-4 hours for AlteryxEngineCmd.exe running on only one core just to read >50M rows from a yxdb.

 

Thanks

fharper
12 - Quasar

At Inspire this week there was talk of full multi-core utilization, among other great advances coming.  But of course the question becomes: when?  I get a sense of a major release in Q1, but we need to get Alteryx to talk timelines.  While CPU parallel processing or multi-threading in the machine will be very welcome when available, it is a tangent to the topic.

 

I would just reiterate that the topic was focused on the ability to run two or more tool paths within a given workflow simultaneously, regardless of how it is accomplished technically.