
Alteryx Server Ideas


Partitioning queries to run on multiple workers

We get a lot of comparisons between DataStage ETL and what Alteryx can do. DataStage can partition queries and run them on multiple worker nodes, making execution more efficient and faster than running on a single worker. We want a similar capability in Alteryx; our case study shows that Alteryx is currently 10X slower than the same query run on DataStage through the grid.

17 Comments
Atabarezz
13 - Pulsar

I would love to... see the lightning speed :)

fharper
12 - Quasar

I did some research on DataStage and it is a more complex beast.  I found nothing on partitioning queries, which I initially took to mean partitioning the SQL.  It looks to leverage an underlying infrastructure you must invest time and money into setting up, which defines an environment that lets you design queries that more effectively attack data that is partitioned in parallel.  Additionally, it is designed both to extract and to write or update data stores in this way.  Alteryx leverages ODBC drivers, and with the In-DB tools leverages some more sophisticated functions, but I do not believe Alteryx can take advantage of certain bulk-load features, like dropping tables and loading without indexing, which are techniques leveraged when removing or adding large amounts of data.

 

So what I see as the differences between Alteryx and DataStage are the whole underlying infrastructure and meta-layers connecting and managing data connections, and the features supporting parallelism in job execution, the latter being query replication, not query partitioning.  With that in mind, the Alteryx team should review some of the existing discussions on parallelism on the community.

https://community.alteryx.com/t5/Alteryx-Product-Ideas/Ability-to-execute-tools-in-parallel-within-t...

https://community.alteryx.com/t5/Alteryx-Connect-Gallery/Running-batch-macro-in-parallel/m-p/58678#M...

https://community.alteryx.com/t5/Alteryx-Product-Ideas/Parallel-container-amp-serial-container/idi-p...

 

I may not be tracking this right, but if the main idea is to increase the performance of Alteryx, then the first place to look is the current serial nature of processing within workflows.  I automate my workflows with a scheduler I built that has the flexibility and variety required to manage many jobs with a variety of schedule criteria and their results.  This allows me to break up what would ordinarily be large workflows running serially through tools, extracting from multiple sources with initial blending, into smaller flows that extract and process their specific inputs; when they complete, the automation launches the next flow to join and further process the data.  By introducing parallelism this way I reduce run time, usually by half and sometimes by more.
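To make that concrete, here is a minimal sketch in Python of the kind of dependency-driven launching I mean. It assumes workflows can be run from the command line via AlteryxEngineCmd.exe; the install path and workflow names are placeholders, not my actual setup.

```python
# Sketch: run independent extract flows in parallel, then launch the join flow.
# Assumes workflows can be invoked via AlteryxEngineCmd.exe; all paths are placeholders.
import subprocess
from concurrent.futures import ThreadPoolExecutor

ENGINE = r"C:\Program Files\Alteryx\bin\AlteryxEngineCmd.exe"  # assumed install path

EXTRACT_FLOWS = [
    r"C:\flows\extract_sales.yxmd",
    r"C:\flows\extract_customers.yxmd",
    r"C:\flows\extract_inventory.yxmd",
]
JOIN_FLOW = r"C:\flows\join_and_process.yxmd"

def run_flow(path: str) -> int:
    """Launch one workflow and return its exit code."""
    return subprocess.run([ENGINE, path]).returncode

# Phase 1: the extract flows do not depend on each other, so run them in parallel.
with ThreadPoolExecutor(max_workers=len(EXTRACT_FLOWS)) as pool:
    codes = list(pool.map(run_flow, EXTRACT_FLOWS))

# Phase 2: launch the downstream join/blend flow only after every predecessor succeeded.
if all(code == 0 for code in codes):
    run_flow(JOIN_FLOW)
else:
    raise RuntimeError(f"One or more extract flows failed: {codes}")
```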

 

In another case I have jobs where the flow is not "partitioned" or subdivided as previously described, but where the attack on the data is partitioned.  This is similar to what I thought htkonkl was referencing, but the language led me to believe DataStage dynamically partitioned the actual query, which it does not as far as I can tell.  In this scenario the flow may have only one input query, so the serial tool processing of Alteryx is not the issue but rather the potential volume of data.  For this we logically/physically divide the attack on the data by dynamically altering the query to look for a smaller, more manageable subset of the data.  I can see where DataStage facilitates this, but it is not automatic per se.

 

For example, if I need to pull in more than 50% of a database it will likely table scan, after wasting time dancing in the indexes.  But if that data is even moderately spread across, say, 6 divisions (by customer code or geographic region), then we make that a parameter to feed the query: put 6 records in a parameter file and feed a single flow as a batch macro or standalone flow, which is faster than waiting for a table scan; or, even better, we physically run the flow 6 times in parallel, each instance getting its own parameter.  The latter approach is considerably faster than the former batch-macro approach, and I believe it approximates what DataStage facilitates and can match its speed in those scenarios.  The problem is that we have to design the partitioning manually, and you must have a good scheduler to execute the potentially many flows running in parallel and as predecessors.  This is an advantage for DataStage, especially in scheduling parallel instances of jobs, I think.
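Here is a rough sketch of that partitioned attack in Python, just to show the mechanics: the same query template runs once per division code, in parallel. The DSN, table, and column names are placeholders, not my real environment.

```python
# Sketch: partition the "data attack" by running one query per division code in parallel.
# The DSN, table, and column names below are hypothetical placeholders.
import pyodbc
from concurrent.futures import ThreadPoolExecutor

DIVISIONS = ["NE", "SE", "MW", "SW", "NW", "INTL"]   # e.g. 6 region/customer codes
QUERY = "SELECT * FROM sales_history WHERE division_code = ?"

def pull_partition(division: str) -> int:
    """Run the query for one partition on its own connection; return the row count."""
    conn = pyodbc.connect("DSN=warehouse;UID=report;PWD=secret")  # placeholder DSN
    try:
        rows = conn.cursor().execute(QUERY, division).fetchall()
        return len(rows)
    finally:
        conn.close()

# Each instance gets its own connection and its own predicate, so the database can
# use an index on division_code instead of falling back to one huge table scan.
with ThreadPoolExecutor(max_workers=len(DIVISIONS)) as pool:
    counts = dict(zip(DIVISIONS, pool.map(pull_partition, DIVISIONS)))

print(counts)
```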

 

One other thing that would greatly enhance throughput speed is designing a means to replicate what is called BatchPipes on a mainframe.  I redesigned a batch billing system some years back, making it scalable and highly efficient by leveraging 3 things:

  1. A robust scheduler (common to mainframes)
    1. Needs logging, results tracking, auto-restart ability, and dependency/predecessor features
  2. Symbolic substitution (to dynamically alter/control job names)
    1. Ability to have one module with a specific name be dynamically renamed to an instance name, like the module name with a serial-number suffix, and have the instance name tracked in the scheduler.
  3. Batch pipes
    1. The function of BatchPipes is to join different jobs that share output as input: one is the writer and the other the reader.  In a standard environment the writer must completely finish and close the output file before the reader can start, open the file, and begin reading.  But if they share a virtual connection, a pipe between a writer and one or more readers, then they can run in near parallel, since the only delay is the time for the shared buffer to flush downstream.  This allowed us to reduce I/O and storage use significantly (see the sketch after this list).
    2. This is something I have used on IBM mainframes and know was available on smaller IBM platforms, but I do not know if anyone has built it for Windows and other MS platforms.
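To illustrate the BatchPipes idea in item 3, here is a tiny Python sketch where the writer "job" and reader "job" are just two threads sharing a pipe; on a real system they would be separately scheduled jobs connected by something like a named pipe, so treat this as a toy model only.

```python
# Toy model of BatchPipes: the reader starts consuming records while the writer is
# still producing them, instead of waiting for a completed intermediate file.
import os
import threading

read_fd, write_fd = os.pipe()

def writer_job():
    # Producer: streams records into the pipe instead of landing a full file first.
    with os.fdopen(write_fd, "w") as out:
        for i in range(1_000_000):
            out.write(f"record {i}\n")

def reader_job():
    # Consumer: starts reading as soon as the shared buffer flushes downstream.
    processed = 0
    with os.fdopen(read_fd) as incoming:
        for line in incoming:
            processed += 1
    print(f"reader processed {processed} records")

w = threading.Thread(target=writer_job)
r = threading.Thread(target=reader_job)
w.start(); r.start()
w.join(); r.join()
```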

In terms of Alteryx:

  1. Enhance the scheduler a bit.
  2. Build a symbolic substitution feature into the Alteryx environment.  If you think about it, this would be kind of like the alias or "saved connection name" feature of ODBC connections.
    1. I can see a couple of approaches, but I am thinking to build the feature into the scheduler, most likely: reference a module as named, dynamically rename it according to a rule or definition for that scheduled item, and track it.  It is more the scheduler that needs to know where to get the code/module, then rename it and track it within the scope of execution and logging (see the sketch after this list).
  3. I can only hope some of the genius types at Alteryx can find a way to build out a BatchPipes-style system in the MS OS world.  Or go to IBM, or maybe BMC who sold it to IBM, and partner with them to build a small-box version.
    1. I see this working as a means to
      1. run separate flows in virtual parallel to reduce wall-clock time on flows with dependency relationships
      2. allow macros to be run in virtual parallel to accelerate a flow
      3. possibly redesign some tools to leverage this within a flow to accelerate a flow
  4. One other parallelism item in the community I recall now is opening up multi-threading within a flow.  The idea is that a flow with 3 inputs can read them all at the same time, and downstream tools execute in parallel up to the point of a join or union.  This would address, in a different way, some of the items above.  However, it may not be the ideal solution for managing resources compared to a more structured and intentional scheduler-driven approach.  It is perhaps better for standalone designers, but in an automated environment, especially on servers, it may be harder to anticipate resource consumption when workload-balancing hundreds or thousands of jobs.  I think it can be part of the total and is very appealing, but I would strive to deliver items 1-3 above first.
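And here is a rough sketch of the symbolic-substitution idea from item 2, assuming the scheduler simply stamps out an instance-specific copy of the stored module before launching it; the names and paths are hypothetical.

```python
# Sketch of symbolic substitution: the scheduler takes a module by its generic name,
# stamps out an instance-specific copy with a serial-number suffix, and tracks that
# instance name for logging/restart.  Paths and names are hypothetical.
import itertools
import shutil
from pathlib import Path

_serial = itertools.count(1)
instance_registry = {}   # instance name -> source module, tracked by the scheduler

def materialize_instance(module_path: str) -> Path:
    """Copy <name>.yxmd to <name>_NNNN.yxmd and record the instance for tracking."""
    src = Path(module_path)
    instance_name = f"{src.stem}_{next(_serial):04d}{src.suffix}"
    dst = src.with_name(instance_name)
    shutil.copyfile(src, dst)              # the scheduler launches this copy, not the original
    instance_registry[instance_name] = str(src)
    return dst

# Two scheduled runs of the same module become two distinct, trackable instances,
# e.g. billing_extract_0001.yxmd and billing_extract_0002.yxmd.
```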

This is all my take, and I may be off track from what others above were driving at, but I have not seen any response correcting my understanding of DataStage, so I am thinking this is solid.

 

And I have one last off-topic question: why are we limited to the MS OS?  It would be nice if Alteryx ran on UNIX/Linux, which opens up the macOS world in addition to the fairly notable Linux and Unix market, including VMs on mainframes.  Just a thought.

 

Julien_B
8 - Asteroid

I would :)

TonyaS
Alteryx
Alteryx

Hello, I am the TPM for Engine and we are currently doing an E2 Beta Program.

 

If anyone is still interested in enrolling, please let me know. 

Atabarezz
13 - Pulsar

please count me in @TonyaS 

KylieF
Alteryx Community Team
Alteryx Community Team
Status changed to: Under Review

Updating this idea's status back to Under Review, to be in line with past product updates.

KylieF
Alteryx Community Team
Alteryx Community Team
Status changed to: Not Planned

Thank you for posting to the Alteryx Community! Your idea is interesting to us, but we have determined that we are unable to place this idea on our road map for the product due to several factors. However, should we be able to return to your idea in the future we will update the status back to Under Review.