Advent of Code is back! Unwrap daily challenges to sharpen your Alteryx skills and earn badges along the way! Learn more now.
Free Trial

Alteryx Server Ideas

Share your Server product ideas - we're listening!
Submitting an Idea?

Be sure to review our Idea Submission Guidelines for more information!

Submission Guidelines

Partitioning queries to run on multiple workers

We get a lot of comparisons with DataStage ETL and what Alteryx can do. DataStage has the capability to partition queries and run on multiple worker nodes to allow the execution to be more efficient and faster than if it ran on one worker. We want similar capability on Alteryx as our case study shows right now that Alteryx is 10X slower than a query ran on DataStage through the GRID.

17 Comments
KoryC
Alteryx Alumni (Retired)

Hi @htkonkl - thank you for posting the idea. While there are some possible ways of accomplishing this (breaking out parts of the workflow into individual workflows, and running those through the server), you're correct that we don't have any auto-partitioning out of the box today. We don't specifically have this targeted at the moment to solve, but I've updated the idea to "comments requested" and looking forward to seeing comments from others wanting this as well.

KoryC
Alteryx Alumni (Retired)
Status changed to: Comments Requested
 
SeanAdams
17 - Castor
17 - Castor

This makes a lot of sense, especially for very large loads.   We've done this the hard way (by breaking the data set down into 10k row chunks), but this would be much easier if this could be broken down automatically.

Not sure if this is on the roadmap for the new engine, given that the new engine is designed to be multi-threaded at core?

Atabarezz
13 - Pulsar

looking forward for E2...

Atabarezz
13 - Pulsar

If the data resides on db, we may have to separate and work the data there... How about that?

fharper
12 - Quasar

I would like details on the case study.

 

Correct me if I am wrong but DataStage ETL is a platform from IBM for ETL and typically is deployed with GRID architecture and may or may not have massively parallel DB (MPPDB) under the covers.

I believe Datastage provides ETL tools/language features to build and deploy jobs to do work, the GRID is a resource and workload management system to run those jobs.  the Grid will effectively run a job where resources, cpu & memory mainly, are most available so a job may not need to wait in queue to actually start running and gets as much cpu and memory as it can in the shared environment.  this is a very variable thing.

 

The main thing in query speed will be the DBMS.  This should be common to both Alteryx and Datastage for fair comparison.  Neither product has much influence on query speed once submitted to the backend.  Some drivers are better, yes, but the heavy lifted is in the DBMS.

 

We use IBM's IDAA, a MPPDB, and DB2 and Oracle and SqlServer.  Some queries we do in the intelligence area basically drive table scans in DB2 and Oracle and run in 3 hours for example.  The same query run against IDAA returns in 10 minutes.  At first we saw no advantage, we thought we had the wrong drivers or something...then I scrolled down on the input tool query screen and paid attention and saw the "Pre SQL" statement area.  To get the query into IDAA we had to include a pre-sql statement that tells DB2 to redirect the query to IDAA platform.  Then we see blazing speed.

 

In your use case was it the exact same SQL code?  Is there a pre-sql statement telling your backend to direct the query to a massively parallel DB or is the ALteryx version not even pointing at the MPPDB?  For us the connection information is the same but the pre-sql statement is what is critical to direct it to IDAA.  If your system is not using IDAA then it may have the same requirement or it may have a different requirement but it is not likely a difference in the connection definition.

 

We don't have Alteryx Server but the one thing I would want to see is the ability to use more than one server to run work on and a robust ability to manage the workload on those resources.  This effectively addresses the GRID advantage if we limit scope to Alteryx work because I think Alteryx workload is limited to managing Alteryx jobs only so it will do a good job for Alteryx jobs but will be impacted by other work on same system that may be running...I like dedicated system for Alteryx to more effectively manage memory and cpu.

 

 

fharper
12 - Quasar

To continue I never heard of a system that partition queries.  I would like that explained.  MPPDBs partition the data in to many small slices and stores them on separate nodes which are independent servers with direct storage for their slice of the data.  Datastage or Alteryx or any other tool submits the query and the MPPDB replicates the query to each relevant node, it does not partition the query.

 

If we are dealing with apples and oranges on backends then no comparison is meaningful.

 

If we are not blessed with a MPPDB then there are ways to code semi-dynamic and dynamic flows to break up the workload associated with a query.  This is not unique to a given tool but I find it very easy to do with Alteryx, and this varies on the nature of the data.  regardless of tool if you have to hit a legacy or non MPPDB you do a simple query to get the list of population keys, then break it into groups based on algorithm that makes sense based on data and system limits.  Now you have the population keys in workable groups maybe written to a series of files, then submit the same workflow to actually run the main query and downstream work multiple times in parallel.  then when last one finishes you have a final flow combine and package results.

 

We have cases with clients and a variety of elements in between so we can run the same query in the same flow but run multiple times in parallel but fed a different client as a key in the filtering logic of the query which is dynamically modified in the dynamic input tool.  or you get a list of cases for a client or clients and chop the case list into 10k groups and wrap into "'" and "," and embed into where clause, all dynamic as described before. write the code once in a dynamic fashion but break it into a couple flows or more, one to get the selection list and form it into something to replace into the next query in the next job, run that many times in parallel, then join the results for final processing.

 

If Alteryx delivers multi-threading as I hope this becomes physically simpler to a degree but slightly less dynamic in potential as you simply build separate cloned paths for the middle part to run the dynamic query and union the outputs of those paths for final processing in the same flow.  at a high level this is what we do for some queries against legacy DBs that have the potential to result in table scans.

 

 

MattB
Alteryx Alumni (Retired)

I love the dialog! Please feel free to continue. Just a couple comments from the Product side...please do stayed tuned for next generation of the Alteryx engine. It has shown some incredible performance enhancements by taking better advantage of multi-core systems. In addition, we are looking to drive even more performance in Alteryx Server by considering architectural changes. We already have tons of great ideas from customers to reference.

 

Please reach out to @MalaG if you want to hear more about the next generation of the Alteryx engine.

MattB
Alteryx Alumni (Retired)
Status changed to: Under Review

Updating status

MalaG
Alteryx Alumni (Retired)

@Atabarezz - We have launched the e2 Tech Preview. Please let me know, if you would be interested in participating in our e2 Tech Preview and I can share the details with you.