Alteryx Designer Knowledge Base

Definitive answers from Designer experts.
Don't forget to submit your entry for the Excellence Awards by October 30! | Need more information about the program? Check out the blog here

Splitting records into smaller chunks to make a workflow process quicker.

Alteryx
Alteryx
Created

 

Sometimes large amounts of data can 'overwhelm' a tool or process in your workflow and make it appear like it is stuck or frozen.  This was the case recently when a user attempted to pass 7 million rows of data to a CASS tool.  Splitting these records into smaller chunks makes the process run much quicker. 

 

In this example, we'll discuss how you can control the number of records that pass to a batch macro.  Think of a batch macro as a process that runs multiple times until all records have been processed.  The 'process' is simply a workflow you want to run with the addition of a control parameter.  

 

In the attached example, we have an input file of 16 records in a single column called 'Address1'.  In reality, this file could be much larger, containing millions of rows and many columns.  We will group our data so each group contains 5 records (this should give us a total of 4 groups where the last group only has 1 record in it).  From there, we will send 5 records at a time to a process that counts the number of times the process runs/iterates (if all goes well, the process will iterate for each group so we should see 4 iterations).  The process itself is simple.  It selects a specific field using a Select tool so we only get the information we need (Address1) and the Select tool connects to a Formula tool (If you're using the Formula tool in v11.0+, be sure to familiarize yourself with the tool's interface redesign!) that creates a field called 'IterationID'.

 

The first step is to bring in your input file.  In our case, it is a text file with 16 records and a single column.

BatchGroup01.png

 

BatchGroup02.png

 

Connect a Record ID tool (If you're using the Record ID tool in v11.0+, be sure to familiarize yourself with the tool's interface redesign!) to the Text Input.  We will give the first group a starting value of 0.

 BatchGroup03.png

 

BatchGroup04.png

 

Connect a Formula tool to the Record ID tool.

BatchGroup05.png

 

Create a new field called 'GroupID' and make the data type numeric (Int32 in this example).  For the expression, write FLOOR(TONUMBER([RecordID])/5).  This changes the data type for 'RecordID' to numberic and the keyword FLOOR ensures the number will be an interger.  Divide by 5 so each group will have no more than 5 records in a group.  In your real-world example, you may need to play around with that number to determine which is quickest for your workflow.  If you have millions of records in your file, a good place to start may be to replace 5 with 10000 or 5000.

 

BatchGroup06.png

 

 

Connect a Summarize tool to the Formula tool to get a list of unique GroupIDs.

BatchGroup07.png

 

 

BatchGroup08.png

 

 

Connect a Select tool to the Formula tool as shown here (both Summary and Select tools are connect to the Formula tool).

BatchGroup09.png

 

At this point, we no longer need RecordID, so unchecked that field.

 

BatchGroup10.png

 

 

We have created our 4 groups of no more than 5 records.  Now let's discuss the process we want to use as a batch macros that will reach each of these 4 groups, one at a time, until all 4 have looped through the process.  This process just selects a specific record using a Select tool.

 

BatchGroup11.png

 

We only want Address1.  The GroupID will no longer be needed in after this point.

 

BatchGroup12.png

 

Connect a Formula tool to the Select tool.  Create a new field called 'InterationID' and make the data type numeric.  In the 'Constants' section below, notice there is something called 'Engine.InterationNumber'.  This constant is the expression and what it does is create a running total for how many times our process iterates. 

 

BatchGroup13.png

 

To make the process above a batch macro, add a Control Parameter tool to work area.  Connect a Macro Input tool before the Selection tool and a Macro Output tool after the Formula tool as shown below.

 

BatchGroup14.png

 

Enter 'GroupID' as the label for the Control Parameter.

 

BatchGroup15.png

 

For the Macro Input tool, you need to provide a template for that the data will look like as data gets passed to the Selection tool.  Click on 'Edit Data' under 'Text Input'.

 

BatchGroup16.png

 

 The data will have a field called 'Address1' and 'GroupID'.  Enter your data so it looks like this:

 

BatchGroup17.png

 

Now save your batch macro, remembering where you saved it. 

 

Go back to your workflow where you created the groups.

 

BatchGroup09.png

 

Connect your macro you created to the Summary and Select tools like this (you can add a browse after the macro):

 

BatchGroup21.png

 

The best way to add your macro to the work area is to right click --> Insert --> Macro and select the macro where you saved it.

 

BatchGroup22.png

 

Select your macro.

 

BatchGroup23.png

 

To configure your macro, there are two tabs in the Configuration window.

 

BatchGroup18.png

 

Configure the Group By tab like this:

 BatchGroup19.png

 

And the Questions tab like this:

 

BatchGroup20.png

 

If you haven't already, you can add a Browse tool or an Output tool to your macro tool.

 

BatchGroup21.png

 Looking at the output from the Browse tool, we see this:

 

BatchGroup24.png

 

Which shows 5 records were run through the first, second, and third iterations, and the fourth iteration processed a single record, which is what we expected to have happened.  Use this approach and engineer it to your specific situation to improve the overall efficiency of your workflow whenever you have an input file that is very large.  It may help some of your tools and processes run quicker in the long run.

Attachments
Comments
5 - Atom

Thanks, works perfectly.

8 - Asteroid

is there an option to combine multiple records into 1 and then maybe submit a API request @WayneW

8 - Asteroid

@WayneW  i was processing 41 Million records through the US geocoder with the CASS option and the process ran for about 20 hours before i killed it as it was consuming all the CPU. i had done a small test of 200k records which took 2 minutes to process. By that estimate it should have taken a little under 5 hours for it to process the data but it ran forever.

 

 

do you know whats the optimum batch size for the cass tool would be if i use the batch macro and i can improve on the performance from 20 hours to something less. 

5 - Atom

Could the tile tool not be used at the beginning, rather than using a formula FLOOR formula?