Engine Works

jnorton4 · ‎01-14-2021

Count Records and Test Tools

Have you ever built a workflow only to find out later that you’re losing a bunch of your records after trying to Join a couple files together? Do you ever wish you could check the number of records as they pass through your workflow, and be alerted when records get dropped? Or even stop the workflow when there’s a value that is too high or too low? With Alteryx, you can set up these checkpoints and alerts, and focus on your project instead of spend time trying to find where your workflow is leaking records. You can use a Count Records tool where you need to check the number of records, or use a Test tool when you want to check to see if the number of records matches a value or meets some condition.

What we will use:

Alteryx’s Test tool
Alteryx’s Count Records tool

An example workflow is attached to this post (shown below); simply open into Alteryx and run it.

In the screen shot above, I’m using both the Test tool and the Count Cases tool for three use cases:

As a check to see if we’re losing records after a Join
To make sure no more than 5% of our employees aren’t working too many or too few hours
To make sure the number of records at the end of the workflow matches the count from earlier points in the workflow.

First Use Case: Ensuring No Records Get Left Behind After a Join

When you join tables together, you don’t expect to lose cases that you thought were going to match. Usually, you expect all the cases in one file to match with a case in another. But what happens when some IDs in one file are entered incorrectly? You’ll risk being unable to match those cases, and you will lose them without even knowing it, which could pose quite a problem! This is something that can happen when you use the Join tool and is something we want to avoid. So, I’ve found it helpful in these cases to set up an alert with the Test tool which will tell me if the number of cases that fail to join is greater than zero; if any cases flow through the left or right output anchor of a Join tool, that will trigger the alert.

The example I have comes from the HR space. Workers at a company responded to an employee satisfaction survey, and the results are stored in 2 files – one stores the answers to “Strongly Agree – Strongly Disagree” type questions. The other file contains their comments to an open-ended question. We need to match the comments to the right employees so we can do some more analyses on the combined data. And to do that, we will use a Join tool to match records by Employee ID.

Here’s how to do it:

Attach a Test tool to the left output anchor of the Join tool and in the Configuration Window at the left, click Add. (You can also edit and delete existing tests).
- Give the test a name
- Specify the Input connection; although the Test tool can accept more than 1 connection, here there is only one, #1.
- Choose Record Count Matches Specific Value This test will throw an alert when the record count does not match the value you enter in the Test Value field below it.
- Enter 0 as the test value and click on OK.
Do the same thing for right output anchor of the Join tool; attach a Test tool to that anchor and configure it the same way we just did above.

Notice here that there are 332 records in the Employee Satisfaction data, but we have only 309 records of employee comments. The Join matched 309 comments with the employee survey responses. But we also lost 23 survey records from the Left output anchor; not all employees provided a comment. The Test tools we just set up will alert us if the record count coming through either output anchor is not 0. And because there were 23 records coming through the Left output anchor of the Join, the Test throws an alert[i].

(Note: In the attached workflow, you’ll notice I’ve included an identical workflow in a separate container, but one where there are no records lost in the Join. There are 332 employee records, and 332 employee comments. All records match, so no alerts are thrown. To view this, open the container under the workflow and you can use the shortcut Ctrl-R to re-run the whole thing; you’ll see there are no lost records, and so no alert gets thrown.

Second Use Case: Alert Me When More Than 5% of Employees Are Working Too Many or Too Few Hours

In this second use case, I want to be alerted when 5% or more of my employees are working either too many hours (or too few) compared with the average number of hours worked by their department. If I have too many employees working too many hours, that would suggest I need more employees; if too many employees are working too few hours, that would suggest an opportunity to train them to increase their skill set and cover more tasks.

So, imagine that you need to be alerted when a condition or comparison – maybe like the one in this use case – needs to be met, but fails. Again, you can use the Test tool to do this; in addition to testing for one value like we just did, the Test tool can also determine if more complex conditions are met.

For this example, we want to do two things:

Throw an alert if the number of employees report being overworked is more than 5%
Throw an alert if the number of employees report being underworked is more than 5%.

The following part gets a bit technical; Alteryx users who are familiar with data prep might want to skip to the next section about using the Test tool to test for a condition.

First, we need to calculate the values for over- and underutilization, by job type. I’ll define an overworked employee as someone who works more hours than twice the standard deviation above the average for their job type, and an underworked employee is someone who work less hours than twice the standard deviation below the average for their job type:

Using the same workflow we just started, attach a Summarize tool to the J output anchor of the Join tool.
- Group by CurrentPosition
- Select HoursWorked and request the Average
- Select HoursWorked again, but get the Standard Deviation
Place a Join tool after the Summarize and connect the Summarize output anchor to the R input anchor of the Join tool. Connect J output anchor of the first Join tool to the R input anchor of this Join tool. Select CurrentPosition as the field to join by (you can also deselect the CurrentPosition from the R input, so we don’t have 2 identical columns).
Next, add a Formula tool to the J output anchor of the Join tool and configure as shown below:

Now, attach a Summarize tool to the Formula and group by Utilization – which we just created - and request Count.
Next, attach a Cross Tab tool to the Summarize tool and choose Utilization for the column headers and Count for the values. Choose First as the method for aggregating values.
Finally, to calculate that 5% threshold of the workforce, attach a Formula tool after the Cross Tab and configure as shown below:

Use the Test Tool to Alert Us When Our Condition is Met

Now we can test to see if 5% or more of employees are working too many or too few hours. So, let’s attach two Test tools to the Formula tool and configure one to add a test called OverUtilization. Choose the test type for “Expression is True for First Record” and we’ll define the Test value as [OverUtilized] <= [Threshold]. Let’s do the same for the other Test tool, but here we’ll define the test value as [UnderUtilized] <= [Threshold].

Neither of tests throws an alert at this point, so we don’t have to worry that there’ll be an impact on headcount, which is a good thing! So, now we can add an Output Data tool to save the file and use it later for more analyses.

So, just as we did in this second use case, in a similar way you can use it to test for any number of conditions you want, and whenever you need to make sure that the number of records passing through your workflow doesn’t exceed a defined condition, or maybe does meet a defined comparison.

Third Use Case: Ensure That the Number Of Cases At The End Of Your Workflow Matches The Count From Earlier Points

For this last use case, we need to be sure that no records were unexpectedly dropped at different points along the line. We need to be sure our record counts are what we expect them to be. So, I’ll use several Count Records tools placed at different points along our workflow and use a Test tool to check if the final record count equals the number of records counted at an earlier point in the workflow (shown below).

Notice in the sample workflow that I use wireless connections[ii] for each of the Count Records tools in the workflow. The first Count Records tool (at the top) is connected to the Input Data tool for the Employee Data, which is at the very beginning of the workflow.

The Count Records tool is so easy to use, you don’t even need to configure it. But be careful, because it outputs only one cell of data – the record count – and names that column Count. In my example I have six Count Records tools so I will need to attach a Select tool after each one so I can rename Count to something like UploadedEmployeeRecords which describes the record count when I first uploaded the employees’ survey data.

Now let’s use the Join Multiple tool to join by record position and combine all record counts into one row of data.

Your results will look like the screen shot below:

So, you can see that we uploaded 309 comments and 332 survey responses, and then joined the two files by matching on Employee ID. Three Hundred and nine records joined successfully, but we also lost 23 records from the survey data file. Well, this makes sense because remember that in the top workflow example, there were 23 employees who did not offer a comment[iii].

Now, if we want to check the final record count and compare it to the count earlier in the workflow, you can add a Test tool to Join Multiple tool, choose test type for Expression is True for the First Record and use [SuccessfullyJoined] = [FinalOutputRecordCount] as the Test value. If the record counts at these two points are different, the rule will throw an alert. And you can do this sort of value checking throughout your own workflows, to ensure that the record counts at the start, throughout and at the end are what you expect them to be or alert you if they are not.

________________________________________

[i] If you wanted to stop your workflow from running completely when an alert is thrown, you can click the checkbox in the Runtime settings of your workflow configuration for Cancel Running Workflow on Error.

[ii] More about wireless connections here: https://community.alteryx.com/t5/Alteryx-Designer-Ideas/Wireless-Connections/idi-p/1367

[iii] If you needed those 23 records anyway, you could use a Union tool to merge the 23 cases back into the data, and their values for the Comment column would be null.