Engine Works

Under the hood of Alteryx: tips, tricks and how-tos.
DataNath
17 - Castor
17 - Castor

Background

 

Most if not all of us have been guilty of writing large, hard-coded statements to filter our data, flooding our expression editors with an unreasonable amount of AND/OR operators, even if - at the back of our minds - we know it’s not best practice. Luckily, there are other ways of achieving the same result that can make things easier to implement and maintain. All three of these tips/methods involve the use of a lookup table in order to store our filtering values and so this can also help to make the process more dynamic, especially if we can populate these tables via a database or well-governed form/spreadsheet. It’s worth noting that, although we’re only looking at these examples in Alteryx, the concepts hold true regardless of the tool, meaning we can use the exact same techniques in SQL/Python and so on.

 

Filtering exact matches via joins

 

The first approach we’ll cover here is filtering for multiple discrete values in a column. An example of this would be isolating records relating to an origin airport of ‘Heathrow’ and ‘JFK’ from a dataset covering airline journeys. Whilst we could do this using a fairly simple filter expression along the lines of - [orig_airport] IN (‘Heathrow’, ‘JFK’) - what happens when we want to select records from, say, 75 airports? That starts to become a lot more hassle, right? Right. So that’s where our first method comes into play. Instead of typing out an enormous ‘IN’ statement, we can simply set up a lookup table and then conduct a Join on <field in main table> = <field in lookup table>.

 

Using a dataset of New York City restaurant inspections, let’s have a look at how that looks in practice. First we start with our main dataset:

 
1.png

 

And our lookup table which holds our list of discrete dates:

 

2.png

 

In order to narrow our records down to only those dates, all we need to do from here is conduct a join - in this case - on [INSPECTION DATE] = [FilterDates]. The result is our reduced data set:

 

3.png

 

Using a Summarize tool we can create a distinct list of our [INSPECTION DATE] field and confirm that we do indeed only have entries that match the [FilterDates] list:

 

4.png

 

As usual, any records from the main data table that don’t match the filter dates will fall out of the L output anchor. Inversely, filter dates that don’t find a match will show in the R output anchor.

 

Filtering for partial matches with Find Replace

 

Filtering exact matches through the use of a Join is all well and good. However, what happens when we want to narrow down our data by checking whether a lookup value simply appears within a field? Well, that’s where our next technique comes in. The long-winded/hardcoded approach to this would be typing an expression with a Contains() or REGEX_Match() function for every value we wanted to check. Again, this isn’t ideal and so we’ll take a look at the Find Replace method. Once more we start with the New York restaurant inspection data:

 

5.png

 

But this time, we want to check for certain things in the [VIOLATION DESCRIPTION] and therefore have a simple list of keywords in our [ContainsFilter] column of the lookup table:

 

6.png

 

To trim our data down based on the presence of these values, we first need to bring on a Find Replace tool and configure it like so:

 
7.png

 

If you’re wondering what on earth is going on here then don’t worry, here’s a quick rundown of each part:

  1. (Find) Any Part of Field: This means the tool will look for a match anywhere within the ‘Find Within Field’ field, rather than just the start, or an exact match, as the other options offer

  2. (Find Within Field) [VIOLATION DESCRIPTION]: Here we have just selected our target field i.e. the one that we want to look in to see if our lookup values are present

  3. (Find Value) [ContainsFilter]: This is where we choose the field in our lookup table that contains the lookup values we are searching for

  4. Case Insensitive Find: Fairly self-explanatory, this makes the search non-greedy i.e. HeLlO, HELLO and hello are all seen as the same by the tool

  5. (Append Field(s) to Record) [ContainsFilter]: You’ll notice this is a radio button option - as standard the tool is set to ‘Replace Found Text With Value:’ and, when a match to a lookup value is found, this would be replaced with another string that you’d provide in a replacement value field. For our scenario, we just need something to use for filtering and so we’ve elected for the ‘Append Field(s) to Record’ configuration, choosing [ContainsFilter] as the field. What this means is that, when a match is found in the [VIOLATION DESCRIPTION], the [ContainsFilter] text will be appended to the record. Crucially for us, if a match isn’t found, there will be a null record in the [ContainsFilter] field

 

We can see in the tool’s output where matches have/haven’t been found:

 
8.png

 

So now that we have this new field that we’re essentially using as a flag, we can simply employ a Filter tool to remove null records, as these don’t contain any of our lookup words of interest. The result of that Filter is now our nice, lean data set:

 

9.png

 

After the Filter, we just used a Select tool in order to remove the temporary flag column. One important thing to note is that the Find Replace tool only works with strings. Therefore, if you’re working with things like numeric data, you’ll need to do a data type conversion beforehand.

 

Dynamically create a Contains() list and Append Fields

 

Our final method can be used for the same purposes as those above; depending on your configuration, you can either look for partial matches, or both partial and exact matches simultaneously. For these examples, we’ll be returning to our list of dates that we want to isolate. However, this time we’re going to want to concatenate our lookup field into a separated list like so:

 
10.png

 

Once we’ve created the list, we then look to use the Append Fields tool in order to place this next to each record to be used for filtering. Users often shy away from Append Fields because of the risk of exploding our data (which is of course a genuine concern). However, when we’re appending a single value/cell, this isn’t a big deal. Anyway, that leaves us with something like this:

 
11.png

 

Now we move to the Filter step and this first option is greedy i.e. will allow both exact and partial matches to pass through. We start off by using the expression: Contains([LookupField],[TargetField]), which looks for our target field’s value within the large concatenated string we created earlier. By removing a bunch of columns we can easily see that the Filter has worked as expected:

 

12.png

 

Finally, we can apply an almost identical expression in order to ensure that we only get exact matches with this method. In order to make the search non-greedy, we simply need to wrap our values in something that makes them more specific i.e. quotes. In the Filter tool, we then ensure that we put the same character around our [TargetField]. That looks a bit like this when creating the list:

 
13.png

 

And when writing the Filter expression:

 
14.png

 

So how does this actually make things more precise? Well in this case, let’s say one of our dates had been entered incorrectly or somehow been trimmed and ended up looking like 05/02/2 instead of 05/02/2019 - when applying the first method i.e. without wrapping, we’re simply looking for whether 05/02/2 appears within 04/29/2017,05/02/2019,10/03/2017… which it does, even if it’s not a full match of that date. However, when we wrap each value, and do the same to the target value field, we end up looking for “05/02/2” in "04/29/2017","05/02/2019","10/03/2017"... which it doesn’t find because of the added quotes.

 

So there we have it! A handful of methods that you can use to make your life easier and save the headache of enormous expressions when filtering multiple values. Whilst these techniques are useful, setting up your lookup tables properly and maintaining them is essential to ensuring you don’t run into problems further down the line. As usual, we hope you have found this blog useful and please feel free to reach out with any feedback, criticism or suggestions for future content!

 

This article was originally published on Databasyx.com

Nathan Purvis
Data Engineer

Co-founder @ Databasyx.com | Inspire 2024 speaker | London User Group leader | Community Top Contributor | Consultant | Continuous learner | I like building and breaking stuff

Co-founder @ Databasyx.com | Inspire 2024 speaker | London User Group leader | Community Top Contributor | Consultant | Continuous learner | I like building and breaking stuff

Comments