- Subscribe to RSS Feed
- Mark as New
- Mark as Read
- Bookmark
- Subscribe
- Printer Friendly Page
- Notify Moderator
Background
Most if not all of us have been guilty of writing large, hard-coded statements to filter our data, flooding our expression editors with an unreasonable amount of AND/OR operators, even if - at the back of our minds - we know it’s not best practice. Luckily, there are other ways of achieving the same result that can make things easier to implement and maintain. All three of these tips/methods involve the use of a lookup table in order to store our filtering values and so this can also help to make the process more dynamic, especially if we can populate these tables via a database or well-governed form/spreadsheet. It’s worth noting that, although we’re only looking at these examples in Alteryx, the concepts hold true regardless of the tool, meaning we can use the exact same techniques in SQL/Python and so on.
Filtering exact matches via joins
The first approach we’ll cover here is filtering for multiple discrete values in a column. An example of this would be isolating records relating to an origin airport of ‘Heathrow’ and ‘JFK’ from a dataset covering airline journeys. Whilst we could do this using a fairly simple filter expression along the lines of - [orig_airport] IN (‘Heathrow’, ‘JFK’)
- what happens when we want to select records from, say, 75 airports? That starts to become a lot more hassle, right? Right. So that’s where our first method comes into play. Instead of typing out an enormous ‘IN’ statement, we can simply set up a lookup table and then conduct a Join on <field in main table> = <field in lookup table>
.
Using a dataset of New York City restaurant inspections, let’s have a look at how that looks in practice. First we start with our main dataset:
And our lookup table which holds our list of discrete dates:
In order to narrow our records down to only those dates, all we need to do from here is conduct a join - in this case - on [INSPECTION DATE] = [FilterDates]
. The result is our reduced data set:
Using a Summarize tool we can create a distinct list of our [INSPECTION DATE]
field and confirm that we do indeed only have entries that match the [FilterDates]
list:
As usual, any records from the main data table that don’t match the filter dates will fall out of the L output anchor. Inversely, filter dates that don’t find a match will show in the R output anchor.
Filtering for partial matches with Find Replace
Filtering exact matches through the use of a Join is all well and good. However, what happens when we want to narrow down our data by checking whether a lookup value simply appears within a field? Well, that’s where our next technique comes in. The long-winded/hardcoded approach to this would be typing an expression with a Contains()
or REGEX_Match()
function for every value we wanted to check. Again, this isn’t ideal and so we’ll take a look at the Find Replace method. Once more we start with the New York restaurant inspection data:
But this time, we want to check for certain things in the [VIOLATION DESCRIPTION]
and therefore have a simple list of keywords in our [ContainsFilter]
column of the lookup table:
To trim our data down based on the presence of these values, we first need to bring on a Find Replace tool and configure it like so:
If you’re wondering what on earth is going on here then don’t worry, here’s a quick rundown of each part:
-
(Find) Any Part of Field: This means the tool will look for a match anywhere within the ‘Find Within Field’ field, rather than just the start, or an exact match, as the other options offer
-
(Find Within Field) [VIOLATION DESCRIPTION]: Here we have just selected our target field i.e. the one that we want to look in to see if our lookup values are present
-
(Find Value) [ContainsFilter]: This is where we choose the field in our lookup table that contains the lookup values we are searching for
-
Case Insensitive Find: Fairly self-explanatory, this makes the search non-greedy i.e. HeLlO, HELLO and hello are all seen as the same by the tool
-
(Append Field(s) to Record) [ContainsFilter]: You’ll notice this is a radio button option - as standard the tool is set to ‘Replace Found Text With Value:’ and, when a match to a lookup value is found, this would be replaced with another string that you’d provide in a replacement value field. For our scenario, we just need something to use for filtering and so we’ve elected for the ‘Append Field(s) to Record’ configuration, choosing
[ContainsFilter]
as the field. What this means is that, when a match is found in the[VIOLATION DESCRIPTION]
, the[ContainsFilter]
text will be appended to the record. Crucially for us, if a match isn’t found, there will be a null record in the[ContainsFilter]
field
We can see in the tool’s output where matches have/haven’t been found:
So now that we have this new field that we’re essentially using as a flag, we can simply employ a Filter tool to remove null records, as these don’t contain any of our lookup words of interest. The result of that Filter is now our nice, lean data set:
After the Filter, we just used a Select tool in order to remove the temporary flag column. One important thing to note is that the Find Replace tool only works with strings. Therefore, if you’re working with things like numeric data, you’ll need to do a data type conversion beforehand.
Dynamically create a Contains() list and Append Fields
Our final method can be used for the same purposes as those above; depending on your configuration, you can either look for partial matches, or both partial and exact matches simultaneously. For these examples, we’ll be returning to our list of dates that we want to isolate. However, this time we’re going to want to concatenate our lookup field into a separated list like so:
Once we’ve created the list, we then look to use the Append Fields tool in order to place this next to each record to be used for filtering. Users often shy away from Append Fields because of the risk of exploding our data (which is of course a genuine concern). However, when we’re appending a single value/cell, this isn’t a big deal. Anyway, that leaves us with something like this:
Now we move to the Filter step and this first option is greedy i.e. will allow both exact and partial matches to pass through. We start off by using the expression: Contains([LookupField],[TargetField])
, which looks for our target field’s value within the large concatenated string we created earlier. By removing a bunch of columns we can easily see that the Filter has worked as expected:
Finally, we can apply an almost identical expression in order to ensure that we only get exact matches with this method. In order to make the search non-greedy, we simply need to wrap our values in something that makes them more specific i.e. quotes. In the Filter tool, we then ensure that we put the same character around our [TargetField]
. That looks a bit like this when creating the list:
And when writing the Filter expression:
So how does this actually make things more precise? Well in this case, let’s say one of our dates had been entered incorrectly or somehow been trimmed and ended up looking like 05/02/2 instead of 05/02/2019 - when applying the first method i.e. without wrapping, we’re simply looking for whether 05/02/2 appears within 04/29/2017,05/02/2019,10/03/2017… which it does, even if it’s not a full match of that date. However, when we wrap each value, and do the same to the target value field, we end up looking for “05/02/2” in "04/29/2017","05/02/2019","10/03/2017"... which it doesn’t find because of the added quotes.
So there we have it! A handful of methods that you can use to make your life easier and save the headache of enormous expressions when filtering multiple values. Whilst these techniques are useful, setting up your lookup tables properly and maintaining them is essential to ensuring you don’t run into problems further down the line. As usual, we hope you have found this blog useful and please feel free to reach out with any feedback, criticism or suggestions for future content!
This article was originally published on Databasyx.com
Data Engineer
Co-founder @ Databasyx.com | Inspire 2024 speaker | London User Group leader | Community Top Contributor | Consultant | Continuous learner | I like building and breaking stuff
Co-founder @ Databasyx.com | Inspire 2024 speaker | London User Group leader | Community Top Contributor | Consultant | Continuous learner | I like building and breaking stuff
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.