Advent of Code is back! Unwrap daily challenges to sharpen your Alteryx skills and earn badges along the way! Learn more now.
Community is experiencing an influx of spam posts. As we work toward a permanent solution, please use the 'Notify Moderator' option on the ellipsis menu to flag inappropriate posts. Thank you for your patience!

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.

Fuzzy Match Configuration

ValerieO
5 - Atom

I’m building a workflow where the objective is to compare a customer list to a list of companies that is provided by a third party.  We want to make sure that every customer appears on the third party’s list.  There may be differences in the naming convention, which is why we typically run a fuzzy match.  This takes a long time to perform in Excel, so we’re looking to get this into Alteryx. 

 

I’ve provided examples of two workflows I’ve created and some sample data.  The original customer file I’m working with has 3,240 entries.  The original third party list has 6,770 entries.  Within both files, the same company name may repeat, but it has different # identifiers.  (The # identifiers have no correlation between the files, we don’t need to match based on them, we just want to keep all that original data in our output.) 

 

When I run workflow number one, the output is 35,142 entries.  (Doing a fuzzy match in Excel only results in the 3,240 entries.)  There are a lot of duplicates because of the repeating company names/different # identifiers.  I tried to remove these by adding a “unique” step in the second workflow, but am definitely not applying it in the correct manner.  Also, even when a customer name matches a name from the third party, it continues to try to match it to other names (i.e. filter the output on customer name “University of Iowa” and see how it matches it against anything with “University” in the name, or “Cloos Electronic” matches with “Electronics”), which is why we’re getting so many entries in the results.  We ignore the 100% matches.  Typically, we only look at matches less than 80%, but the match %s coming out of Alteryx seem to be higher than the Excel match %s we get on the same data set, so I’m also concerned about false positives.  I attached the output from the Excel match as reference.

 

I don’t have another Alteryx resource in my group, so I’m doing my best to teach myself via the trainings, etc. but am having difficulty – the output either has too many entries, or too few and is missing matches (or misses) that should be found.  I know this is end user error, but I am struggling to make the solutions work and would appreciate any insight.

2 REPLIES 2
JagdeeshN
12 - Quasar
12 - Quasar

Hi @ValerieO ,

 

Please find attached one way of implementing what you have describe. If builds on the design structure in you workflows to get you only the unique values in the customer list and their respective match score.

 

Do let me know if  this helps point you in the right direction.

 

Best

Jagdeesh

ValerieO
5 - Atom

Hi @JagdeeshN,

 

I can’t tell you how much I appreciate the time you took to help with this.  I went through the workflow to understand the steps you added.  It all makes sense, but I definitely need some more Alteryx practice so I can eventually get there on my own!

 

I do have one more question – in the output we’d like to see the name from the customer list next to the name from the third party list so at a glance we can tell how close they are, i.e. are they a true match with maybe a misspelling or different abbreviation or is there really no match.  I added the “name” field in the Select tool in the bottom container where there’s the join to find the perfectly matching customer names.  The name now appears in the output next to the customer name but only for those that are a perfect match since it’s coming from that container.  I’ve spent an embarrassing amount of time trying to figure out how to pull the name for the remaining matches that are less than 100% and coming from the top data container.  Does this have to get pulled back in from another place in the workflow?

 

Thank you again!

Valerie

Labels