Advent of Code is back! Unwrap daily challenges to sharpen your Alteryx skills and earn badges along the way! Learn more now.
Community is experiencing an influx of spam. As we work toward a solution, please use the 'Notify Moderator' option on the ellipsis menu to flag inappropriate posts.

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Basic company fuzzy matching between master and new sources

datawork
5 - Atom

Hi all,

 

I am using https://www.youtube.com/watch?v=JlZiM2YSCCY as a guide to fuzzy match two source of company names. However, I am getting non-ideal outputs that I can use the community's help on to set me straight.

 

Attached is the self-contained flow in version 9 to illustrate. If you could make the change compatible in version 9 instead of 10, it would be great.

 

Obejctive is to match company names against each other. Master would have formal name like "Apple Inc."; new would have shortened name like Apple. I need the fuzzy match the two.

 

Master has:

Apple Inc.

Cisco Systems, Inc.

Microsoft Corporation

Oracle Corporation

 

New has:

Apple

Cisco

Microsoft

Oracle

 

After running the merger algorithm through with pre-defined "Company" matching. I only get 1 match instead of 4. Only Apple is being matched with Apple Inc.

 

Can you help me understand how to get to matching all 4 company between the 2?

 

Best,

Tom

Fuzzy matching is a process that either purges duplicate records from a data stream or merges desperate sources when a common identifier is not present. This video will review the fuzzy matching techniques and algorithms available in the Alteryx platform.
6 REPLIES 6
michael_treadwell
ACE Emeritus
ACE Emeritus

Tuning your Fuzzy Match tool to work with a data set is an iterative process. Many of the match functions you choose are not going to guarantee all of your fields match correctly. That being said, with the example data you gave, the fields in the New data set are wholly contained in the Master data set.

 

Next to Match Style in the Fuzzy Match configuration, choose Edit...

 

Under generate keys choose Whole Field. When Alteryx generates a key, it reads through the whole field and generates a key for each individual component of that field. Using 'Whole Field' will ensure that only keys where the components match will be output by Fuzzy Match.

 

As for Match Function, choose none.

 

This example output all four of the matches as you would expect but if your company names in New are not a subset of the company names in Master, this could present an issue.

datawork
5 - Atom

Thank you Michael on the response.

 

For now, I imagine "New" list would be a subset of the "master", e.g., Apple instead of Apple, Inc. I will give your suggestion a try.

 

Would you be able to give me a reference link to more details on how fuzzy match works? I not clearly not understanding terms in the configuration page such as "whole field", match weighting, threshold, etc.

 

Best,

Tom

michael_treadwell
ACE Emeritus
ACE Emeritus

The Edit Match Options in the Alteryx Help gives a more in-depth rundown of the steps and options for fuzzy matching. I would start there. Otherwise, help on specific matching algorithms is a bit more difficult to find all in one place. Fuzzy matching is an in-depth field that is not limited to just Alteryx. I would search online for a primer on 'Approximate String Matching'

 

http://help.alteryx.com/10.1/index.htm#FuzzyEditMatchOptions.htm

 

Let us know how the matches turn out with the new options.

datawork
5 - Atom

I just changed match function to be "None" to make it work after using the "company default style" first. See the updated zip file

 

If I uncheck the "Generate Keys for Each Word", I get only 2 matches.

 

Any thoughts?

JohnJPS
15 - Aurora

The attached uses fuzzy keys to successfully match the records from the two files.

It seems way too easy so I'm probably completely missing some requirement or another.

michael_treadwell
ACE Emeritus
ACE Emeritus

Ok, so in the module that you uploaded, you had Generate Keys as Double Metaphone, checked Generate Keys for each word, and no Match Function. That generates four matches.

 

 Capture.PNG

 

However, when you uncheck the options to generate individual word keys, you lose matches, why?

 

With maximum key length == 4 and the entire field (not the individual words because we have that unchecked) used, the metaphone for Cisco Systems is SSKS where the metaphone for Cisco is SSK. Simply put, the metaphone key is going to be similar to a spelling of the word with no vowels. 6 key metaphone of Microsoft would be MKRSFT (sound it out in your head).

 

These keys don't match, so Alteryx doesn't match the fields. You could fix this by checking the option to generate a key for each word. Cisco and Systems in Master will have different keys and will match the Cisco in New.

 

You could also leave 'Generate Keys for Each Word' uncheck and choose 'Whole Field' to Generate Keys. This would take the first 4 (remember maximum key length == 4) letters of each word and match on that.

 

You could also leave Double Metaphone, uncheck 'Generate Keys for Each Word' and mess with Match Threshold, Max Key Length, and Match Function to generate matches.

 

Like I said before, this stuff is not and exact science. You have to generate the model based on your data and use the one that gives you the most confidence.

Labels