Hi all,
I am using https://www.youtube.com/watch?v=JlZiM2YSCCY as a guide to fuzzy match two source of company names. However, I am getting non-ideal outputs that I can use the community's help on to set me straight.
Attached is the self-contained flow in version 9 to illustrate. If you could make the change compatible in version 9 instead of 10, it would be great.
Obejctive is to match company names against each other. Master would have formal name like "Apple Inc."; new would have shortened name like Apple. I need the fuzzy match the two.
Master has:
Apple Inc.
Cisco Systems, Inc.
Microsoft Corporation
Oracle Corporation
New has:
Apple
Cisco
Microsoft
Oracle
After running the merger algorithm through with pre-defined "Company" matching. I only get 1 match instead of 4. Only Apple is being matched with Apple Inc.
Can you help me understand how to get to matching all 4 company between the 2?
Best,
Tom
Solved! Go to Solution.
Tuning your Fuzzy Match tool to work with a data set is an iterative process. Many of the match functions you choose are not going to guarantee all of your fields match correctly. That being said, with the example data you gave, the fields in the New data set are wholly contained in the Master data set.
Next to Match Style in the Fuzzy Match configuration, choose Edit...
Under generate keys choose Whole Field. When Alteryx generates a key, it reads through the whole field and generates a key for each individual component of that field. Using 'Whole Field' will ensure that only keys where the components match will be output by Fuzzy Match.
As for Match Function, choose none.
This example output all four of the matches as you would expect but if your company names in New are not a subset of the company names in Master, this could present an issue.
Thank you Michael on the response.
For now, I imagine "New" list would be a subset of the "master", e.g., Apple instead of Apple, Inc. I will give your suggestion a try.
Would you be able to give me a reference link to more details on how fuzzy match works? I not clearly not understanding terms in the configuration page such as "whole field", match weighting, threshold, etc.
Best,
Tom
The Edit Match Options in the Alteryx Help gives a more in-depth rundown of the steps and options for fuzzy matching. I would start there. Otherwise, help on specific matching algorithms is a bit more difficult to find all in one place. Fuzzy matching is an in-depth field that is not limited to just Alteryx. I would search online for a primer on 'Approximate String Matching'
http://help.alteryx.com/10.1/index.htm#FuzzyEditMatchOptions.htm
Let us know how the matches turn out with the new options.
Ok, so in the module that you uploaded, you had Generate Keys as Double Metaphone, checked Generate Keys for each word, and no Match Function. That generates four matches.
However, when you uncheck the options to generate individual word keys, you lose matches, why?
With maximum key length == 4 and the entire field (not the individual words because we have that unchecked) used, the metaphone for Cisco Systems is SSKS where the metaphone for Cisco is SSK. Simply put, the metaphone key is going to be similar to a spelling of the word with no vowels. 6 key metaphone of Microsoft would be MKRSFT (sound it out in your head).
These keys don't match, so Alteryx doesn't match the fields. You could fix this by checking the option to generate a key for each word. Cisco and Systems in Master will have different keys and will match the Cisco in New.
You could also leave 'Generate Keys for Each Word' uncheck and choose 'Whole Field' to Generate Keys. This would take the first 4 (remember maximum key length == 4) letters of each word and match on that.
You could also leave Double Metaphone, uncheck 'Generate Keys for Each Word' and mess with Match Threshold, Max Key Length, and Match Function to generate matches.
Like I said before, this stuff is not and exact science. You have to generate the model based on your data and use the one that gives you the most confidence.