Hello Community
I'm working on matching names in 2 separate tables using the Alteryx Fuzzy Tool.
I need some guidance in terms of how the actual match scores are being calculated behind the scenes.
The following are some sample test names that i have:
RecordID | Source | Field1 |
I | A | John Smith |
II | A | Donald Trump |
III | A | Vladimr Putin |
1 | B | Jan Smith |
2 | B | Jon Smeeth |
3 | B | Jon Sam |
4 | B | Duck Trump |
5 | B | Vlad Power |
6 | B | Donald Trump |
7 | B | John Smith |
8 | B | Vladimr Putin |
The following are the settings i've used on the Fuzzy Tool:
The following results are what i've obtained:
RecordID | Name | RecordID2 | Test | MatchScore | MatchKey |
1 | Jan Smith | I | John Smith | 88 | AN |
1 | Jan Smith | I | John Smith | 88 | JN |
1 | Jan Smith | I | John Smith | 88 | SM0 |
1 | Jan Smith | I | John Smith | 88 | XMT |
2 | Jon Smeeth | I | John Smith | 85 | AN |
2 | Jon Smeeth | I | John Smith | 85 | JN |
2 | Jon Smeeth | I | John Smith | 85 | SM0 |
2 | Jon Smeeth | I | John Smith | 85 | XMT |
3 | Jon Sam | I | John Smith | 80 | AN |
3 | Jon Sam | I | John Smith | 80 | JN |
7 | John Smith | I | John Smith | 100 | AN |
7 | John Smith | I | John Smith | 100 | JN |
7 | John Smith | I | John Smith | 100 | SM0 |
7 | John Smith | I | John Smith | 100 | XMT |
4 | Duck Trump | II | Donald Trump | 74 | TRMP |
6 | Donald Trump | II | Donald Trump | 100 | TNLT |
6 | Donald Trump | II | Donald Trump | 100 | TRMP |
8 | Vladimr Putin | III | Vladimr Putin | 100 | FLTMR |
8 | Vladimr Putin | III | Vladimr Putin | 100 | PTN |
Need a little guidance on how these calculations work.
I've watched the tutorials available here, but it doesn't go to the detail on how the match scores are calculated.
1) Key Generation
Double metaphone seems to have a default max key length of 8.
I've learnt that records in which the keys don't match are discarded.
Does it generate key for each word in a string?
Might need a little help in showing some examples of how keys are generated here.
2)Jaro Distance vs Levenshtein Distance
For Jaro Distance, based on the formula, i can briefly understand how the scoring is done.
But for LD i need a little help, since LD is the number of edits needed to translate String1 to String2.
How is the LD score weighted then translated to a match score, such that it gives JD an equivalent basis for comparison, then allowing the tool to pick the best of both?
3)Word-based vs character (no spaces)
I'm trying to understand how the scoring for this works.
Does it calculate the best of scores (Jaro and LD) for each word in the same string, then take the weight depending on how many words there are?
If so, then when does the scoring for character (no spaces) come in?
4)Preprocess: Strip Punctuation vs Strip Punctuation before Fuzzy Tool vs Character (no spaces)
Preprocess: Strip Punctuation
Because strip punctuation takes a longer time to process using the Tool, i decided not to use the preprocess.
Strip Punctuation before Fuzzy Tool
I've considered using Strip Punctuation before feeding it into the Fuzzy Tool.
The results i have seems to imply that word order is not taken into consideration when i do this (can't find my old examples for some reason).
Word order is important to me, hence i didn't choose this option.
Character (no spaces)
Now for the last alternative; i wanted to understand better on Character (no Spaces) when using a word-based match.
Somehow this gives a better result and also takes into account word order.
I was wondering if someone might be able to better explain how this works, as an alternative to "Strip Punctuation before Fuzzy Tool"?
Prefer with sticking the current settings i have at the moment for simplicity of explanation.
Of course, if there is a better setting i should use, do provide your suggestions as well.
Many thanks on this!
@monsieursnow , please see our help documentation below which should provide some clarity on the questions you have.
https://help.alteryx.com/20221/designer/fuzzy-match-edit-match-options
Have you checked out our Tool Master article as well?
https://community.alteryx.com/t5/Alteryx-Designer-Knowledge-Base/Tool-Mastery-Fuzzy-Match/ta-p/45485
Hello Andrew
Thank you for responding.
I've read the articles you mentioned, watched the video on Fuzzy Matching and also did the interactive lesson on it, before I posted this.
But was still unable to understand it 100%.
There seems to be a proprietary matching engine within the Fuzzy Tool was my conclusion.