Using fuzzy match on a list of names that vary significantly
Hi all,
I have a data set with a list of names that is manually entered. Here I could have several differing spellings of the same person. For example:
Colin Hayward
Colin Haywood
Colin Hayword
Colin Heywood
Collin Hayword
To me, I can spot that it is the same person, but how can I get Alteryx to do this for me? The list is too long to hold an Index and the name spelling could change with each report. I have also tried using first initial and surname, but again, that assumes the first name is spelt correctly.
I would like fuzzy match logic, or an equivalent, that could get me to a 90% solution with only a bit of manual work left over. I'm not sure a 100% solution is achievable here.
Any ideas would be really appreciated.
David.
- Labels:
- Fuzzy Match
Hi @Kearnd967, you can try fuzzy matching. In simple situations like your example it can work very well.
However, it also has some problems. With dynamic data, it can be hard or even impossible to eliminate incorrect matches. Because of this, I'd advise being very careful about using this tool in dynamic workflows.
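To illustrate the incorrect-match risk outside of Alteryx, here is a rough Python sketch. It uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity score (this is not the algorithm the Fuzzy Match tool uses), and "Carol Hayward" is a made-up second person, not a name from the original data. The 0.80 cut-off is also just an assumption for the example.

```python
from difflib import SequenceMatcher


def score(a: str, b: str) -> float:
    """Similarity between 0 and 1, ignoring case (stand-in for a fuzzy match score)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


reference = "Colin Hayward"
typo = "Colin Haywood"           # misspelling of the same person (from the question)
other_person = "Carol Hayward"   # hypothetical different person, invented for this sketch

THRESHOLD = 0.80  # assumed cut-off, purely illustrative

typo_score = score(reference, typo)
other_score = score(reference, other_person)

for label, value in [(typo, typo_score), (other_person, other_score)]:
    verdict = "match" if value >= THRESHOLD else "no match"
    print(f"{reference} vs {label}: {value:.2f} -> {verdict}")
```

Both pairs clear the cut-off here, so a threshold loose enough to catch the misspelling also pulls in a different person. That is the kind of false match that is hard to rule out when the incoming names change with every run.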
The attached configuration identifies all duplicates in your sample data.
Try different options for Match Function, such as Jaro Distance and Levenshtein, and experiment with different Match Threshold values.
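If it helps to see why the threshold matters, below is a minimal Python sketch of a Levenshtein-based score applied to the sample names. It is only an illustration of the general idea: the `similarity` scaling (one minus edit distance divided by the longer length) and the 0.80 cut-off are assumptions for this example, not how the Fuzzy Match tool computes or scales its scores.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute (or keep)
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale edit distance to 0..1 (1.0 = identical), ignoring case."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


names = ["Colin Hayward", "Colin Haywood", "Colin Hayword",
         "Colin Heywood", "Collin Hayword"]
THRESHOLD = 0.80  # assumed cut-off, purely illustrative

reference = names[0]
for name in names[1:]:
    value = similarity(reference, name)
    verdict = "match" if value >= THRESHOLD else "no match"
    print(f"{name}: {value:.2f} -> {verdict}")
```

With this scoring, "Colin Heywood" is three edits away from "Colin Hayward" and falls just under 0.80, while the other spellings pass. Lowering the cut-off slightly would catch it, at the cost of letting more unrelated names through.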
Like you mentioned, a Fuzzy Match will never be perfect.
Chris