Tuning the fuzzy matching tool
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Hello the community,
I try to understand the logic and the functions of the various options for the fuzzy matching tool.
I'm currently having difficulties to use it on my workflow here attached.
What I try to do:
I have a table relatively messy, with countries and cities, and I would like to correct it based on the Cities database.
I would like first correct the countries (I might be able to correct both at once, but I know that some cities are in the country field, and vice versa, and that might not work as expected).
If a country cannot be identified in the Cities database, I rename it "Others".
But I see a lot of unmatched countries that should have been identified.
What could be changed to improved these results?
Thanks in advance for your help.
Solved! Go to Solution.
- Labels:
- Fuzzy Match
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Thank you Josh for your reply.
I'm analysing it by looking at what has been accepted and what has been discarded at each step.
And I observe curious behaviors.
- First, I will have to check between each step if some cities and countries have been misplaced (a city in the country field and the corresponding country in the city field). It is for example the case at the 4th step (match at 85%) with the city of Macau (China) that is recognized as the Macao country.
- In the same step, I see Austria-AT that is recognized as Austria in record #1, and as Australia in record #8.
Is there a way to prevent it ? I could remove any character after the "-" character, but that could prevent the detection of other countries. - Another effect I don't really understand: If I count the number of results after the first join tool, there are 106 records in the Left part, 167 records in the Join part, but only 218 records enter in this tool. Which means the fuzzy match tool produces duplicates. How can it be possible?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
- Unfortunately Macau and Macao will match off as the names are pretty close to each other. You could perhaps run a check against a city database first to find and remove any exact matches before processing the remaining cities.
- & 3. The reason Austria-AT matches with both Austria and Australia is because it is a duplicate based on the match key. It looks like the fuzzy match tool shows all possible matches within the threshold and we must choose the best option based on the score each time. This must be where our duplicates are coming from.
Now unfortunately Australia has a higher match score than Austria (88 to 85). I personally believe this is because of the "-AT" portion in the country name. I've removed the hyphen with the formula tool using Regex.
I've updated the flow to accommodate the above.
Adjustment:
Formula:
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Hello Josh,
I would like to thank you again for your help.
I've continued to work on the fuzzy match process. It's working quite well now.
I finally did the following:
Fuzzy match in 4 steps:
1/ Fuzzy match on countries only, to identify correct country names
1.a/ I only look on a match between country columns, with a strong match percentage
1.b/ For records that don't match, I compare the city column with the country column of the reference dataset.
1.c/ I start the process again, at a lower match percentage
2/ Fuzzy match on cities (and countries, to prevent the case of the same city name in various countries)
I've also include a dictionary to modify few countries before the fuzzy match process, in order to get better results.
It was quite long to create and to parameter, but the result seems very satisfying now.
The only remaining uncertainty: will the result be still satisfying enough if I get new data, as the process will be automatized.
By the way, is there a doc or a video that help to parameter the fuzzy match tool?
I find it official description (https://help.alteryx.com/11.3/FuzzyEditMatchOptions.htm) not very clear on how to customize it.
