Hello the community,
I try to understand the logic and the functions of the various options for the fuzzy matching tool.
I'm currently having difficulties to use it on my workflow here attached.
What I try to do:
I have a table relatively messy, with countries and cities, and I would like to correct it based on the Cities database.
I would like first correct the countries (I might be able to correct both at once, but I know that some cities are in the country field, and vice versa, and that might not work as expected).
If a country cannot be identified in the Cities database, I rename it "Others".
But I see a lot of unmatched countries that should have been identified.
What could be changed to improved these results?
Thanks in advance for your help.
Solved! Go to Solution.
Thank you Josh for your reply.
I'm analysing it by looking at what has been accepted and what has been discarded at each step.
And I observe curious behaviors.
Now unfortunately Australia has a higher match score than Austria (88 to 85). I personally believe this is because of the "-AT" portion in the country name. I've removed the hyphen with the formula tool using Regex.
I've updated the flow to accommodate the above.
Adjustment:
Formula:
Hello Josh,
I would like to thank you again for your help.
I've continued to work on the fuzzy match process. It's working quite well now.
I finally did the following:
Fuzzy match in 4 steps:
1/ Fuzzy match on countries only, to identify correct country names
1.a/ I only look on a match between country columns, with a strong match percentage
1.b/ For records that don't match, I compare the city column with the country column of the reference dataset.
1.c/ I start the process again, at a lower match percentage
2/ Fuzzy match on cities (and countries, to prevent the case of the same city name in various countries)
I've also include a dictionary to modify few countries before the fuzzy match process, in order to get better results.
It was quite long to create and to parameter, but the result seems very satisfying now.
The only remaining uncertainty: will the result be still satisfying enough if I get new data, as the process will be automatized.
By the way, is there a doc or a video that help to parameter the fuzzy match tool?
I find it official description (https://help.alteryx.com/11.3/FuzzyEditMatchOptions.htm) not very clear on how to customize it.