Get Inspire insights from former attendees in our AMA discussion thread on Inspire Buzz. ACEs and other community members are on call all week to answer!

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Tuning the fuzzy matching tool

FVarlet
5 - Atom

Hello the community,

 

I try to understand the logic and the functions of the various options for the fuzzy matching tool.

I'm currently having difficulties to use it on my workflow here attached.

 

What I try to do:

I have a table relatively messy, with countries and cities, and I would like to correct it based on the Cities database.

I would like first correct the countries (I might be able to correct both at once, but I know that some cities are in the country field, and vice versa, and that might not work as expected).

If a country cannot be identified in the Cities database, I rename it "Others".

But I see a lot of unmatched countries that should have been identified.

 

What could be changed to improved these results?

 

Thanks in advance for your help.

4 REPLIES 4
JoshKushner
12 - Quasar

Try and fuzzy match with descending match percentages (100% -> 95% -> 90% -> 85% -> 80%) Then unite everything. 

I've attached an updated version of your flow.

 

Flow:

Flow.PNG

 

Results:

result 1.PNG

result 2.PNG

 

FVarlet
5 - Atom

Thank you Josh for your reply.

 

I'm analysing it by looking at what has been accepted and what has been discarded at each step.

And I observe curious behaviors.

  • First, I will have to check between each step if some cities and countries have been misplaced (a city in the country field and the corresponding country in the city field). It is for example the case at the 4th step (match at 85%) with the city of Macau (China) that is recognized as the Macao country.
  • In the same step, I see Austria-AT that is recognized as Austria in record #1, and as Australia in record #8.
    Is there a way to prevent it ?  I could remove any character after the "-" character, but that could prevent the detection of other countries.
  • Another effect I don't really understand: If I count the number of results after the first join tool, there are 106 records in the Left part, 167 records in the Join part, but only 218 records enter in this tool. Which means the fuzzy match tool produces duplicates. How can it be possible?
JoshKushner
12 - Quasar
  1. Unfortunately Macau and Macao will match off as the names are pretty close to each other. You could perhaps run a check against a city database first to find and remove any exact matches before processing the remaining cities.
  2. &  3. The reason Austria-AT matches with both Austria and Australia is because it is a duplicate based on the match key. It looks like the fuzzy match tool shows all possible matches within the threshold and we must choose the best option based on the score each time. This must be where our duplicates are coming from.

Now unfortunately Australia has a higher match score than Austria (88 to 85). I personally believe this is because of the "-AT" portion in the country name. I've removed the hyphen with the formula tool using Regex.

 

I've updated the flow to accommodate the above.

 

Adjustment:

adjust.PNG

 

Formula:

formula.PNG

 

 

FVarlet
5 - Atom

Hello Josh,

 

I would like to thank you again for your help.

 

I've continued to work on the fuzzy match process. It's working quite well now.

I finally did the following:

Fuzzy match in 4 steps:
1/ Fuzzy match on countries only, to identify correct country names
1.a/ I only look on a match between country columns, with a strong match percentage
1.b/ For records that don't match, I compare the city column with the country column of the reference dataset.
1.c/ I start the process again, at a lower match percentage

2/ Fuzzy match on cities (and countries, to prevent the case of the same city name in various countries)

 

I've also include a dictionary to modify few countries before the fuzzy match process, in order to get better results.

 

It was quite long to create and to parameter, but the result seems very satisfying now.

The only remaining uncertainty: will the result be still satisfying enough if I get new data, as the process will be automatized.

 

By the way, is there a doc or a video that help to parameter the fuzzy match tool?

I find it official description (https://help.alteryx.com/11.3/FuzzyEditMatchOptions.htm) not very clear on how to customize it.

Labels