Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Finding possible duplicates in a column

parria1
8 - Asteroid

Each month I have a dirty list compiled from different sources I need to clean. There are a few columns where I want to catch data entry variations so I can be alerted and then fix them with a find and replace via lookup table.

 

I have a column that can contain entries like the below

 

SMITH CITY REG HOSP, INC.
SMITHCITY REGIONAL HOSPITAL, INC.

 

SMITH HEALTH SYSTEM
SMITH HEALTH SYSTEMS

 

SMITH CENTER HOLDINGS
SMITH CENTER HOLDINGS, LLC

 

I need a workflow that will return entries like this so i can review and decide what should be the proper entry.

How would I do this in Alteryx?

6 REPLIES 6
Thableaus
17 - Castor
17 - Castor

Hi @parria1 

 

I recommend you to check the Fuzzy Match tool.

 

Cheers,

parria1
8 - Asteroid

I am using Fuzzy Match and hoping there was an alternative tool and method. I have followed the fuzzy match purge configuration in the help section and it is returning 7.5M results. I am not sure what that is doing. I just need the small list of possible duplicates.

I am using Name match style on the field.

 

Thableaus
17 - Castor
17 - Castor

@parria1 

 

I think you're in the right path. 

Maybe you need to make some adjustments to your Fuzzy Match Style so you can get your duplicates right. 

With your data I was able to do this:

FuzzyMatchSolution.PNG

 

Don't forget to use a Unique Tool after your Fuzzy Match output, otherwise you'll get a lot of similar outputs.

Could you provide some data?

 

See workflow attached with an example.

 

Cheers,

 

parria1
8 - Asteroid

Hi, thank you, yes I appear to be doing exactly what you are doing but am still getting 7.5M results.

Here is my data with sensitive information edited out.

Thableaus
17 - Castor
17 - Castor

@parria1 

 

I'm aware that Fuzzy Match is not such an easy thing to learn. Many people say is an art to master, so here are a few considerations and tips that I can give you:

 

1) Clean your data

I saw you had a lot of empty records and duplicates going into the Fuzzy Match Tool. Don't do that. Clean your data before so you won't have unexpected results like you did. I added a Filter to get only "not empty' records and a Unique Tool to compare only unique records.

print1.PNG

 

2) Watch out your Match Threshold

90% was a high Threshold. It can leave out a lot of possible matches. You have to test as many thresholds you can so you can reach the optimal result. So I changed it to 80%, but you can try with a lower Threshold. There are two thresholds - the one of your Match Style (which is "Name", and by default is 85%) and the overall Threshold. You can play with both of them to get what you desire.

 

insidethreshold.PNGprint2.PNG

 

3) Fuzzy Match sometimes can help, but won't solve all problems at once.

 

I noticed that there are multiple instances of "VA-VISN" records with numbers in front of them. Maybe you could use a REGEX_REPLACE to handle those. Fuzzy Match is useful for spelling mistakes or very similar and variable names. That's up to you, but think about it.

print3.PNG

 

 

I'm not being able to attach the workflow, but these were the changes I made. Let me know if you have any questions.

 

 

Cheers,

parria1
8 - Asteroid

Thank you I had a few tools in the wrong order so my list was not unique and I confused is not empty with is not null.

 

I managed to get the output I want. The VISNs will cause issues but unfortunately I need them just in case there is some odd data entry on them as well.

 

Thanks!Capture.PNG

Labels