Each month I have a dirty list compiled from different sources I need to clean. There are a few columns where I want to catch data entry variations so I can be alerted and then fix them with a find and replace via lookup table.
I have a column that can contain entries like the below
SMITH CITY REG HOSP, INC.
SMITHCITY REGIONAL HOSPITAL, INC.
SMITH HEALTH SYSTEM
SMITH HEALTH SYSTEMS
SMITH CENTER HOLDINGS
SMITH CENTER HOLDINGS, LLC
I need a workflow that will return entries like this so i can review and decide what should be the proper entry.
How would I do this in Alteryx?
Solved! Go to Solution.
I am using Fuzzy Match and hoping there was an alternative tool and method. I have followed the fuzzy match purge configuration in the help section and it is returning 7.5M results. I am not sure what that is doing. I just need the small list of possible duplicates.
I am using Name match style on the field.
I think you're in the right path.
Maybe you need to make some adjustments to your Fuzzy Match Style so you can get your duplicates right.
With your data I was able to do this:
Don't forget to use a Unique Tool after your Fuzzy Match output, otherwise you'll get a lot of similar outputs.
Could you provide some data?
See workflow attached with an example.
Cheers,
I'm aware that Fuzzy Match is not such an easy thing to learn. Many people say is an art to master, so here are a few considerations and tips that I can give you:
1) Clean your data
I saw you had a lot of empty records and duplicates going into the Fuzzy Match Tool. Don't do that. Clean your data before so you won't have unexpected results like you did. I added a Filter to get only "not empty' records and a Unique Tool to compare only unique records.
2) Watch out your Match Threshold
90% was a high Threshold. It can leave out a lot of possible matches. You have to test as many thresholds you can so you can reach the optimal result. So I changed it to 80%, but you can try with a lower Threshold. There are two thresholds - the one of your Match Style (which is "Name", and by default is 85%) and the overall Threshold. You can play with both of them to get what you desire.
3) Fuzzy Match sometimes can help, but won't solve all problems at once.
I noticed that there are multiple instances of "VA-VISN" records with numbers in front of them. Maybe you could use a REGEX_REPLACE to handle those. Fuzzy Match is useful for spelling mistakes or very similar and variable names. That's up to you, but think about it.
I'm not being able to attach the workflow, but these were the changes I made. Let me know if you have any questions.
Cheers,
Thank you I had a few tools in the wrong order so my list was not unique and I confused is not empty with is not null.
I managed to get the output I want. The VISNs will cause issues but unfortunately I need them just in case there is some odd data entry on them as well.
Thanks!