Hi, I have two sets of data (company names) that I have been able to fuzzy match successfully. However, I have the following issues:
- Many names are matched purely on common words such as 'Corporation', 'International', 'Holding', 'Inc', 'Ltd', etc. I want to eliminate matches that rest only on words from such a list. How do I do that? If I increase the match threshold, it also eliminates some key matches, e.g.:
ABC Partners Inc & ABC Inc
- I am also getting multiple matches for a single company name, each with a different match score. Is it possible to keep only the match with the highest score?
To exclude specific words from the fuzzy match, click 'Edit' and, in the Fuzzy Match configuration, tick 'Generate Keys for Each Word'. You can then either type the words you listed in your post into the blank box or use the pre-configured dropdown selections.
To remove duplicates and keep the highest match score, add a Sort tool after the Fuzzy Match tool (match score, descending) and then a Unique tool on RecordID & RecordID2. This keeps the first row for each unique combination and de-duplicates your data.
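For anyone who wants to check the sort-then-unique logic outside Alteryx, here is a minimal Python sketch. It is only an illustration: the (RecordID, RecordID2, MatchScore) tuple layout and the sample values are assumptions, not the Fuzzy Match tool's actual output. It also assumes you want one row per left-hand record, which corresponds to running Unique on RecordID alone.

```python
def best_match_per_record(matches):
    """Keep only the highest-scoring match for each left-hand record ID."""
    best = {}
    # Sort by score descending, then keep the first row seen per record ID.
    for record_id, record_id2, score in sorted(
        matches, key=lambda m: m[2], reverse=True
    ):
        best.setdefault(record_id, (record_id, record_id2, score))
    return sorted(best.values())


# Hypothetical fuzzy-match output: (RecordID, RecordID2, MatchScore)
matches = [
    (1, 101, 72.5),
    (1, 102, 91.0),
    (2, 103, 88.0),
    (2, 101, 60.0),
]

print(best_match_per_record(matches))
# [(1, 102, 91.0), (2, 103, 88.0)]
```

Sorting first and taking the first row per key is the same idea as Sort followed by Unique: the order decides which duplicate survives.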
Thanks for your reply. I tried what you suggested, but my data sources contain tens of thousands of rows and I have hundreds of common words. I also realised that the Fuzzy Match tool does not match whole words exactly when comparing two strings: sometimes, if one string merely has n characters in common with the other, the match score is 50 plus. I tried Jaro, but it throws up far too many irrelevant results, so I switched to Words by Lev., but the common word list is still big and, I believe, case sensitive too. Is there any way other than manual input in that box to let Alteryx know?
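One possible way to avoid typing hundreds of words by hand is to derive the list from the data itself by counting how often each word appears across all names. The Python sketch below is an illustration only and is not an Alteryx feature; the function name `frequent_tokens`, the `min_share` cut-off and the sample names are all assumptions. It splits on whitespace only, so punctuation such as "Inc." would need cleaning first.

```python
from collections import Counter


def frequent_tokens(names, min_share=0.5):
    """Return upper-cased tokens that occur in at least `min_share` of the names."""
    counts = Counter()
    for name in names:
        # Count each token once per name so repeats inside one name
        # do not inflate its frequency.
        counts.update(set(name.upper().split()))
    threshold = min_share * len(names)
    return {token for token, n in counts.items() if n >= threshold}


# Hypothetical sample; real data would have thousands of rows.
names = ["ABC Partners Inc", "ABC Inc", "XYZ Holdings Inc", "Delta Resources Inc"]

print(frequent_tokens(names, min_share=0.75))
# {'INC'}
```

The resulting set would still need a quick review before use, since a genuinely distinctive word can be frequent in a small or skewed dataset.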
I realised that for case sensitivity I can convert all the data to uppercase, so that is sorted. However, for common words, if I have a combination of two or three common words such as INTERNATIONAL HOLDINGS LTD or RESOURCES CORPORATION, how do I restrict those matches?
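One way to think about the multi-word case is to strip every common word before comparing, so that only the distinctive part of each name is scored. The Python sketch below illustrates that idea under stated assumptions: it uses the standard library's `difflib.SequenceMatcher` rather than Alteryx's Jaro or Levenshtein scoring, and the `COMMON_WORDS` list, `core_name` and `core_score` are hypothetical names made up for this example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list; extend it with whatever generic words appear in your data.
COMMON_WORDS = {
    "INTERNATIONAL", "HOLDINGS", "HOLDING", "LTD", "INC",
    "RESOURCES", "CORPORATION", "PARTNERS",
}


def core_name(name):
    """Upper-case the name, drop punctuation, and remove the common words."""
    tokens = re.findall(r"[A-Z0-9]+", name.upper())
    return " ".join(t for t in tokens if t not in COMMON_WORDS)


def core_score(a, b):
    """Similarity (0-100) of two names after common words are removed."""
    core_a, core_b = core_name(a), core_name(b)
    if not core_a or not core_b:
        # Nothing distinctive is left on one side, so refuse to match.
        return 0.0
    return 100 * SequenceMatcher(None, core_a, core_b).ratio()


# Same distinctive part, different common words: scores 100.
print(core_score("ABC Partners Inc", "ABC Inc"))
# Different distinctive parts, identical common words: scores low.
print(core_score("Acme International Holdings Ltd",
                 "Zenith International Holdings Ltd"))
# A name made only of common words never matches.
print(core_score("Resources Corporation",
                 "International Resources Corporation"))
```

Because whole tokens are removed, a combination like INTERNATIONAL HOLDINGS LTD is handled without listing every combination separately; each word only needs to appear once in the list.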