This site uses different types of cookies, including analytics and functional cookies (its own and from other sites). To change your cookie settings or find out more, click here. If you continue browsing our website, you accept these cookies.
I'm struggling with fuzzy matching on company names using purge mode. I have hundreds of thousands of company names that require fuzzy matching and grouping. Since most of the data appears to have similar inconsistencies, I wanted to start with a small sample of 3 companies and hopefully apply a similar logic throughout the entire data set. The sample data is attached and I'm trying to achieve these three groups:
Now with your naked eye, tell me if they are the same business or different? I do wonder how you answered the question because they look very much the same to me. In your example, the LG Electronics, Inc could be a holding company with MobleComm as an entity and entirely different business. Equally possible, DLG could be a parent company and the other names are different.
For your matching and grouping you could be grouping by names that are at a single address or across the globe. The looser the fuzzy match rules are, the more "overmerge" conditions you'll introduce. I recommend that you consider what you want your outcome to be in terms of quality and consider this approach:
Find EXACT matches
Create a "Normal" form of the terms and look for Exact matches on it
Use fuzzy matching on names with a very strict set of rules
You can TOKENIZE the data so that every word becomes its own record. Then summarize the words and count them. Sort the counted words in descending order and you will have common words at the top and unique words at the bottom. Visually decide which are the common words (e.g. Inc, LTD) and use them in a custom normalization routine to remove them from your Name field. Note that you can use this table in your fuzzy match tool, but it takes a little bit of research to create and save to the right location to make the custom table accessible to the tool.
You would still get different versions of the same name, but you'll reduce the data substantially. Using regular expressions, you can remove special characters and possibly see number sets that you'll want to remove. In my opinion, this is a one-time exercise where Alteryx supercharges your manual analysis. Once you create a repeatable process, the data will be consistently matched. If you try to put the data into a fuzzy match without understanding your data, you will likely be disappointed in the results.
I hope that this helps,
Alteryx ACE & Top Community Contributor
Chaos reigns within. Repent, reflect and reboot. Order shall return.