Community Spring Cleaning week is here! Join your fellow Maveryx in digging through your old posts and marking comments on them as solved. Learn more here!

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Fuzzy match - purge mode

macd279
7 - Meteor

Hello community,

 

I'm struggling with fuzzy matching on company names using purge mode. I have hundreds of thousands of company names that require fuzzy matching and grouping. Since most of the data appears to have similar inconsistencies, I wanted to start with a small sample of 3 companies and hopefully apply a similar logic throughout the entire data set. The sample data is attached and I'm trying to achieve these three groups:

 

LG Electronics, Inc

LG Electronics MobileComm U.S.A., Inc

DLG Electronics Technology (SHENZHEN) CO., LTD

 

As always, thank you for your help!

2 REPLIES 2
MarqueeCrew
20 - Arcturus
20 - Arcturus

@macd279,

 

Let's look at 4 example business names first:

  1. Walmart Superstore 1234 Subway
  2. Walmart Superstore 1234 Supercuts
  3. Walmart Superstore 1234 Pharmacy
  4. Walmart Superstore 1234

Now with your naked eye, tell me if they are the same business or different?  I do wonder how you answered the question because they look very much the same to me.  In your example, the LG Electronics, Inc could be a holding company with MobleComm as an entity and entirely different business.  Equally possible, DLG could be a parent company and the other names are different.

 

For your matching and grouping you could be grouping by names that are at a single address or across the globe.  The looser the fuzzy match rules are, the more "overmerge" conditions you'll introduce.  I recommend that you consider what you want your outcome to be in terms of quality and consider this approach:

 

  1. Find EXACT matches
  2. Create a "Normal" form of the terms and look for Exact matches on it
  3. Use fuzzy matching on names with a very strict set of rules

 

You can TOKENIZE the data so that every word becomes its own record.  Then summarize the words and count them.  Sort the counted words in descending order and you will have common words at the top and unique words at the bottom.  Visually decide which are the common words (e.g. Inc, LTD) and use them in a custom normalization routine to remove them from your Name field.  Note that you can use this table in your fuzzy match tool, but it takes a little bit of research to create and save to the right location to make the custom table accessible to the tool.

 

You would still get different versions of the same name, but you'll reduce the data substantially.  Using regular expressions, you can remove special characters and possibly see number sets that you'll want to remove.  In my opinion, this is a one-time exercise where Alteryx supercharges your manual analysis.  Once you create a repeatable process, the data will be consistently matched.  If you try to put the data into a fuzzy match without understanding your data, you will likely be disappointed in the results.

 

I hope that this helps,

 

Mark

Alteryx ACE & Top Community Contributor

Chaos reigns within. Repent, reflect and restart. Order shall return.
Please Subscribe to my youTube channel.
macd279
7 - Meteor

Thanks Mark! This is a much better solution  - taking the different approach is giving me much improved results. Thanks for taking the time to help and explain.

Labels