Hello, I have a data set that includes various names of individuals and businesses and I want to be able to get the name field cleaned up and group all the variations of the same name or business and remove the unnecessary characters and combine the name into 1 group and then summarize the number of transaction IDs I have for that individual. I have a lot of records with a lot of different names spelled in various ways so I would prefer not to have to do a contains calc for each name. Below I have included the before and after of what I am looking for. Also I would then like to tie it back to the main data source to use it as the new Business name field.
Before (current Data):
Business Name Transaction ID
Advanced Med | BBB33 |
Advd Med (Local) | TT22 |
Advanced Medicine (not Local) | TP89 |
Ben Frank | abc123 |
Benjamin Franklin | def123 |
Benjamin Franklin | ghi23 |
Advanced Medicine (not Local) | GGG22 |
Advanced Medicine (not Local)2223 | HH222 |
Advanced Medicine (not Local) | II222 |
After(What I would like to see)
Business Name | Count of transaction IDs |
Advanced Medicine | 5 |
Benjamin Franklin | 3 |
Solved! Go to Solution.
What would you recommend though if you have over 5000 records with a lot more business names. I have tried the fuzzy match tool and grouping but I cannot get it to work properly, and I end of with still different variations of business names I would like to have combined.
Hi @bmcmanu1
I can still recommend the lookup table as the names are repeated many times and with limited lookup values you will be able to achieve the output.
As I still do not have the insight on your 5000 records, recommending the same lookup value based on the sample sahred.
Many thanks
Shanker V
Hi @bmcmanu1 - this one is a challenge! I came up with something that will get you most of the way there, I think, but you will still have to use a little manual intervention. I'd recommend maybe adding in some steps in the middle to alert you if you get names that can't be categorized. Anyway, these are the basic steps I used:
- Regex to get rid of anything in parentheses and after
- A manual table of exceptions to handle things that can't be matched (in this case, the only one I just couldn't get to match was "Advd" - if you replace that with "Advanced" it matches)
- Fuzzy match by "Name w/Nicknames" (I'm sure you could tweak this for better matches, too)
- Replace original names with the matched names and count
I hope this helps!! Maybe someone with more Fuzzy Match experience could get you the rest of the way there.