Hello! I was wondering how I would be able to remove certain rows of data that are essentially duplicates, but cannot technically be removed using the Unique tool due to slight differences in the presented data. How would I go about removing the 1st row with an incomplete address? And maybe even the 3rd row with the abbreviated version of the address? I have attached an image for reference. Thank you in advance!
Solved! Go to Solution.
Hi @trvu8447
Dealing with humans is Hard! Specifically, human entered text. That's where the Fuzzy Match tool comes in. It analyzes the text in the inputs and attempts to score the items according to how likely it is that they represent the same underlying entity. Its tricky to use with many options, so read up on it in the tool mastery article, as well as googling the various algorithms available in the tool
The basic strategy is this.
1. Do a straight join on the field. This will remove all the exact duplicates and make it easier and faster to run the Fuzzy matching
2. Depending on the type of field that your processing, do some preprocessing to clean it up. For example, if you're analyzing an address field, you can use a lookup table coupled with a Find Replace to standardize the address parts. Street and St. both map to Street. Route, Rte and Rt all map to Route.
3 Feed the unmatched remainders into the Fuzzy Match tool. Make sure to specify the type of matching that you're doing, Name, Address, etc.
Dan
Solved, thank you!