I'm working on a project that requires me to match records from one source to records from a different source. Only 4 of the 50+ fields need to match. Once we have a match, we want the matching record from each source pulled into a new dataset as two records, with a unique ID added to each record to tie them together as a match. Remaining records from each source then move forward to other match testing.
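To make the workflow concrete, here is a minimal sketch of that exact-match step in Python. The four field names and the sample records are purely illustrative assumptions (the real data has 50+ fields); the point is the shape of the output: matched pairs sharing a generated ID, and leftovers passed on to the next test.

```python
import uuid

# Toy records from two sources; the four match fields below are
# hypothetical stand-ins for the real ones.
source_a = [
    {"id": "A1", "first": "Ann", "last": "Lee", "dob": "1990-01-02", "zip": "12345"},
    {"id": "A2", "first": "Bob", "last": "Ray", "dob": "1985-06-30", "zip": "99999"},
]
source_b = [
    {"id": "B1", "first": "Ann", "last": "Lee", "dob": "1990-01-02", "zip": "12345"},
    {"id": "B2", "first": "Cal", "last": "Kim", "dob": "1970-12-12", "zip": "55555"},
]

MATCH_FIELDS = ("first", "last", "dob", "zip")

def key(rec):
    # Composite key over the four match fields only.
    return tuple(rec[f] for f in MATCH_FIELDS)

index_b = {key(r): r for r in source_b}

matched, remaining_a = [], []
matched_b_keys = set()
for rec_a in source_a:
    k = key(rec_a)
    if k in index_b:
        match_id = str(uuid.uuid4())  # unique ID tying the pair together
        matched.append({**rec_a, "match_id": match_id})
        matched.append({**index_b[k], "match_id": match_id})
        matched_b_keys.add(k)
    else:
        remaining_a.append(rec_a)  # moves on to the next match test

remaining_b = [r for r in source_b if key(r) not in matched_b_keys]
```

Here the A1/B1 pair lands in `matched` with a shared `match_id`, while A2 and B2 fall through to `remaining_a` and `remaining_b`.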
Now unfortunately, the data in the fields we match on may have extra white space, inconsistent formatting, hidden characters, missing data, etc. So before we enter the matching process we have strung many tools together to "clean up" the data so it has half a chance of matching.
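As a rough illustration of what that cleaning chain has to cover, a single normalization function in Python might look like the sketch below. This is an assumption about a reasonable cleanup, not a description of the actual tool chain.

```python
import re
import unicodedata

def clean(value):
    """Normalize one field before matching: handle missing data,
    hidden characters, stray whitespace, and casing."""
    if value is None:
        return ""  # treat missing data as an empty string
    # Normalize Unicode forms (e.g. non-breaking spaces, compatibility chars)
    value = unicodedata.normalize("NFKC", str(value))
    # Drop non-printable / control ("hidden") characters
    value = "".join(ch for ch in value if ch.isprintable())
    # Collapse runs of whitespace and trim the ends
    value = re.sub(r"\s+", " ", value).strip()
    # Case-insensitive comparison
    return value.casefold()
```

For example, `clean("  Ann\u00a0 LEE ")` comes out as `"ann lee"`, the same as `clean("Ann Lee")`, so the two would survive an exact join.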
I currently have it written as a join, but I'm getting pushback that the Join tool should not be used at all and the Fuzzy Match tool should be used instead. The debate we are having: is it better to make the matches with a join, with the Fuzzy Match tool, or with something else entirely, in terms of the reliability of getting correct matches if some uncleaned data makes it through our initial "cleaning" process?
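The crux of that reliability question can be shown with plain Python (this is not the Fuzzy Match tool itself, just the underlying idea, sketched with the standard library's `difflib`): an exact comparison fails on any residual dirt, while a similarity score with a threshold can still treat near-identical values as a match.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Ratio in [0.0, 1.0]; 1.0 means identical strings.
    return SequenceMatcher(None, a, b).ratio()

clean_value = "ann lee"
dirty_value = "ann  lee"  # one stray extra space survived cleaning

# An exact join misses the pair entirely...
exact_match = clean_value == dirty_value          # False

# ...while a fuzzy threshold still catches it.
fuzzy_match = similarity(clean_value, dirty_value) >= 0.9  # True
```

The flip side, of course, is that a threshold can also pull in pairs that genuinely should not match, which is why the two approaches trade false negatives against false positives.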
I can see benefits to either approach, and I'm guessing it's partly a question of personal preference, but I'm looking for insight from the community, which has far more experience than I do, especially when it comes to fuzzy matching.
Thoughts and/or suggestions greatly appreciated!
Since I know I need to bring evidence to the debate, I'm currently building an example of each to see which is more reliable and efficient for our specific data.