I'm working on a project that requires me to match records from one source to records from a different source. Only 4 of the 50+ fields need to match. Once we have a match, we want the matching record from each source pulled into a new dataset as two records, with a unique ID added to each record to tie them together as a match. Remaining records from each source then move forward to other match testing.
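To make the workflow concrete, here is a minimal sketch of that exact-match step in Python. The four field names and the sample records are purely illustrative assumptions (the real data has 50+ fields); the point is the shape of the output: matched pairs sharing a generated ID, and leftovers passed on to the next test.

```python
import uuid

# Toy records from two sources; the four match fields below are
# hypothetical stand-ins for the real ones.
source_a = [
    {"id": "A1", "first": "Ann", "last": "Lee", "dob": "1990-01-02", "zip": "12345"},
    {"id": "A2", "first": "Bob", "last": "Ray", "dob": "1985-06-30", "zip": "99999"},
]
source_b = [
    {"id": "B1", "first": "Ann", "last": "Lee", "dob": "1990-01-02", "zip": "12345"},
    {"id": "B2", "first": "Cal", "last": "Kim", "dob": "1970-12-12", "zip": "55555"},
]

MATCH_FIELDS = ("first", "last", "dob", "zip")

def key(rec):
    # Composite key over the four match fields only.
    return tuple(rec[f] for f in MATCH_FIELDS)

index_b = {key(r): r for r in source_b}

matched, remaining_a = [], []
matched_b_keys = set()
for rec_a in source_a:
    k = key(rec_a)
    if k in index_b:
        match_id = str(uuid.uuid4())  # unique ID tying the pair together
        matched.append({**rec_a, "match_id": match_id})
        matched.append({**index_b[k], "match_id": match_id})
        matched_b_keys.add(k)
    else:
        remaining_a.append(rec_a)  # moves on to the next match test

remaining_b = [r for r in source_b if key(r) not in matched_b_keys]
```

Here the A1/B1 pair lands in `matched` with a shared `match_id`, while A2 and B2 fall through to `remaining_a` and `remaining_b`.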
Now unfortunately, the data in the fields we match on may have extra white space, inconsistent formatting, hidden characters, missing data, etc. So before we enter the matching process we have strung many tools together to "clean up" the data so it has half a chance of matching.
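As a rough illustration of what that cleaning chain has to cover, a single normalization function in Python might look like the sketch below. This is an assumption about a reasonable cleanup, not a description of the actual tool chain.

```python
import re
import unicodedata

def clean(value):
    """Normalize one field before matching: handle missing data,
    hidden characters, stray whitespace, and casing."""
    if value is None:
        return ""  # treat missing data as an empty string
    # Normalize Unicode forms (e.g. non-breaking spaces, compatibility chars)
    value = unicodedata.normalize("NFKC", str(value))
    # Drop non-printable / control ("hidden") characters
    value = "".join(ch for ch in value if ch.isprintable())
    # Collapse runs of whitespace and trim the ends
    value = re.sub(r"\s+", " ", value).strip()
    # Case-insensitive comparison
    return value.casefold()
```

For example, `clean("  Ann\u00a0 LEE ")` comes out as `"ann lee"`, the same as `clean("Ann Lee")`, so the two would survive an exact join.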
I currently have it written as a join, but I'm getting pushback that the Join tool should not be used at all and the Fuzzy Match tool should be used instead. The debate we are having: is it better to make the matches with a join, with the Fuzzy Match tool, or with something else entirely, in terms of the reliability of getting correct matches if some uncleaned data makes it through our initial "cleaning" process?
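The crux of that reliability question can be shown with plain Python (this is not the Fuzzy Match tool itself, just the underlying idea, sketched with the standard library's `difflib`): an exact comparison fails on any residual dirt, while a similarity score with a threshold can still treat near-identical values as a match.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Ratio in [0.0, 1.0]; 1.0 means identical strings.
    return SequenceMatcher(None, a, b).ratio()

clean_value = "ann lee"
dirty_value = "ann  lee"  # one stray extra space survived cleaning

# An exact join misses the pair entirely...
exact_match = clean_value == dirty_value          # False

# ...while a fuzzy threshold still catches it.
fuzzy_match = similarity(clean_value, dirty_value) >= 0.9  # True
```

The flip side, of course, is that a threshold can also pull in pairs that genuinely should not match, which is why the two approaches trade false negatives against false positives.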
I can see benefits to either approach, and I'm guessing it's partly a question of personal preference, but I'm looking for insight from the community, which has far more experience than I do, especially when it comes to fuzzy matching.
Thoughts and/or suggestions greatly appreciated!
Since I know I need to bring evidence to the debate, I'm currently building an example of each to see which is more reliable and efficient for our specific data.