This site uses different types of cookies, including analytics and functional cookies (its own and from other sites). To change your cookie settings or find out more, click here. If you continue browsing our website, you accept these cookies.
I have been researching fuzzy match and I just can't seem to figure it out! Hoping to be pointed in the right direction. :)
I am trying to match up carrier tracking number data between 2 data sources. The data is mostly numerical, but can contain letters and hyphens as well. I am very confused as to how to create a custom match to best try to match up the data.
Some examples of what I am trying to match up
Could someone assist in helping me figure out what my options are with this sort of data?
Hi Shannon, Very simply stated, fuzzy match is a matching algorithm which takes care of non exact matching. i.e. it is not deterministic. It is generally used to take care of various kind of variations, some of the examples might be - 1) Typos (Michael vs Micheal) 2) Nicknames (Michael vs Mike) 3) Phonetic similarities etc. (Michael vs Mikael) 4) Changes coming out of various storage standards like - some times name stored as First Name, Last Name and at times as Last Name, First Name. Fuzzy match algorithms are based on various factors - 1) What is the population one is dealing with? For e.g. match for person name in one country might not be qualified as match in another. 2) What is the nature of underlying data and what real world entity it represents? etc. For e.g. algorithms for Addresses might vary compared to Person Names. Common items for fuzzy matching are - Person Names, Organization Names, Addresses etc. The examples you shared, I think you might like to do a data discovery/pattern analysis first. Clean the data as required , like for the example you shared - remove non numeric characters, remove all elements after(and including) punctuation like '-' etc. Post this you might like to see, how the data looks like and if at all it calls for a fuzzy matching. On the surface it looks like cleansing of data might work in your case and might not call for complex fuzzy routines. Thanks, Rohit Bajaj
Please try with the data cleansing part first and then see what the pattern looks like.
Also need to know the differences between storing mechanism of different source systems i.e. what actually classifies as a dulpicate -
For e.g. 123 and 132 - Are they same or they are different based on position of occurence of digits?
1234 and 12345 - Are they different or are same, say '5' is an additional information on top of 1234, then they can be taken as same etc.
For the case at hand -
If 12345 is some pseudo zip code and 12345-6789 is same zip followed by extension then the hypen along with following digits should be removed based on the functional knowledge. In this case post removal of zip extension gives us exact match.
Another e.g. - say 12345 is some license number and P12345 signifies that the license is given to a Person (P). In case we are dealing with a domain which in based on Persons only, P can be removed and again there would be an exact match.
Without doing the first round of cleansing supported by functional backround and data storing mechanisms, we would not have full picture as in which logic to use.
If you are looking for just a general answer then Hamming Distance is generally used for numeric data matching.
PS: I am yet to see Fuzzy Match Tool in Alteryx in detail. Above is more of a de-depulication answer.