I have been researching fuzzy match and I just can't seem to figure it out! Hoping to be pointed in the right direction. :)
I am trying to match up carrier tracking number data between 2 data sources. The data is mostly numerical, but can contain letters and hyphens as well. I am very confused as to how to create a custom match to best try to match up the data.
Some examples of what I am trying to match up
75489728654-1
75489728654
S85984368
85984368
Could someone assist in helping me figure out what my options are with this sort of data?
Thank you!
Solved! Go to Solution.
Also, if anyone has a great How-To fuzzy match that speaks to beginners like me, I would appreciate that as well.
Thank you! I am giving that a try. Which match style do you think would work best once I clean up the data?
You might want to check out the recordings from the Alteryx Virtual Training sessions.
http://www.alteryx.com/virtual-training
Look under the "Watch a Past Session" tab. There are at least a couple of sessions on fuzzy matching.
Hi Shannon,
Please try with the data cleansing part first and then see what the pattern looks like.
Also need to know the differences between storing mechanism of different source systems i.e. what actually classifies as a dulpicate -
For e.g. 123 and 132 - Are they same or they are different based on position of occurence of digits?
1234 and 12345 - Are they different or are same, say '5' is an additional information on top of 1234, then they can be taken as same etc.
For the case at hand -
If 12345 is some pseudo zip code and 12345-6789 is same zip followed by extension then the hypen along with following digits should be removed based on the functional knowledge. In this case post removal of zip extension gives us exact match.
Another e.g. - say 12345 is some license number and P12345 signifies that the license is given to a Person (P). In case we are dealing with a domain which in based on Persons only, P can be removed and again there would be an exact match.
Without doing the first round of cleansing supported by functional backround and data storing mechanisms, we would not have full picture as in which logic to use.
If you are looking for just a general answer then Hamming Distance is generally used for numeric data matching.
PS: I am yet to see Fuzzy Match Tool in Alteryx in detail. Above is more of a de-depulication answer.
Thanks,
Rohit Bajaj