Community Spring Cleaning week is here! Join your fellow Maveryx in digging through your old posts and marking comments on them as solved. Learn more here!

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Fuzzy Match Newbie

ShannonH
7 - Meteor

I have been researching fuzzy match and I just can't seem to figure it out! Hoping to be pointed in the right direction. :)

I am trying to match up carrier tracking number data between 2 data sources. The data is mostly numerical, but can contain letters and hyphens as well. I am very confused as to how to create a custom match to best try to match up the data. 

 

Some examples of what I am trying to match up

 

75489728654-1

75489728654

 

S85984368

85984368

 

 

Could someone assist in helping me figure out what my options are with this sort of data?

 

Thank you!

 

5 REPLIES 5
ShannonH
7 - Meteor

Also, if anyone has a great How-To fuzzy match that speaks to beginners like me, I would appreciate that as well. 

Rohit_Bajaj
9 - Comet
Hi Shannon, Very simply stated, fuzzy match is a matching algorithm which takes care of non exact matching. i.e. it is not deterministic. It is generally used to take care of various kind of variations, some of the examples might be - 1) Typos (Michael vs Micheal) 2) Nicknames (Michael vs Mike) 3) Phonetic similarities etc. (Michael vs Mikael) 4) Changes coming out of various storage standards like - some times name stored as First Name, Last Name and at times as Last Name, First Name. Fuzzy match algorithms are based on various factors - 1) What is the population one is dealing with? For e.g. match for person name in one country might not be qualified as match in another. 2) What is the nature of underlying data and what real world entity it represents? etc. For e.g. algorithms for Addresses might vary compared to Person Names. Common items for fuzzy matching are - Person Names, Organization Names, Addresses etc. The examples you shared, I think you might like to do a data discovery/pattern analysis first. Clean the data as required , like for the example you shared - remove non numeric characters, remove all elements after(and including) punctuation like '-' etc. Post this you might like to see, how the data looks like and if at all it calls for a fuzzy matching. On the surface it looks like cleansing of data might work in your case and might not call for complex fuzzy routines. Thanks, Rohit Bajaj
ShannonH
7 - Meteor

Thank you! I am giving that a try. Which match style do you think would work best once I clean up the data? 

RodL
Alteryx Alumni (Retired)

You might want to check out the recordings from the Alteryx Virtual Training sessions.

 

http://www.alteryx.com/virtual-training

 

Look under the "Watch a Past Session" tab. There are at least a couple of sessions on fuzzy matching.

 

Rohit_Bajaj
9 - Comet

Hi Shannon,

 

Please try with the data cleansing part first and then see what the pattern looks like.

Also need to know the differences between storing mechanism of different source systems i.e. what actually classifies as a dulpicate -

 

For e.g. 123 and 132 - Are they same or they are different based on position of occurence of digits?

1234 and 12345 - Are they different or are same, say '5' is an additional information on top of 1234, then they can be taken as same etc.

 

For the case at hand -

 

If 12345 is some pseudo zip code and 12345-6789 is same zip followed by extension then the hypen along with following digits should be removed based on the functional knowledge. In this case post removal of zip extension gives us exact match.

 

Another e.g. - say 12345 is some license number and P12345 signifies that the license is given to a Person (P). In case we are dealing with a domain which in based on Persons only, P can be removed and again there would be an exact match.

 

Without doing the first round of cleansing supported by functional backround and data storing mechanisms, we would not have full picture as in which logic to use.

 

If you are looking for just a general answer then Hamming Distance is generally used for numeric data matching.

 

PS: I am yet to see Fuzzy Match Tool in Alteryx in detail. Above is more of a de-depulication answer.

 

Thanks,

Rohit Bajaj

Labels