Alteryx Designer Desktop Knowledge Base

Definitive answers from Designer Desktop experts.

Tips and Tricks for Fuzzy Matching

Alteryx_KB
Alteryx
Alteryx
Created
  1. Any fields that you will be Fuzzy Matching on that relate to address information should be run through the CASS tool in order to standardize the record values. The original address fields should be deselected from the data stream via a Select tool so as to prevent confusion and to minimize processing time (recall optimization tips…).
  2. When merging two data files (Merge Fuzzy Match process), use a join to remove any exact matches from the fuzzy match process. Along the same line of thought, you can use the Unique tool to remove any exact matches if before entering your Purge Fuzzy Match process.
  3. In a Merge Fuzzy Match, usually the left side of the Match is the Master file (for example, the Experian HH file or the Info USA file). The right side is the customer file, or the file we are trying to match to the master file. Given this setup, in each of the different passes of the fuzzy match we do not send records that have a match from the left into the next pass if they have matched.
  4. The logic of the fuzzy match is that one or more things will be considered to be the strong thing or the thing we are most confident about. The other stuff will be the things we are going to isolate via fuzzy logic in order to find a match. Example would be Address and ZIP for the strong piece and First and Last name as the fuzzy match piece.
  5. Match Thresholds and Weights:
    1. For the matching process occurring within the entire scope of a Fuzzy Match tool we define the Total Match Threshold (the final score).
    2. For each field we are processing in the Fuzzy Match tool, we can declare a match threshold and a match weight for that field.
    3. Whether it occurs at the field-level or Fuzzy Match tool-level, the match threshold is a strict cut off value. Matches that generate a match value greater than your match threshold are accepted as matches, those that fall below, are dropped as match candidates.
    4. The match weight of a particular field, allows you to apply a relative importance to the various fields that you are matching upon. If you have 3 fields, where the match weight for each is A=100, B=100, C=75, then you are telling Alteryx that field C should have a lesser contribution to the determination of whether or not a match is made. (As humans, think of how we might select a partner to dance with…)
  6. After passing your data through the Fuzzy Match tool, join back to your original data to compare the matches. You can Sort by match score, to see if you need to adjust your match thresholds higher or lower depending on the presence of false-positive matches, or the absence of false negative matches. As Fuzzy Matching is inherently fuzzy, it is quite common, and in fact necessary to run your module many times with different parameters. As a general rule, the more consideration you place on developing a thorough fuzzy matching module, the greater the value of your output.
  7. Optimizing fuzzy matching processing time:
    1. Because fuzzy matching can require you to run your module many times, it is prudent to prep your data and save it out to a .yxdb file. Saving your data out to .yxdb files, will allow you to use the .yxdb files as an Input to your fuzzy matching module. Alteryx can read a .yxdb file faster than other file types, so this is a great place to start with optimization.
    2. Another step in data preparation, is use the Auto Field tool, which allows Alteryx to select the most appropriate field type and length for every field in your dataset. Depending upon your input data, this can provide shocking improvements in speed.
    3. Assuming you will be doing a merge fuzzy match, your files will require both a record ID field, and a source field; you might as well add them now.
    4. Lastly, there is no point in bringing fields into your Fuzzy Match module that you do not need, use a Select tool to remove them now.
    5. Finally, use your newly optimized files as .yxdb Inputs to your fuzzy matching module. –To summarize, prep data in one module, then Fuzzy Match it in another.
  8. Depending on whether you are DeDuping, Merging, or both, try to minimize your cross checking and extraneous Browse tools into the data when they are no longer necessary. By all means, cross-checking your data is the most important process in fuzzy matching, but for large modules, it is often helpful to remove tools that are no longer necessary.
  9. If you are new to Fuzzy Matching, the most important thing you can do is work through the 2 sample modules included with Alteryx (DeDupeFuzzyMatching, and MergeFuzzyMatch), which can be found at File > Open Sample > Advanced Samples.

fuzzy.PNG

Comments
JS_dup_135
5 - Atom

Hello, 

Thank you for this informative article. 

 

I would be keen to see your suggestions on how to exclude words from the fuzzy match. 

 

I am matching company names. I have records such as  'ABC International Transport Services Ltd'  and 'XYZ International Transport Services Ltd'.

 

They are different companies, but 'ABC' and 'XYZ' are a small proportion of the entire string, and hence 'International Transport Services' increases the match score to 95-97% and hence introduces false positives. 

 

I have tried excluding via "Don't generate keywords for the following words"  - but I am not sure if that made any improvement. 

 

Thanks

JS

AndyM
Alteryx
Alteryx

@JS_dup_135 Take a look at using Word Frequency Statistics as part of the Fuzzy Matching.  Here is a reference in the Help to that topic. https://help.alteryx.com/11.0/index.htm#FuzzyEditMatchOptions.htm?Highlight="Word Frequency Statistics"

 

If, for example "International" and "Transport" and "Services" occurred often in the data, the frequency stats would tell the Fuzzy Matching to place less emphasis on those words.  The higher the frequency of a word in the data, the less the emphasis for matching.

 

Andy

rdalley
6 - Meteoroid

Sorry but another newbie fuzzy matching question - this statement from Tips above has me wondering:

 

3. In a Merge Fuzzy Match, usually the left side of the Match is the Master file (for example, the Experian HH file or the Info USA file). The right side is the customer file, or the file we are trying to match to the master file. Given this setup, in each of the different passes of the fuzzy match we do not send records that have a match from the left into the next pass if they have matched.

 

I'm seeing why the author said "usually" as I've seen this to be the case and sort of maddening. Sometimes my "master file" keys are on the left and sometimes on they are on the right.  This makes for additional protective processing when matching up the fuzzy results to the ingoing data.  In other words, program your logic to consider the master file keys could be on the left or the right.  Is there any way to ensure the master is always on the left?

 

Rob

RistoKarinkanta
6 - Meteoroid

Hear hear to rdalley, it is indeed tedious to need to figure out after merge fuzzy match what is the right way of the values, in practice you need an extra join after fuzzy tool.