Advent of Code is back! Unwrap daily challenges to sharpen your Alteryx skills and earn badges along the way! Learn more now.

Engine Works

Under the hood of Alteryx: tips, tricks and how-tos.
ned_blog
8 - Asteroid

A question came from our own Professional Services group:

They are requiring a lot of fuzzy matching on business names. Is there any documentation on how the weighting occurs when using Word Frequency Stats. They have traditionally removed common words from the match name before matching using a third party matcher I am not familiar with. They had to do this because there was not the capability of the word stats. I need to be able to explain how this works, and compare this process to the matching results when frequent words are completely removed.

 

There is a good reason for weighting instead of removing common words. The following are all real company names from the D&B database: "CITY INC", "FIRM INC" "COMPANY INC" "THE COMPANY INC", "STORE INC", "CLUB INC", etc.... If we removed common words we'd have nothing left and couldn't possible match those names.

 

The solution is to use statistical weighted averages. Say we have a company named HARDING INC, we still count INC, but we count HARDING much higher. This allows us to match names that are only common names while at the same time not having the common words interfere with good match of less common names. Alteryx does this by supplying a statistics file with the word based matching.

 

The weight for a word is effectively:

 

log(totalWordCount/thisWordCount)

 

TotalWordCount is the sum of all the word counts in the statistics file.

 

thisWordCount is the count from the statistics file or if not present, 1.0

 

If the statistics file is truncated such that the last word in the stats has a count >1, all of the counts and the total are divided by the # such that the last word has an equivalent weight of an unknown word. This is to prevent a discontinuity in the math.

 

Below is a math workup from the generic US Company Statistics. The last word is OLSON with a count of 2605, so the whole file gets scaled down with that. The net effect is that an unknown word is weighted 3X higher than INC. and 1.9X higher than CHURCH.

 

If the statistics file were not truncated at 200 records (or a dummy record added with a count of 1), an unknown word would weight 4.4X INC and 3.5X CHURCH.

 

Below are some specific counts and their weights from our generic US Company statistics. The results are always better if you calculate the statistics with your own file.

 

Remember in the Alteryx Runtime directory there is a module you can use to create word frequency statistics on your own databases. That module is in the following directory: \Program Files\SRC\Alteryx4.1\RuntimeData\FuzzyMatch\CollectStats.yxmd

 

You will want to save your own resulting *.yxdb file to that directory so you can use it in your Fuzzy Match analysis.

 

Word Count Weight
Total 29190175  
& 1653048 1.246951
INC 1304729 1.349716
SVC 569248 1.709935
CO 548111 1.726368
CTR 328941 1.948119
OF 313313 1.969258
ASSOC 231113 2.101412
CHURCH 226418 2.110326
AUTO 220870 2.1211
A 217291 2.128195
MD 211605 2.139711
CORP 176711 2.217973
UnkownWord 2605 4.049429
FullStatsUnknown 1 7.465237