Engine Works

ned_blog · ‎09-25-2009

A question came from our own Professional Services group:

They are requiring a lot of fuzzy matching on business names. Is there any documentation on how the weighting occurs when using Word Frequency Stats. They have traditionally removed common words from the match name before matching using a third party matcher I am not familiar with. They had to do this because there was not the capability of the word stats. I need to be able to explain how this works, and compare this process to the matching results when frequent words are completely removed.

There is a good reason for weighting instead of removing common words. The following are all real company names from the D&B database: "CITY INC", "FIRM INC" "COMPANY INC" "THE COMPANY INC", "STORE INC", "CLUB INC", etc.... If we removed common words we'd have nothing left and couldn't possible match those names.

The solution is to use statistical weighted averages. Say we have a company named HARDING INC, we still count INC, but we count HARDING much higher. This allows us to match names that are only common names while at the same time not having the common words interfere with good match of less common names. Alteryx does this by supplying a statistics file with the word based matching.

The weight for a word is effectively:

log(totalWordCount/thisWordCount)

TotalWordCount is the sum of all the word counts in the statistics file.

thisWordCount is the count from the statistics file or if not present, 1.0

If the statistics file is truncated such that the last word in the stats has a count >1, all of the counts and the total are divided by the # such that the last word has an equivalent weight of an unknown word. This is to prevent a discontinuity in the math.

Below is a math workup from the generic US Company Statistics. The last word is OLSON with a count of 2605, so the whole file gets scaled down with that. The net effect is that an unknown word is weighted 3X higher than INC. and 1.9X higher than CHURCH.

If the statistics file were not truncated at 200 records (or a dummy record added with a count of 1), an unknown word would weight 4.4X INC and 3.5X CHURCH.

Below are some specific counts and their weights from our generic US Company statistics. The results are always better if you calculate the statistics with your own file.

Remember in the Alteryx Runtime directory there is a module you can use to create word frequency statistics on your own databases. That module is in the following directory: \Program Files\SRC\Alteryx4.1\RuntimeData\FuzzyMatch\CollectStats.yxmd

You will want to save your own resulting *.yxdb file to that directory so you can use it in your Fuzzy Match analysis.

Word	Count	Weight
Total	29190175
&	1653048	1.246951
INC	1304729	1.349716
SVC	569248	1.709935
CO	548111	1.726368
CTR	328941	1.948119
OF	313313	1.969258
ASSOC	231113	2.101412
CHURCH	226418	2.110326
AUTO	220870	2.1211
A	217291	2.128195
MD	211605	2.139711
CORP	176711	2.217973
UnkownWord	2605	4.049429
FullStatsUnknown	1	7.465237

Engine Works

Fuzzy Match: Word Statistics