Engine Works

aoneill · ‎06-18-2008

I was working with a client who has a contact database of 500 million records. About once a month they receive a smaller dataset to add to this growing database. Their first step is fuzzy matching, so that they don't add duplicates to the database. Fuzzy matching against 500 million records takes a long time. I took the problem to Ned, and he suggested using the "generate keys only" option in the fuzzy matching tool (when selected, it skips the more granular match function step). The keys would then be saved in a field in a master Calgary database. Generating and saving the keys up front speeds up future fuzzy matching runs.

Keys are then generated for the incoming small file with a fuzzy matching tool (again with "generate keys only" selected). One word of caution: the generate keys settings and field order must be identical to those used to generate keys for the master database. It is best to copy and paste the same fuzzy matching tool.
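To see why the settings and field order matter, here is a minimal Python sketch. It is not how the fuzzy matching tool builds its keys; `make_key`, its `prefix_len` parameter, and the sample record are hypothetical stand-ins, assuming only that a key is derived from the chosen fields in a fixed order.

```python
def make_key(record, fields, prefix_len=3):
    """Hypothetical key: the first few letters of each field, in the given order."""
    parts = []
    for name in fields:
        # Keep only letters and digits so punctuation does not affect the key.
        cleaned = "".join(ch for ch in record[name].lower() if ch.isalnum())
        parts.append(cleaned[:prefix_len])
    return "|".join(parts)


# Made-up sample record, for illustration only.
record = {"first": "Jon", "last": "Smith"}

# Key as stored in the master database.
master_key = make_key(record, ["last", "first"])

# Same settings on the incoming file: the keys line up.
same_settings = make_key(record, ["last", "first"])

# Different field order or different settings: the keys no longer line up,
# so a join on keys would silently miss this record.
swapped_order = make_key(record, ["first", "last"])
shorter_prefix = make_key(record, ["last", "first"], prefix_len=2)

print(master_key == same_settings)   # identical settings
print(master_key == swapped_order)   # field order changed
print(master_key == shorter_prefix)  # key setting changed
```

The same person produces a different key as soon as the field order or a key setting changes, which is why copying the already configured tool is safer than rebuilding its settings by hand.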

A Calgary Join on the keys then outputs a much smaller dataset to work with and compare. The output from the Calgary Join can go through one or more fuzzy matching tools (with both generate keys and the match function applied at this point). The cost of generating keys again is insignificant, since the dataset is now a more manageable size.
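The overall pattern (store keys for the master once, key the small file, join on keys, then run the full match only on the joined candidates) can be sketched in Python. This is an illustration under stated assumptions, not the Alteryx or Calgary implementation: `make_key` is a hypothetical key function, a dictionary stands in for the indexed master database, `difflib.SequenceMatcher` stands in for the match function, and the records and `THRESHOLD` value are made up.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def make_key(record):
    """Hypothetical key: first 3 letters of last name plus first initial."""
    last = "".join(ch for ch in record["last"].lower() if ch.isalpha())
    first = "".join(ch for ch in record["first"].lower() if ch.isalpha())
    return last[:3] + first[:1]


def similarity(a, b):
    """Stand-in for the granular match function: ratio on the full name."""
    name_a = f'{a["first"]} {a["last"]}'.lower()
    name_b = f'{b["first"]} {b["last"]}'.lower()
    return SequenceMatcher(None, name_a, name_b).ratio()


# Step 1 (done once): generate keys for the master and store them.
master = [
    {"id": 1, "first": "Jonathan", "last": "Smith"},
    {"id": 2, "first": "Maria", "last": "Garcia"},
    {"id": 3, "first": "Wei", "last": "Chen"},
]
master_index = defaultdict(list)
for rec in master:
    master_index[make_key(rec)].append(rec)

# Step 2 (each run): generate keys for the small incoming file
# with exactly the same key function.
incoming = [
    {"first": "Jon", "last": "Smith"},
    {"first": "Priya", "last": "Patel"},
]

# Step 3: join on keys, then run the full match on the candidates only.
THRESHOLD = 0.7  # assumed cut-off, chosen for this example
duplicates, new_records = [], []
for rec in incoming:
    candidates = master_index.get(make_key(rec), [])
    best = max(candidates, key=lambda m: similarity(rec, m), default=None)
    if best is not None and similarity(rec, best) >= THRESHOLD:
        duplicates.append((rec, best["id"]))
    else:
        new_records.append(rec)

print("duplicates:", duplicates)
print("new records:", new_records)
```

The expensive comparison runs only against master records that share a key with an incoming record, so its cost depends on the size of the small file rather than on the size of the master.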

-AmyO


Small to Big Fuzzy Matching