Now that it's live, don't forget to accept your certification badge on Credly today! Learn more here.
We are currently experiencing an issue with Email verification at this time and working towards a solution. Should you encounter this issue, please click on the "Send Verification Button" a second time and the request should go through. If the issue still persists for you, please email support@alteryx.com for assistance.

Alteryx Designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.

Fuzzy Matching on 30MM+ File

msimcox
5 - Atom

Hello!

 

I am attempting to run a rather large 'base' set of data (30MM+) and a not so large (100k+) 'compare' set of data into the fuzzy matching tool.  The problem mainly is that the tool doesn't complete in a timely manner.  

 

The match is attempting to compare a 'Company Name' field in the two files, and so I initially used the 'Company Name' match type, and did nothing else.  This didn't complete at all.  

 

In order to resolve the issue, I've tried several things:

  • Added 'State' to the match fields, in order to improve the key generation process, using the 'Address Part' match type
  • Tweaked thresholds (up and down)
  • Eliminated all NULL fields in either Company OR State columns (eliminated about 250k from base file)
  • Parsed rows that 'matched' in an upstream process out of the 'base' file. (eliminated about another 250k)

None of these did much at all.  I can't really add any other match fields, as the compare is meant to look at Company Name by itself.  I thought I could build a batch macro that passed a small set of records in at a time, but not sure how to start on that, nor whether it's the correct route.

 

Additionally, I know that the 'base' file has multiple rows per 'account', and in the case of this match, the same key and match value should ultimately apply.  So I thought I could run 1 unique record per account through, and then retroactively set the match score to the other records after the fact.  

 

Thoughts?

 

Thanks in advance!

 

Matt Simcox

 

 

1 REPLY 1
DultonM
11 - Bolide

Hi @msimcox,

 

This is a lot of data for Fuzzy Matching. The Fuzzy Match tool seeks to compare every possible pair of records, so you are comparing 100,000 x 30,000,000 = 3,000,000,000,000 pairs!

 

As you have already figured out, any records you can remove from either side will help. Definitely eliminate fields that contain Null, definitely unique both data sets on Company Name or by Account, and definitely try to preemptively match records with parsing, a Join, or Reg-Ex.

 

Is it possible to split your data into buckets such that you know the data in a given bucket shouldn't match to data in the other buckets? For example, you could create a rule that (after eliminating words like "the") the first letter of the Company Name has to match. You could then use a batch macro to perform Fuzzy Matching on just the A's then the B's then the C's etc. Because the data is broken down, the total number of pairs you are comparing can be way less.

 

Once you have eliminated all the records you can and have isolated your data into separate buckets, there are a few things you can try in the Fuzzy Match tool to speed up the process:

  • Try a different "Generate Keys" methods (Double Metaphone vs. Soundex, etc.)
  • Change the "Maximum Key Length"
  • Change the "Match Function"

In some quick tests on a random 500,000 record set I have, I found...

  • Decreasing the "Maximum Key Length" slows things way down
  • Increasing the "Maximum Key Length" didn't change the runtime much...but it may make a noticeable difference with more data
  • Changing "Double Metaphone" to "Soundex" produced faster result

I'm surprised Adding State didn't speed things up considerably. I tried that with my dataset and things ran way faster. Adding more variables increases the workload in that more keys are generated and the Match Function has to be performed on more fields. However, if the introduction of the new keys causes a lot fewer records to be sent to the Match Function, then the overall workload decreases because the Match Function is more computationally intense than key generation (I believe).

 

Hope this helps!

Labels