Super slow fuzzy match workflow

Question

I am running a fuzzy match workflow to match data between two tables. Here are the details:

Table1: 2.8 million rows

Table2: 7000 rows

I only have a single column to match on and generate keys for in the fuzzy match. I have a 16GB RAM machine and I've set the join/sort memory to 8096MB, but it still throws the low physical memory warning.

The source and target tables are both in Redshift, and I am not using a bulk loader. I doubt that is the issue, though: as the attached snip shows, the slow steps are the Fuzzy Match tool and the Unique tool that follows it.

The workflow runs without errors, but performance is extremely slow: the Fuzzy Match tool completes 1% in 15 minutes.

I am attaching the workflow as well as sample files that I am using for the purpose.

Please share ways to improve the speed of this workflow.

slow fuzzy match.PNG

Table2.csv

fuzzy_match.yxmd

table1.csv

fmvizcaino · Accepted Answer

Hi @nimeshkhatri ,

One thing I would do, since you have a lot of identical data, is to summarize your client field before it enters the Fuzzy Match tool, so that each distinct value is matched only once instead of once per duplicate row.
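To illustrate the idea outside of Alteryx, here is a minimal Python sketch of "collapse duplicates, match once, map back". The sample rows, the field names, and the `match_by_name` lookup (a stand-in for whatever the fuzzy match returns) are all made up for the example; none of it comes from the attached workflow.

```python
from collections import defaultdict

def dedupe_names(rows):
    """Group record IDs by identical name so each distinct name is matched once."""
    ids_by_name = defaultdict(list)
    for record_id, name in rows:
        ids_by_name[name].append(record_id)
    return dict(ids_by_name)

def expand_matches(ids_by_name, match_by_name):
    """Copy each distinct name's match result back onto every original record ID."""
    return {
        record_id: match_by_name.get(name)
        for name, record_ids in ids_by_name.items()
        for record_id in record_ids
    }

# Hypothetical sample rows: (record_id, client_name)
rows = [(1, "Acme Corp"), (2, "Acme Corp"), (3, "Globex Inc"), (4, "Acme Corp")]
ids_by_name = dedupe_names(rows)

# Stand-in for the fuzzy match output, computed on the distinct names only
match_by_name = {"Acme Corp": "ACME CORPORATION", "Globex Inc": "GLOBEX"}
matches = expand_matches(ids_by_name, match_by_name)
```

In this toy input the expensive step would see 2 names instead of 4 rows; how much it helps on the real table depends on how many of the 2.8 million rows are exact repeats.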

Another thing, I have noticed that you are generating keys for each word and leaving some behind.

It depends on your data, but I would uncheck this option and use a Find Replace tool to remove those common words beforehand. Since you have a RecordID, you can recover the original company name afterwards.
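As a rough sketch of that cleanup step in Python rather than in Alteryx: strip a list of common words from each name before matching, and keep the record ID so the original name can be looked up again. The word list, sample names, and helper name `strip_common_words` are assumptions for illustration only; the right list depends on your data.

```python
import re

# Hypothetical list of common words to drop before matching
COMMON_WORDS = {"inc", "corp", "llc", "ltd", "co", "company"}

def strip_common_words(name):
    """Lowercase the name, split it into alphanumeric tokens, drop common words."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(t for t in tokens if t not in COMMON_WORDS)

# Hypothetical sample rows: (record_id, company_name)
rows = [(1, "Acme Corp."), (2, "Globex, Inc"), (3, "The Initech Company LLC")]

# Cleaned names go to the matcher; the originals stay reachable by record ID
cleaned = {record_id: strip_common_words(name) for record_id, name in rows}
original = dict(rows)
```

After matching on `cleaned`, a join back on the record ID against `original` restores the full company name.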

Let me know if that helps.

Best,

Fernando Vizcaino