Support for Fuzzy Match

Question

Looking for some assistance here related to designing a logic for last name match using fuzzy match tool- dataset ~1.4M records

Sample table format:

COL-A

COL-B

COL-C

COL-D

COL-E

X

S1

LASTNAME1

S2

LASTNAME10

X

S1

LASTNAME2

S2

LASTNAME2

Y

S3

LASTNAME3

S4

LASTNAME11

Y

S3

LASTNAME3

S4

LASTNAME12

Y

S3

LASTNAME4

S4

LASTNAME13

Z

S5

LASTNAME5

S6

LASTNAME5

Lastnames in input dataset have spelling, character mismatches. So algorithm needed is to identify similarity score using all text-word-character comparisons possible

Need to match COL-C vs COL-E for similarity and need to output the score and key used to map in fuzzy match for each of the input rows. I  am trying to setup fuzzy match on this table with custom setting but since dataset has total 1.3M rows , currently it is taking long time and returning 100s of billions of rows from fuzzy match tool output which is inefficient as it is comparing record of each lastname from col C with lastname from Col E.  I need it to only run comparison on same row value. for ex. row 1 of sample table-  LASTNAME1  should only get compared with LASTNAME10 as it is on same row and not with other row values. So expected output rows from fuzzy match tool should be less than original 1.3M input rows as some of row ideally will not get matched depending on threshold setup and not billion rows will be outputted which i am currently struggling with

Sample Expected result.

COL-A

COL-B

COL-C

COL-D

COL-E

Score

MapKey

X

S1

LASTNAME1

S2

LASTNAME10

Score1

MK1

X

S1

LASTNAME2

S2

LASTNAME2

Score 2

MK2

Y

S3

LASTNAME3

S4

LASTNAME11

Score 3

MK3

Y

S3

LASTNAME3

S4

LASTNAME12

Score 4

MK4

Y

S3

LASTNAME4

S4

LASTNAME13

Score 5

MK5

Z

S5

LASTNAME5

S6

LASTNAME5

Score 6

MK6

Any sample workflow using batch macro-fuzzy match technique or using other matching functions that will help address my issue will be really helpful here

KGT · Answer

So, you just want to go line by line rather than one dataset vs the other. That's not how the Fuzzy Match tool is set up, and to go line by line, I would suggest a formula or something for speed rather than the fuzzy match tool. The tool will do so much more than what you are after and so take longer.

If going to a formula style approach, then ideally the Jaro-Winkler algorithm is the formula you would be looking for. Take a look at this macro, I have not tested or even downloaded it, so no idea on speed etc.

This is the idea for a formula to be added natively (please star it to vote for it): https://community.alteryx.com/t5/Alteryx-Designer-Desktop-Ideas/Levenshtein-and-Jaro-Winkler-Distances-for-formula/idi-p/765142

And this is a description of the Jaro-Winkler algorithm: https://medium.com/codex/best-libraries-for-fuzzy-matching-in-python-cbb3e0ef87dd#:~:text=all%20human%20misspellings.-,Jaro%2DWinkler%20Distance,-The%20Jaro%2DWinkler

If you would really like to persist with the tool, then try a batch or iterative macro feeding in one line (or unrelated lines) at a time. Take a look at "Help > Sample Workflows > Use Scripting and.... > Build a Macro > Merge to a master file with Fuzzy Matching" for an example using macros. However, that example is not the same as yours, and so will require a lot of stripping out... I would advise looking at that sample, but not using it as your starter.

For the easiest solution of an iterative macro line by line, build an iterative macro with parallel sample tools to split the top row off and then iterate the rest. But 1.3m iterations is going to be hefty.

Take a look at this: https://community.alteryx.com/t5/Alteryx-Designer-Desktop-Discussions/FUZZY-MATCH-line-by-line/td-p/952682