Looking for some assistance with designing logic for a last-name match using the Fuzzy Match tool, on a dataset of ~1.4M records.
Sample table format:
| COL-A | COL-B | COL-C | COL-D | COL-E |
|-------|-------|-------|-------|-------|
| X | S1 | LASTNAME1 | S2 | LASTNAME10 |
| X | S1 | LASTNAME2 | S2 | LASTNAME2 |
| Y | S3 | LASTNAME3 | S4 | LASTNAME11 |
| Y | S3 | LASTNAME3 | S4 | LASTNAME12 |
| Y | S3 | LASTNAME4 | S4 | LASTNAME13 |
| Z | S5 | LASTNAME5 | S6 | LASTNAME5 |
Last names in the input dataset have spelling and character mismatches, so the algorithm needs to produce a similarity score using whatever text, word, and character comparisons are possible.
I need to compare COL-C against COL-E for similarity and, for each input row, output the score and the key used to map in the fuzzy match. I have tried setting up the Fuzzy Match tool on this table with a custom setting, but since the dataset has 1.3M rows in total it takes a very long time and returns hundreds of billions of rows, because the tool compares every last name in COL-C against every last name in COL-E. I need the comparison to run only within the same row. For example, in row 1 of the sample table, LASTNAME1 should be compared only with LASTNAME10 (the value on the same row) and not with values from other rows. So the expected output from the fuzzy match should be at most the original 1.3M input rows (fewer, since some rows will not match depending on the threshold), not the billions of rows I am currently struggling with.
Sample expected result:

| COL-A | COL-B | COL-C | COL-D | COL-E | Score | MapKey |
|-------|-------|-------|-------|-------|-------|--------|
| X | S1 | LASTNAME1 | S2 | LASTNAME10 | Score1 | MK1 |
| X | S1 | LASTNAME2 | S2 | LASTNAME2 | Score2 | MK2 |
| Y | S3 | LASTNAME3 | S4 | LASTNAME11 | Score3 | MK3 |
| Y | S3 | LASTNAME3 | S4 | LASTNAME12 | Score4 | MK4 |
| Y | S3 | LASTNAME4 | S4 | LASTNAME13 | Score5 | MK5 |
| Z | S5 | LASTNAME5 | S6 | LASTNAME5 | Score6 | MK6 |
Any sample workflow using a batch macro + Fuzzy Match technique, or other matching functions that address this, would be really helpful.
So you just want to go line by line rather than comparing one dataset against the other. That's not how the Fuzzy Match tool is set up, and for a line-by-line comparison I would suggest a formula-based approach for speed rather than the Fuzzy Match tool. The tool does far more than what you are after, and so it takes far longer.
If you go with a formula-style approach, the Jaro-Winkler algorithm is ideally the formula you would be looking for. Take a look at this macro; I have not tested or even downloaded it, so I have no idea on speed etc.
This is the idea for a formula to be added natively (please star it to vote for it): https://community.alteryx.com/t5/Alteryx-Designer-Desktop-Ideas/Levenshtein-and-Jaro-Winkler-Distanc...
And this is a description of the Jaro-Winkler algorithm: https://medium.com/codex/best-libraries-for-fuzzy-matching-in-python-cbb3e0ef87dd#:~:text=all%20huma...
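Since there is no native Jaro-Winkler formula function yet (hence the idea above), one option is to compute the score per row yourself, e.g. in the Python tool. Below is a minimal, self-contained sketch of the algorithm applied row by row; treat it as an illustration rather than a tested solution. The 0.1 prefix weight and 4-character prefix cap are just the usual Winkler defaults, and the sample values come from the table in the question.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as a match if they sit within this window of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    s1_matched = [False] * len1
    s2_matched = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not s2_matched[j] and s2[j] == ch:
                s1_matched[i] = s2_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions among the matched characters.
    transpositions, j = 0, 0
    for i in range(len1):
        if s1_matched[i]:
            while not s2_matched[j]:
                j += 1
            if s1[i] != s2[j]:
                transpositions += 1
            j += 1
    transpositions //= 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_weight: float = 0.1) -> float:
    """Jaro-Winkler: Jaro plus a bonus for a common prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1 - base)


if __name__ == "__main__":
    # Row-wise scoring: each COL-C value is compared only with the COL-E value
    # on the same row, so N rows in means at most N rows out.
    rows = [("LASTNAME1", "LASTNAME10"), ("LASTNAME2", "LASTNAME2")]
    for c, e in rows:
        print(c, e, round(jaro_winkler(c.upper(), e.upper()), 3))
```

Because this is a single linear pass over the rows rather than a cross join, the output can never exceed the input row count.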
If you would really like to persist with the tool, then try a batch or iterative macro feeding in one line (or unrelated lines) at a time. Take a look at "Help > Sample Workflows > Use Scripting and.... > Build a Macro > Merge to a master file with Fuzzy Matching" for an example using macros. However, that example is not the same as yours and would require a lot of stripping out, so I would advise looking at that sample but not using it as your starting point.
For the simplest iterative-macro version of line by line, build an iterative macro with parallel Sample tools to split the top row off and then iterate over the rest. But 1.3M iterations is going to be hefty.
Take a look at this: https://community.alteryx.com/t5/Alteryx-Designer-Desktop-Discussions/FUZZY-MATCH-line-by-line/td-p/...
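If the Python tool is an option, you can also score the whole table in one pass with no macro iterations at all. This is only a sketch under assumptions I'm making up for illustration: the column names, the MapKey format, and the 0.8 threshold are placeholders, difflib's ratio is just a stand-in scorer (swap in the jaro_winkler() above if you prefer), and the commented ayx lines show where the Python tool's read/write calls would go.

```python
from difflib import SequenceMatcher

import pandas as pd
# from ayx import Alteryx          # uncomment when running inside the Python tool


def similarity(a: str, b: str) -> float:
    """Stand-in scorer; replace with jaro_winkler() from the sketch above if preferred."""
    return SequenceMatcher(None, str(a).upper(), str(b).upper()).ratio()


# df = Alteryx.read("#1")          # inside Designer: read the incoming connection
df = pd.DataFrame({                # stand-alone demo using the sample rows
    "COL_C": ["LASTNAME1", "LASTNAME2", "LASTNAME3"],
    "COL_E": ["LASTNAME10", "LASTNAME2", "LASTNAME11"],
})

# One score per row: COL_C is only ever compared with COL_E on the same row,
# so the output can never be larger than the input (no cross join, no iterations).
df["Score"] = [similarity(c, e) for c, e in zip(df["COL_C"], df["COL_E"])]
df["MapKey"] = df["COL_C"] + "|" + df["COL_E"]   # placeholder key format
matched = df[df["Score"] >= 0.8]                 # placeholder threshold

# Alteryx.write(matched, 1)        # inside Designer: send results to output 1
print(matched)
```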