Looking for some assistance here related to designing a logic for last name match using fuzzy match tool- dataset ~1.4M records
Sample table format:
COL-A | COL-B | COL-C | COL-D | COL-E |
X | S1 | LASTNAME1 | S2 | LASTNAME10 |
X | S1 | LASTNAME2 | S2 | LASTNAME2 |
Y | S3 | LASTNAME3 | S4 | LASTNAME11 |
Y | S3 | LASTNAME3 | S4 | LASTNAME12 |
Y | S3 | LASTNAME4 | S4 | LASTNAME13 |
Z | S5 | LASTNAME5 | S6 | LASTNAME5 |
Lastnames in input dataset have spelling, character mismatches. So algorithm needed is to identify similarity score using all text-word-character comparisons possible
Need to match COL-C vs COL-E for similarity and need to output the score and key used to map in fuzzy match for each of the input rows. I am trying to setup fuzzy match on this table with custom setting but since dataset has total 1.3M rows , currently it is taking long time and returning 100s of billions of rows from fuzzy match tool output which is inefficient as it is comparing record of each lastname from col C with lastname from Col E. I need it to only run comparison on same row value. for ex. row 1 of sample table- LASTNAME1 should only get compared with LASTNAME10 as it is on same row and not with other row values. So expected output rows from fuzzy match tool should be less than original 1.3M input rows as some of row ideally will not get matched depending on threshold setup and not billion rows will be outputted which i am currently struggling with
Sample Expected result.
COL-A | COL-B | COL-C | COL-D | COL-E | Score | MapKey |
X | S1 | LASTNAME1 | S2 | LASTNAME10 | Score1 | MK1 |
X | S1 | LASTNAME2 | S2 | LASTNAME2 | Score 2 | MK2 |
Y | S3 | LASTNAME3 | S4 | LASTNAME11 | Score 3 | MK3 |
Y | S3 | LASTNAME3 | S4 | LASTNAME12 | Score 4 | MK4 |
Y | S3 | LASTNAME4 | S4 | LASTNAME13 | Score 5 | MK5 |
Z | S5 | LASTNAME5 | S6 | LASTNAME5 | Score 6 | MK6 |
Any sample workflow using batch macro-fuzzy match technique or using other matching functions that will help address my issue will be really helpful here