Help with matching

Question

Hello,

I have performed a reconciliation on 2 datasets, and have come across 1 issue with some false breaks.

Where 1 side contains 1/2 latin/special characters while other side has full english. (Using UTF-8, have tried every other as well, UTF-8 came out as best)

For Example:

George Fernandes           George Ferñandes              coming as a break.

Any was to apply a matching % here to figure out high match rows. Tried fuzzy match, but it's too complex and not giving right.

Around 400 rows of data, and around 25 columns, out of which around 8 pair of columns containings strings like addresses are getting compared, and multiple instances of diverse latin/weird/special characters leading to mismatches.

danilang · Answer

Hi @ShantanuDagar

There's a function called DecomposeUnicodeForMatch([fieldname]) which effectively replaces all accented characters with their unaccented equivalents.    Create a new match column on each source and apply the following formula

Lowercase(DecomposeUnicodeForMatch(trim([Fieldname])))

This will normalize the inputs and allow joins to work correctly.

In general, the algorithm is this,

* normalize inputs
* join to find exact matches
* Fuzzy Match on remaining* Add a source field to each dataset.  The value in the source field doesn't matter but it needs to be the same in dataset and different between datasets
* union the data
* Add a recordID to the unioned data
* configure the fuzzy match in Merge mode with the [Source] as the Source ID field and RecordId as the recordID field
* Pick the correct match style, Name, Address.
* Use a unique tool to get the unique pairs of RecordID, RecordID2.

You'll need to use a fuzzy match tool for each column you'll be comparing

Dan

rzdodson · Answer

@ShantanuDagar would you be able to share a sample data set around this that mirrors what you are seeing? I would imagine that a lot of this can be handled through Regex_Replace logic, but it'll be helpful to see what we are dealing with to write the workflow logic.