Hello,
I have performed a reconciliation on two datasets and have come across an issue with some false breaks.
One side contains accented Latin/special characters while the other side has plain English. (I am using UTF-8; I have tried the other encodings as well, and UTF-8 came out best.)
For Example:
"George Fernandes" vs. "George Ferñandes" is coming up as a break.
Is there any way to apply a matching % here to identify high-match rows? I tried fuzzy matching, but it's too complex and isn't giving the right results.
There are around 400 rows of data and around 25 columns, of which around 8 pairs of columns containing strings such as addresses are being compared, and multiple instances of diverse Latin/special characters are leading to mismatches.
@ShantanuDagar would you be able to share a sample data set around this that mirrors what you are seeing? I would imagine that a lot of this can be handled through Regex_Replace logic, but it'll be helpful to see what we are dealing with to write the workflow logic.
There's a function called DecomposeUnicodeForMatch([fieldname]) which effectively replaces all accented characters with their unaccented equivalents. Create a new match column on each source and apply the following formula:
Lowercase(DecomposeUnicodeForMatch(Trim([Fieldname])))
This will normalize the inputs and allow joins to work correctly.
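Outside Alteryx, the same normalization idea can be sketched in Python with the standard library (a minimal sketch; the function name `normalize_key` is illustrative, not part of any tool):

```python
import unicodedata

def normalize_key(value: str) -> str:
    # Mirror Lowercase(DecomposeUnicodeForMatch(Trim([Fieldname]))):
    # trim whitespace, decompose accented characters (NFKD), drop the
    # combining marks, then lowercase the result.
    decomposed = unicodedata.normalize("NFKD", value.strip())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

print(normalize_key("  George Ferñandes "))  # -> "george fernandes"
```

Building this key on both sources should let an exact join succeed where the raw strings break.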
In general, the algorithm is this: use a fuzzy match tool on each pair of columns you'll be comparing, then combine the per-column scores into an overall match % for the row.
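As a rough illustration of the per-column approach (a sketch only; the field names, sample records, and averaging scheme are assumptions, not the tool's actual logic):

```python
import unicodedata
from difflib import SequenceMatcher

def normalize(value: str) -> str:
    # Strip accents (NFKD decomposition, drop combining marks) and lowercase.
    decomposed = unicodedata.normalize("NFKD", value.strip())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

def row_match_percent(left: dict, right: dict, columns: list) -> float:
    # Average the per-column similarity (0-100) over the compared columns.
    scores = [
        SequenceMatcher(None, normalize(left[c]), normalize(right[c])).ratio() * 100
        for c in columns
    ]
    return sum(scores) / len(scores)

# Hypothetical record pair with accented vs. plain spellings.
left = {"name": "George Fernandes", "city": "São Paulo"}
right = {"name": "George Ferñandes", "city": "Sao Paulo"}
print(row_match_percent(left, right, ["name", "city"]))  # -> 100.0
```

Rows scoring above a chosen threshold (say 95%) could then be treated as matches rather than breaks.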
Dan