Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Fuzzy Matching letters and numbers (e.g. 0 and O; 1 and l; 8 and B) in 10 char string

LuisKahn
7 - Meteor

Hello,
This post is about duplicate supplier invoices detection for audit purposes.

The data input is a list of invoices prepared by an OCR character recognition software that interprets the scanned invoice numbers (alphanumeric , between 1 and 15 characters wide).

The fuzzy part comes to mind because the OCR software often mixes up numbers, letters and special characters. For example
- the number one is often interpreted as a / or \.
- the number 0 is often interpreted as an uppercase letter O.
- 8 and B
- etc. etc.

I'd rather not replace those Os by 0s because sometimes they are valid.

Fuzzy match (workflow attached) does propose some interesting matches but I can't teach it to detect the matches in records 1 and 2 or 5 and 6. (the Bottom part of the attached workflow finds exact matches and that part is working fine. The Fuzzy match is in the upper part of the workflow)

If someone in the Community has insights please be aware that a whole lot of auditors in the world will praise it much and then again !

Luis

5 REPLIES 5
danilang
19 - Altair
19 - Altair

Hi @LuisKahn 

 

Here's one way you can go about it.

 

WF.png

 

It all starts with the confusion list.

 

List.png

 

This is list of all letters that can be confused with one confusion group per row

 

After splitting this list and your unique Invoice_Numbers to rows, the rest of the workflow joins letters from the invoice # with the possible confused letters.  Any place there's a match, the letter from the Invoice # is replaced with the generic Letter, i.e. "B" or "8" are replaced by "B".  The summarize builds a Unconfused invoice number and uses this as the unique field in the second Only Unique macro.  The result is this

 

Result.png

 

Note that I added Record 381(1st "0" replaced with an "O") to your input data to show that multiple matches can be found.

 

Of course, the accuracy is only as good as the values in the Confusion List, so make sure you capture all possibles.  The auditors probably have an exhaustive list.

 

Dan

LuisKahn
7 - Meteor

Looks awesome Dan,

thanks for your quick reply

I will test it ASAP to accept and star it.

thanks again

Luis

LuisKahn
7 - Meteor
Hello again Dan, it has been a while but after trying to remove confusion I created more of it. Replacing B by 8 works fine if the OCR made a mistake. But sometimes OCR does things right. When workflow "Remove Confusion.yxzp" is blindly replacing all "B"s by "8"s, I end up adding confusion in my invoice numbers. Let's take invoices 82B and 82C as an example. "Remove Confusion.yxzp" will create the following join of all possible confused letters (if your browser garbles the table below I can resend it in Excel): RecordID Invoice_Number Amount ColumnID ConfusionID Letters GenericLetter 1 8 100 1 3 8 8 1 2 100 2 1 B 100 3 3 B 8 2 8 100 1 3 8 8 2 2 100 2 2 C 100 3 At the end of the day my "unconfused" 82B invoice became invoice number 828 which is confusing. I can't think around the problem on how to add a probability index somehow to say that 82B maybe 828 or may be not But Dan, "Remove Confusion.yxzp" did help me find several duplicate records. Should I mark it as "solution accepted" and start another discussion/thread ? thanks again, Luis
danilang
19 - Altair
19 - Altair

Hi @LuisKahn 

 

I don't think your process can be made completely automatic.  The underlying issue is that OCR can silently make mistakes, and from an auditors point of view, they have to catch all of these mistakes.   What my process does is flag all the possible cases where any invoice number can be mistaken for another invoice number.   It's a very greedy algorithm(i.e. gives more false positives to avoid missing any actual negatives) to make up for the fact that the OCR process can range anywhere from almost perfect to abysmal depending on the state of the original document.  Once you have possible matches, you can then analyze the matching invoices further to see if they have anything to differentiate them.  Do the addresses match? company names, amounts, etc.  

 

The probability index that you're referring to can't act on just the invoice # field, since there's no way to tell if the OCR process messed up in that one field.  However, if you compare the results of multiple fields, you can develop an algorithm that says if XX% of the target fields(Invoice amount, line items, customer name, etc) match, then it's the same invoice. This process will work because the errors in the OCR process will tend to cancel themselves out.   

 

For your example, 82B will become 828, which may also be a valid invoice.  If you have both of these in your input list, you have to proceed on the assumption that they could be the same, because you don't know if the OCR process made an error in one or the other or both.  

 

Starting a new thread will bring this up to top to the heap, but it's not really an Alteryx issue at this point.  I think it's crossed the line to a GAAP (Generally Accepted Accounting Principles) issue

 

Dan

 

LuisKahn
7 - Meteor

thanks a lot Dan for your thorough insights.

It's a relief to have someone to discuss with and learn from about this topic

regards

Luis

Labels