Hello,
This post is about duplicate supplier invoices detection for audit purposes.
The data input is a list of invoices prepared by an OCR character recognition software that interprets the scanned invoice numbers (alphanumeric , between 1 and 15 characters wide).
The fuzzy part comes to mind because the OCR software often mixes up numbers, letters and special characters. For example
- the number one is often interpreted as a / or \.
- the number 0 is often interpreted as an uppercase letter O.
- 8 and B
- etc. etc.
I'd rather not replace those Os by 0s because sometimes they are valid.
Fuzzy match (workflow attached) does propose some interesting matches but I can't teach it to detect the matches in records 1 and 2 or 5 and 6. (the Bottom part of the attached workflow finds exact matches and that part is working fine. The Fuzzy match is in the upper part of the workflow)
If someone in the Community has insights please be aware that a whole lot of auditors in the world will praise it much and then again !
Luis
Solved! Go to Solution.
Hi @LuisKahn
Here's one way you can go about it.
It all starts with the confusion list.
This is list of all letters that can be confused with one confusion group per row
After splitting this list and your unique Invoice_Numbers to rows, the rest of the workflow joins letters from the invoice # with the possible confused letters. Any place there's a match, the letter from the Invoice # is replaced with the generic Letter, i.e. "B" or "8" are replaced by "B". The summarize builds a Unconfused invoice number and uses this as the unique field in the second Only Unique macro. The result is this
Note that I added Record 381(1st "0" replaced with an "O") to your input data to show that multiple matches can be found.
Of course, the accuracy is only as good as the values in the Confusion List, so make sure you capture all possibles. The auditors probably have an exhaustive list.
Dan
Looks awesome Dan,
thanks for your quick reply
I will test it ASAP to accept and star it.
thanks again
Luis
Hi @LuisKahn
I don't think your process can be made completely automatic. The underlying issue is that OCR can silently make mistakes, and from an auditors point of view, they have to catch all of these mistakes. What my process does is flag all the possible cases where any invoice number can be mistaken for another invoice number. It's a very greedy algorithm(i.e. gives more false positives to avoid missing any actual negatives) to make up for the fact that the OCR process can range anywhere from almost perfect to abysmal depending on the state of the original document. Once you have possible matches, you can then analyze the matching invoices further to see if they have anything to differentiate them. Do the addresses match? company names, amounts, etc.
The probability index that you're referring to can't act on just the invoice # field, since there's no way to tell if the OCR process messed up in that one field. However, if you compare the results of multiple fields, you can develop an algorithm that says if XX% of the target fields(Invoice amount, line items, customer name, etc) match, then it's the same invoice. This process will work because the errors in the OCR process will tend to cancel themselves out.
For your example, 82B will become 828, which may also be a valid invoice. If you have both of these in your input list, you have to proceed on the assumption that they could be the same, because you don't know if the OCR process made an error in one or the other or both.
Starting a new thread will bring this up to top to the heap, but it's not really an Alteryx issue at this point. I think it's crossed the line to a GAAP (Generally Accepted Accounting Principles) issue
Dan
thanks a lot Dan for your thorough insights.
It's a relief to have someone to discuss with and learn from about this topic
regards
Luis