Hi Folks,
Hope everything is well
I'm dealing with matching some titles that are truncated or suffixed that I need to match internal identifiers with and I'm having issues using fussy matching to get these.
The issue is I can't do overall similarity as I don't want to match the 'Draft' version of documents so it needs to be does the first x characters match, and that's the ENTIRE name of what we have if that makes sense?
Match 1 | Match 2 | Percentage |
ABC | ABC123 | 50% |
DEF | DEF | 100% |
GHI | GIH | 33% |
Hey @XIIJAMIEIIX, hopefully I understood your ask correctly. I can imagine there's a simpler way of doing this but this works for your test data and I believe should scale fine if you want to check it with a wider data set:
Hello @DataNath that almost works! The results are what I was expecting however the source is coming from 2 different CSV and are being matched against each other.
Sort of like this (and sorry I completely forgot to mention the 2 CSV's)
Match 1 - CSV Input 1 | Match 2 - CSV Input 2 | Percentage |
ABC | ABC, 123 | 50% |
DEF | DEF | 100% |
GHI | GIH | 33% |
Hey @XIIJAMIEIIX, how does this updated version look? It just takes the 2 streams and merges them before stripping out any punctuation etc to just leave the letters and digits before running the same comparison. If this still isn't exactly what you're looking for then can you please fully explain and provide better examples to minimise back and forth?
It looks to be matching with totally random records rather than similar matches, I think it might be due to the first join matching on the Record position so it's not aligning items correctly and therefore giving incorrect percentages if that makes sense?
I'm not too sure if this is an easy way to explain it
Match 1 - CSV1 | Match 2 - CSV2 |
ABC | GIH |
DEF | ABC, 123 |
GHI | DEF |
Match 1 - CSV1 | Match 2 - CSV2 | Percentage |
ABC | ABC, 123 | 50% |
DEF | DEF | 100% |
GHI | GIH | 33% |
I think you might want to stick to the fuzzy matching tool unless you essentially want to rebuild this. The main reason is that in your examples it makes sense what you're trying to do, but with real data, the characters might be off by one position indicating a zero percent match for an extra character. I.e. the name of a company may be 1 Capital Place in one list and One Capital Place in another list. With what you're trying to do, you will get a zero percent match. With a small list, these may be easy to see, but if you're building automation, my guess is you have a very large list. Have you used the actual fuzzy match tool yet? If so, share what you've got so far, and we can help adjust if necessary.
Yeah I did try Fuzzy match but it was giving out far too many random hits and wasn't accurate enough so this is why I needed to pivot to something that would look at Character 1 from one list and Character 2 from another, I did take into account grammar and such with normalizing everything with regex to try and improve my matches.
We are only caring about a threshold of 90% match with everything that falls under requiring a manual review to attempt to match afterward.
so the example I have here one of the values is a 56% match but ideally it should only be 33% as only the first character matches in order
@XIIJAMIEIIX Try this workflow. I used iterative macros to evaluate letter by letter based on the rules you defined.