Alteryx Designer Desktop Discussions

XIIJAMIEIIX · ‎06-10-2024

Hi Folks,

Hope everything is well

I'm dealing with matching some titles that are truncated or suffixed that I need to match internal identifiers with and I'm having issues using fussy matching to get these.

The issue is I can't do overall similarity as I don't want to match the 'Draft' version of documents so it needs to be does the first x characters match, and that's the ENTIRE name of what we have if that makes sense?

Match 1	Match 2	Percentage
ABC	ABC123	50%
DEF	DEF	100%
GHI	GIH	33%

DataNath · ‎06-10-2024

Hey @XIIJAMIEIIX, hopefully I understood your ask correctly. I can imagine there's a simpler way of doing this but this works for your test data and I believe should scale fine if you want to check it with a wider data set:

XIIJAMIEIIX · ‎06-11-2024

Hello @DataNath that almost works! The results are what I was expecting however the source is coming from 2 different CSV and are being matched against each other.

Sort of like this (and sorry I completely forgot to mention the 2 CSV's)

Match 1 - CSV Input 1	Match 2 - CSV Input 2	Percentage
ABC	ABC, 123	50%
DEF	DEF	100%
GHI	GIH	33%

DataNath · ‎06-11-2024

Hey @XIIJAMIEIIX, how does this updated version look? It just takes the 2 streams and merges them before stripping out any punctuation etc to just leave the letters and digits before running the same comparison. If this still isn't exactly what you're looking for then can you please fully explain and provide better examples to minimise back and forth?

XIIJAMIEIIX · ‎06-11-2024

It looks to be matching with totally random records rather than similar matches, I think it might be due to the first join matching on the Record position so it's not aligning items correctly and therefore giving incorrect percentages if that makes sense?

XIIJAMIEIIX · ‎06-11-2024

I'm not too sure if this is an easy way to explain it

Match 1 - CSV1	Match 2 - CSV2
ABC	GIH
DEF	ABC, 123
GHI	DEF

Match 1 - CSV1	Match 2 - CSV2	Percentage
ABC	ABC, 123	50%
DEF	DEF	100%
GHI	GIH	33%

jdminton · ‎06-11-2024

I think you might want to stick to the fuzzy matching tool unless you essentially want to rebuild this. The main reason is that in your examples it makes sense what you're trying to do, but with real data, the characters might be off by one position indicating a zero percent match for an extra character. I.e. the name of a company may be 1 Capital Place in one list and One Capital Place in another list. With what you're trying to do, you will get a zero percent match. With a small list, these may be easy to see, but if you're building automation, my guess is you have a very large list. Have you used the actual fuzzy match tool yet? If so, share what you've got so far, and we can help adjust if necessary.

XIIJAMIEIIX · ‎06-11-2024

Yeah I did try Fuzzy match but it was giving out far too many random hits and wasn't accurate enough so this is why I needed to pivot to something that would look at Character 1 from one list and Character 2 from another, I did take into account grammar and such with normalizing everything with regex to try and improve my matches.

We are only caring about a threshold of 90% match with everything that falls under requiring a manual review to attempt to match afterward.

so the example I have here one of the values is a 56% match but ideally it should only be 33% as only the first character matches in order

jdminton · ‎06-11-2024

@XIIJAMIEIIX Try this workflow. I used iterative macros to evaluate letter by letter based on the rules you defined.

Alteryx Designer Desktop Discussions

Comparing 2 columns strings have giving percentage scores based on similarity