Inexact lookup - fuzzy match question
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Hi all,
I'm working on flow which is looking for partial matches in a second table, based on input from a first table.
For example, with the data I mocked up below, I am looking for the string 818156205 in the e-mail body table, and the record "Dear user, document DN#81856205 has been created." should be return based on partial match (notice the fourth digit missing hence exact find/replace not possible).
Search key |
818156205 |
E-mail body |
Dear user, documents DN#818149588/818149587 have been created. |
Dear user, document DN#81856205 has been created. |
Despite following https://community.alteryx.com/t5/Tool-Mastery/Tool-Mastery-Fuzzy-Match/ta-p/45485 , and playing around with treshold and match style but no luck - output from Fuzzy Match is always empty. Any suggestion on what I am doing wrong?
- Labels:
- Fuzzy Match
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
I'm getting an Unhandled Exception whenever I select Match Style Custom, so I can't test this out.
On the main screen temporarily choose the option for Generate Keys Only. Then run the workflow and make sure all records have a Key.
Depending on the options you select on the Config screen, you may see that your record id 3 does not get a Key assigned.
If a record does not have a Key, it will never match.
After you have a Configuration where all of your records have keys....
then un-select Generate Keys and use the other options highlighted below.
Since your data is more like a "code" versus English words, you'll want to focus on match types like Jaro and/or Levenshtein Distance. Try Google to see how Levenshtein works.
The option for Output Match Score will help you adjust the Match Threshold.
 
 
 
 
Chris
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Thank you Chris,
I did check and keys are getting generated for all, see below - I added also another search key "818149587" to verify the matching logic with "DN#818149588/818149587" entry:
Tried with mulitple combinations of Jaro and Levenshtein Distance with different match treshold but no luck. The only way I can make the fuzzy match works is to manipuale one of the mails content to have the excact string, which defeats the purpose of this tool 😅
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
It may work better if you first isolate the numbers to check, Then split with Text To Columns Tool (to rows). This would allow you to compare Apples to Apples instead of Apples to ApplesApples. It would work much better for the Fuzzy Match Tool, and I think it may solve your problem.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
I agree with Andrew. The RegEx tool is probably the easiest way to identify strings of consecutive numbers within your text (while knowing when to stop and not to combine multiple strings into one)
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
I splitted the strings to isolate single numbers, but unfortunately also this method didn't yield any better results. The input now looks like this:
However, brute-forcing all combinations of built-in / custom match style / key generation options and even selecting 1% match treshold, I either get only the exact match (Jaro or Levenshtein or best of both):
Or, using address match, getting another partial match but still not the one I'm looking for:
I tried also to pad the shorter string "81856205" left or right with zeros or other characters, but no change. Any further suggestions?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Because you're trying to match on only numbers, I'm thinking Fuzzy Match may not be the right tool because the tool's first processing phase generates keys, and only matched keys (exact match on the key) are sent to the second processing phase. The second phase is where logic like Levenshtein is used. But your "similar" numeric values won't get past the phase one requirement for "exact match on the key".
The only thing you could possibly do with the key generation process is to set the Maximum Key Length to something like 4, which would require the last 4 (or first 4?) digits to match exactly, before the second phase actually uses logic like Levenshtein or Jaro.
Essentially, you want to calculate a Levenshtein and/or Jaro value for all combinations from your two sources.
Unfortunately Alteryx doesn't have a Levenshtein or Jaro function.
You could use the Append tool to link every Source A row to every Source B row. Then call Python code to calculate values for Levenshtein and Jaro.
Chris
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Thank you, I believe 4 digits keys may still be suboptimal solution because, give that I am parsing users' manually generated data, typo may be at any point of the string.
I will check with Pyhton libraries whethere there is some function that can address similar requiremnet.
Daniele
