Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.

Inexact lookup - fuzzy match question

geordie986
6 - Meteoroid

Hi all,

 

I'm working on flow which is looking for partial matches in a second table, based on input from a first table.

For example, with the data I mocked up below, I am looking for the string 818156205 in the e-mail body table, and the record "Dear user, document DN#81856205 has been created." should be return based on partial match (notice the fourth digit missing hence exact find/replace not possible).

 

Search  key
818156205

 

E-mail body
Dear user, documents DN#818149588/818149587 have been created.
Dear user, document DN#81856205 has been created.

 

Despite following https://community.alteryx.com/t5/Tool-Mastery/Tool-Mastery-Fuzzy-Match/ta-p/45485 , and playing around with treshold and match style but no luck - output from Fuzzy Match is always empty. Any suggestion on what I am doing wrong? 

 

 

image.png

 

7 REPLIES 7
ChrisTX
15 - Aurora

I'm getting an Unhandled Exception whenever I select Match Style Custom, so I can't test this out.

 

On the main screen temporarily choose the option for Generate Keys Only.  Then run the workflow and make sure all records have a Key.

Depending on the options you select on the Config screen, you may see that your record id 3 does not get a Key assigned.

If a record does not have a Key, it will never match.  

 

After you have a Configuration where all of your records have keys....

  then un-select Generate Keys and use the other options highlighted below. 

  Since your data is more like a "code" versus English words, you'll want to focus on match types like Jaro and/or Levenshtein Distance.  Try Google to see how Levenshtein works.

 

The option for Output Match Score will help you adjust the Match Threshold.

 

Screenshot 0 2024-01-10 050154.png

Screenshot 1 2024-01-10 045608.png

Screenshot 2 2024-01-10 045734.png

 

Screenshot 3 2024-01-10 045911.png

Chris

geordie986
6 - Meteoroid

Thank you Chris,

I did check and keys are getting generated for all, see below - I added also another search key "818149587" to verify the matching logic with "DN#818149588/818149587" entry:

 

image.png

 

 

 

Tried with mulitple combinations of Jaro and Levenshtein Distance with different match treshold but no luck. The only way I can make the fuzzy match works is to manipuale one of the mails content to have the excact string, which defeats the purpose of this tool 😅

 

image.png

AndrewDMerrill
13 - Pulsar

It may work better if you first isolate the numbers to check, Then split with Text To Columns Tool (to rows). This would allow you to compare Apples to Apples instead of Apples to ApplesApples. It would work much better for the Fuzzy Match Tool, and I think it may solve your problem.

Brankl
7 - Meteor

I agree with Andrew. The RegEx tool is probably the easiest way to identify strings of consecutive numbers within your text (while knowing when to stop and not to combine multiple strings into one)

geordie986
6 - Meteoroid

I splitted the strings to isolate single numbers, but unfortunately also this method didn't yield any better results. The input now looks like this:

image.png

However, brute-forcing all combinations of built-in / custom match style / key generation options and even selecting 1% match treshold, I either get only the exact match (Jaro or Levenshtein or best of both):

 

image.png

Or, using address match, getting another partial match but still not the one I'm looking for:

image.png

I tried also to pad the shorter string "81856205" left or right with zeros or other characters, but no change. Any further suggestions?

ChrisTX
15 - Aurora

Because you're trying to match on only numbers, I'm thinking Fuzzy Match may not be the right tool because the tool's first processing phase generates keys, and only matched keys (exact match on the key) are sent to the second processing phase.  The second phase is where logic like Levenshtein is used.  But your "similar" numeric values won't get past the phase one requirement for "exact match on the key".

 

The only thing you could possibly do with the key generation process is to set the Maximum Key Length to something like 4, which would require the last 4 (or first 4?) digits to match exactly, before the second phase actually uses logic like Levenshtein or Jaro.

 

Essentially, you want to calculate a Levenshtein and/or Jaro value for all combinations from your two sources.

 

Unfortunately Alteryx doesn't have a Levenshtein or Jaro function.

 

You could use the Append tool to link every Source A row to every Source B row.  Then call Python code to calculate values for  Levenshtein and Jaro.

 

Chris

 

 

geordie986
6 - Meteoroid

Thank you, I believe 4 digits keys may still be suboptimal solution because, give that I am parsing users' manually generated data, typo may be at any point of the string.

I will check with Pyhton libraries whethere there is some function that can address similar requiremnet.

 

Daniele

Labels