
Alteryx Designer Discussions


FuzzyMatch: Wrong MatchScore for Jaro Distance?

Atom

Hello Everyone,

 

I was running some basic tests on the Fuzzy Match tool using simple test words. For some reason I cannot reproduce the commonly published results. Here are typical match scores from Jaro implementations:

 

Score:

0.9444444 'MARTHA' 'MARHTA'
0.8222222 'DWAYNE' 'DUANE'
0.8962963 'JELLYFISH' 'SMELLYFISH'
0.7666667 'DIXON' 'DICKSONX'
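
For reference, the scores above can be reproduced with a minimal Python sketch of the textbook Jaro similarity. This is an illustration under stated assumptions, not the Alteryx implementation: the function name `jaro` is made up for this example, the match window is floor(max(len1, len2) / 2) - 1, and the transposition count is half the number of matched characters that appear in a different order.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity (illustrative sketch, not Alteryx's code)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters only count as matching if they are at most `window` apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0

    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break

    if matches == 0:
        return 0.0

    # Walk both strings over their matched characters and count the
    # positions where the matched characters differ.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


pairs = [
    ("MARTHA", "MARHTA"),
    ("DWAYNE", "DUANE"),
    ("JELLYFISH", "SMELLYFISH"),
    ("DIXON", "DICKSONX"),
]
for a, b in pairs:
    print(f"{jaro(a, b):.7f} {a!r} {b!r}")
```

Under these assumptions the sketch yields the four reference values listed above, which makes it a convenient baseline to compare the tool's MatchScore against.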

  

(These scores can be checked at https://rosettacode.org/wiki/Jaro_distance or https://asecuritysite.com/forensics/simstring )

 

Alteryx gives me the following match scores:

 

[Image: Result.PNG]

While the score for "Martha" is correct and the scores for "Dwayne" and "Jellyfish" are close, "Dicksonx" is not scored at all. Am I doing something wrong, or does MatchScore measure something different?

 

This is the input data:

 

[Image: Input.PNG]

Match-Settings (Jaro Distance):

 

[Image: Matcher1.PNG]

[Image: Matcher2.PNG]

 

If you have any idea why this is happening, please let me know. By the way, the project is based on a Fuzzy Match template from here.

 

Best

Gerald

 

I am having the same issue. I have been trying to match the following words and get no results, even with the match percentage lowered to 0.

 

men and mens.

 

Is it inherent in the algorithm that words with so few characters will never match?
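
As a point of reference, a worked calculation under the textbook Jaro definition (which is not necessarily what the Fuzzy Match tool computes) suggests that short words are not inherently unmatchable. Here m is the number of matching characters and t the number of transpositions. For "men" and "mens", all three characters of "men" match in order, so m = 3 and t = 0:

```latex
d_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right)
    = \frac{1}{3}\left(\frac{3}{3} + \frac{3}{4} + \frac{3 - 0}{3}\right)
    = \frac{2.75}{3} \approx 0.9167
```

Under the standard formula this pair scores high, so getting no match at all would point to the tool configuration or to tool-specific behavior rather than to word length.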

Alteryx

Hi @Gerald,

 

I just wanted to let you know we're taking a look at this one internally for you to see what's going on, as some of our support group also agrees the results seem to be incorrect. I'll let you know as soon as I have any further updates!

 

@dmccombie,

 

I wasn't able to reproduce a "0" match score between the words "mens" and "men", at least not in a way that would relate to Gerald's problem above. You may wish to post your specific configuration and how you are comparing the data so we have a better idea of what you are seeing!

Mike Spoula
Solutions Architect - Services
Alteryx
Atom

Hello All.

 

It looks like Alteryx wrote its own implementation of Jaro, and it is understandable that internal know-how is kept secret. On the other hand, users may need to explain how they generated specific (join) results using Fuzzy Match. Note that the test words (Jellyfish, etc.) are taken from the Alteryx reference itself, which links to the Rosetta Code page with the examples and results (see first post), yet Alteryx calculates, let's say, "differently".

 

Link:

https://help.alteryx.com/current/FuzzyEditMatchOptions.htm?

Content:

[Image: Unbenannt.JPG]

 

Maybe Alteryx took some shortcuts to reduce CPU time, which makes sense to me, as I don't want to wait until the next century for results. But I at least need to know the basics of those shortcuts so that I can explain what is happening and why. It would also be a good idea to specify which algorithm is really used: Jaro, Jaro-Winkler, or some other Jaro variant.
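
To illustrate why the variant matters, here is a minimal Python sketch of the Winkler adjustment applied to the Jaro scores from the first post. It is an assumption-laden illustration, not a statement about what Alteryx does: the helper names `common_prefix_len` and `jaro_winkler` are made up for this example, the scaling factor p = 0.1 and the 4-character prefix cap are the commonly used defaults, and the 0.7 boost threshold follows Winkler's original description (some implementations omit it).

```python
def common_prefix_len(s1: str, s2: str, max_prefix: int = 4) -> int:
    """Length of the shared prefix, capped at max_prefix characters."""
    n = 0
    for a, b in zip(s1, s2):
        if a != b or n == max_prefix:
            break
        n += 1
    return n


def jaro_winkler(jaro_score: float, s1: str, s2: str,
                 p: float = 0.1, boost_threshold: float = 0.7) -> float:
    """Apply the Winkler prefix boost to an already computed Jaro score."""
    if jaro_score <= boost_threshold:
        # Assumed behavior: no boost for weakly similar strings.
        return jaro_score
    prefix = common_prefix_len(s1, s2)
    return jaro_score + prefix * p * (1.0 - jaro_score)


# Plain Jaro scores as quoted in the first post of this thread.
reference = [
    (0.9444444, "MARTHA", "MARHTA"),
    (0.8222222, "DWAYNE", "DUANE"),
    (0.8962963, "JELLYFISH", "SMELLYFISH"),
    (0.7666667, "DIXON", "DICKSONX"),
]
for score, a, b in reference:
    print(f"Jaro {score:.4f} -> Jaro-Winkler {jaro_winkler(score, a, b):.4f}  {a!r} {b!r}")
```

With these defaults the pairs that share a prefix ("MAR", "D", "DI") move up, while JELLYFISH / SMELLYFISH stays unchanged because the first characters differ. A tool reporting one variant under the name of the other would therefore show exactly this kind of partial disagreement with published scores.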

 

This may apply to the implemented Levenshtein algorithm, too.

 

Best

Gerald

 

Alteryx

Hi @Gerald,

 

I have an update for you based on some discussions with our internal teams!

 

The Fuzzy Match tool logic does not exactly match the referenced Jaro distance calculation, but rather has a few ‘Alteryx’ tweaks as you've noted. In light of this, we will work internally to update our documentation and/or the tool configuration to more accurately reflect the methodology differences. In reviewing this report, we did find one aspect to the calculation that could be edited to more closely align with the standard definition, and our engineering team has accepted a defect related to that fix and will work on it for a future release. You will be notified when that fix is released.

Mike Spoula
Solutions Architect - Services
Alteryx
Alteryx

Hi @Gerald,

 

At Inspire, another customer brought this thread to my attention again, so I wanted to provide an update.

 

I wanted to let you (and others) know that the Jaro Distance calculation was updated in version 2018.4, and results in that version and later should now be much closer to (if not exactly matching) the previously referenced formula.

Mike Spoula
Solutions Architect - Services
Alteryx