Free Trial

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Fuzzy Match Nickname Table for Abbreviations & Acronyms

KayTannee
6 - Meteoroid

Hi,

 

I'm trying to fuzzy match 2 data sets, I have an individuals data each with a list of products that they own. 1 Data set contains a clean name, with the name written out in its entirety with no errors. In the other data set, it is heavily abbreviated. Most are multi word, and the abbreviations and acronyms could mean a few different things.

 

For example

"Product Item Name Something Else"

Would be something like

"Prod It Nm Som El" or "PIN SE"

 

I've built a custom 'Nick Names' table, for the common type of abbreviations and acronyms, all is working great, where the mapping is 1 word to 1 word.

 

"Prod", "Product"

"It", "Item"

"Nm", "Name"

etc

 

But it doesn't work when there's spaces within the values.

 

"PIN", "Product Item Number"

 

Is this just a limitation of the tool that it can't handle Acronyms?

6 REPLIES 6
MarqueeCrew
20 - Arcturus
20 - Arcturus

@KayTannee,

 

Prior to fuzzy matching, I would suggest some degree of normalization of data values in you data.  As a generic tool, the fuzzy matching works as a single-tool.  Even then, the fuzzy match tool has many switches and options.  In production, I suggest multiple levels of matching and combinations of tools.  Exact matches should happen outside of the tool (e.g. JOIN).  When you have acronyms, you might want to use a FIND REPLACE tool and attempt an  exact match on the standardized form of the term (e.g. PIN).

 

If you are working with large volumes of data, you can use the following trick of the trade:

  • Store the entire search string as a MD5 Hash value
  • Once the string is associated to a result value, create a table of the HASH and the RESULT

If you remember the input value and what the answer becomes, then the next time that you see that same input value, you can short-cut the whole matching process and jump straight to the answer.  I know that this won't work for all fuzzy matching exercises, but if you are looking for MarqueeCrew in Michigan over and over again, it will.

 

Cheers,

Mark

Alteryx ACE & Top Community Contributor

Chaos reigns within. Repent, reflect and restart. Order shall return.
Please Subscribe to my youTube channel.
fiorano
8 - Asteroid

Hi, 

 

I have a similar issue,  I have a large dataset with itemID as a field with many differing item names which need to be mapped to a correct Master Item.

 

Is the any way of using fuzzy match to score the likelyhood of a ItemID matching a lookup table ??

 

An example of the itemID is

 

  • Speedy Supplies
  • SpeedySuppliers
  • Speedy. 
  • Speedy Supplies 10Pack
  • Speedy Supplies Pack10

 

Ideally I would need to map all these to 'Speedy Supplies'  - at the moment I'm doing this manually in Excel which is taking an age!  

 

Thanks for any advice,

KayTannee
6 - Meteoroid

Hi,

 

Standard fuzzy match should be reasonably decent for that kind of matching. I find that so long as the words are long enough there's usually a good match...

 

You sometimes have to do some formula replacing to standardise the names a bit prior, so long as they're reasonably simple its do able. 

 

I found this video really helpful. Understanding how the generate keys function works was the key (pardon the pun) to understanding how to get it to work as expected.

https://community.alteryx.com/t5/Live-Training/Live-Training-Fuzzy-Matching-Intermediate-Users/td-p/...

 

My current workflow involves using a nickname table to convert the values, so long as my to convert value to doesn't contain a space it works perfectly and is matching spot on. Which would probably work for you too.

 

However if there's a space involved in the to convert to group, then it just gets ignored as far as I can tell. 

 

If my abbreviations were a 1:1 it would probably be easy enough to stream through the values I'm trying to look up into, and swap them out for their correct abbreviations. But the 2, 3 or 4 letter abbreviation may be used multiple times for multiple complete different meanings. 

 

I think my plan is to filter the fuzzy nickname table for anything with a space in it, then go through each of my values swapping it out multiple times for each entry where there's a match. So if the abbreviation has multiple, I end up with multiple versions of the replacement one for each possibility. Then fuzzy match against these, taking the one with the highest match %.

 

The issue is, I have not actually worked out how to swap out values against a list, generating a new line for each swap. Or if there's multiple abbreviations per value, how to do it for each recursively. It would be a simple process in something like VBA, I'm sure Alteryx can do it - just dont know how yet. I'll respond back here once work it out and if it works.

 

Ie.

 

If the nickname table contained :

 

ABC = Alpha Beta Charlie

ABC = A Bunch Carrots

ABC = Apple Banana Celery

DEF = Delta Echo

DEF = Duck Elephant Fox

 

 

Would become for the value 'ABC DEF' :

 

'Alpha Beta Charlie Delta Echo'

'Alpha Beta Charlie Duck Elephant Fox'

'Alpha Beta Charlie DEF'

'A Bunch Carrots Delta Echo'

'A Bunch Carrots Duck Elephant Fox'

'A Bunch Carrots DEF'

'Apple Banana Celery Delta Echo'

'Apple Banana Celery Duck Elephant Fox'

'Apple Banana Celery DEF'

'ABC Delta Echo'

'ABC Duck Elephant Fox'

'ABC DEF'

 

 

 

KayTannee
6 - Meteoroid

@

 

Hi Thank you,

 

I'm investigating the Find Replace tool, which would do the job - except it only does a 1:1 replacement. It doesn't seem to handle duplicates at all, I think if I can work out how to do the below then it would fix it. But I can't find any tool that will do it, or work out a combination of tools to do it.

 

If the nickname table contained :

 

ABC = Alpha Beta Charlie

ABC = A Bunch Carrots

ABC = Apple Banana Celery

DEF = Delta Echo

DEF = Duck Elephant Fox

 

 

Would become for the value 'ABC DEF' :

 

'Alpha Beta Charlie Delta Echo'

'Alpha Beta Charlie Duck Elephant Fox'

'Alpha Beta Charlie DEF'

'A Bunch Carrots Delta Echo'

'A Bunch Carrots Duck Elephant Fox'

'A Bunch Carrots DEF'

'Apple Banana Celery Delta Echo'

'Apple Banana Celery Duck Elephant Fox'

'Apple Banana Celery DEF'

'ABC Delta Echo'

'ABC Duck Elephant Fox'

'ABC DEF'

KayTannee
6 - Meteoroid

 

Capture.PNG

I'm sure there's going to be better ways of doing this, but this is basically what I've ended up with that does the job. First I filter for values where GroupName Contains a space, used the append to build a unique record for every combination. Then use a formula to swap out where it exists in list, filter to remove any where no match occurred. Then rinse repeat. In this example I'm only assuming there'll be a maximum of 3 abbreviations in any one value, which isn't great. I've seen something called the iterative macro, not sure how to use it yet, but guessing will allow me to loop this logic until there's no results returned.

 

I use this formula to find/replace, so that it'll only do it if its a 'word' not a couple letters in the middle of another word. Probably should split it out so not Elseifs, incase the same abbreviation features twice in the same string. ie 'PIN foo PIN bar PIN'. But it'll do for my current purposes and as an example.

 

Each value before formula replace is assigned an ID. Then after the fuzzy match, sort by best match, unique and then I join back to the original value, swapping out the formula replaced value. 

 

IF Contains([Description]," "+[Name]+" ") THEN 
Replace([Description]," "+[name]+" "," "+[GroupName]+" ")
ELSEIF Left([Description], length([Name]) + 1) = [Name]+" " THEN
[groupname] + right([Description], length([Description])-length([name]))
ELSEIF right([Description], length([name]) + 1) = " "+[Name] THEN
Left([Description], length([Description])-length([name])) + [groupname]
ELSE
""
ENDIF

 

 

 

ejsemez
5 - Atom

Hi 

I process my data in an excel macro that remove some words and also remove characters. Then I process it for keyword extraction. this gives me a list of topics for each row. I tried fuzzywuzzy to find similar texts but it is difficult to achieve an acceptable results when the topics are too technical. for example:

the algorithm doesn't find similarity between the title ALARM BOUNCING - GSM alarms : FE On Site    and the title GSM Beta TS | Tech is going down.

If I apply my macro then I get both tickets categorized as GSM. This is too "square". I wonder if there is a way to train the model to spot synonyms and or expanded acronyms 

Labels
Top Solution Authors