Alteryx Designer Desktop Discussions

Classifying strings into groups based on a sample classified file

luismarc
5 - Atom

Hello everyone

 

I have tried to find a solution but haven't succeeded, so I apologize if this is a simple, common problem.

 

File A

File A has 2 columns: (A) a keyword searched on a website (this column contains unique values) and (B) a classification (which has repeated values).
The classification was done manually, and the list contains 300,000 classified keywords.
Example (see File A attached):
keyword - classification
bed - furniture
tv - electrical appliances
tomb raider - games
bread - food
beer - drinks
table - furniture

 

File B

File B contains a single column showing a list of unique keywords that haven't been classified yet.

File C

File C is a list of expected classifications, that is, a list of unique values containing all the possible classifications for the "classification" column. I haven't provided it here, but in another tool (Power BI) this third file was a requirement to use FuzzyMatch.


Goal
The goal is to use the large sample of classified keywords in File A to automatically classify the keywords in File B into a new column (column B, just like in File A).

The result will be a file with 2 columns. In column A you should have all the keywords from File B and in column B the classification.

Some error is expected, but after some visual validation this result will be joined with File A, enriching the list of classified keywords for future use of this project.

What I have tried so far
I have tried to use Fuzzy Match and Groups, but none of the examples I found were similar to my case.

I hope I was able to explain it clearly, and I would truly appreciate it if someone could give me a hand.

Luis

1 REPLY
morr-co
10 - Fireball

Hi @luismarc - I have attached a workflow that takes an alternate approach to fuzzy matching. Fuzzy matching can be a great tool, but I've often found it has its limitations, particularly when you are comparing a single column of data in which many of the values are a single word. You often have to set the match threshold very high - even when leveraging the phonetic conversions - and the results aren't always helpful. Instead, I usually have more success with simple manipulations. In this case, I have used three:

 

  • Identify exact matches
  • Cleanse the Keywords in File A and File B and identify exact matches on the cleansed value
  • Using the cleansed field above, split multi-word values and determine if there are any single-word matches between the two sources
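
For readers who cannot open the attached workflow, the three steps above can be roughly sketched outside Alteryx. This is only an illustration in Python, not the workflow itself: the cleansing rule (lowercase, strip punctuation, collapse whitespace), the function names cleanse and classify, and the File B sample keywords are all assumptions. When a single word appears under more than one classification, this sketch simply keeps the first one it sees.

```python
import re


def cleanse(keyword: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace.

    This cleansing rule is an assumption; adjust it to your data.
    """
    no_punct = re.sub(r"[^\w\s]", " ", keyword.lower())
    return " ".join(no_punct.split())


def classify(keywords, classified):
    """Classify keywords with three passes of decreasing strictness.

    classified maps known keyword -> classification (File A).
    Returns a list of (keyword, classification or None, match type or None).
    """
    # Lookup for pass 2: cleansed keyword -> classification (first seen wins).
    cleansed_lookup = {}
    # Lookup for pass 3: single word -> classification (first seen wins).
    word_lookup = {}
    for known, label in classified.items():
        cleaned = cleanse(known)
        cleansed_lookup.setdefault(cleaned, label)
        for word in cleaned.split():
            word_lookup.setdefault(word, label)

    results = []
    for kw in keywords:
        # Pass 1: exact match on the raw keyword.
        if kw in classified:
            results.append((kw, classified[kw], "exact"))
            continue
        # Pass 2: exact match on the cleansed keyword.
        cleaned = cleanse(kw)
        if cleaned in cleansed_lookup:
            results.append((kw, cleansed_lookup[cleaned], "cleansed"))
            continue
        # Pass 3: first word of the keyword that matches any known word.
        label = next(
            (word_lookup[w] for w in cleaned.split() if w in word_lookup), None
        )
        results.append((kw, label, "word" if label is not None else None))
    return results


# File A sample from the question; the File B keywords are hypothetical.
file_a = {
    "bed": "furniture",
    "tv": "electrical appliances",
    "tomb raider": "games",
    "bread": "food",
    "beer": "drinks",
    "table": "furniture",
}
file_b = ["bed", "Tomb-Raider", "kitchen table", "sofa"]

for row in classify(file_b, file_a):
    print(row)
```

Keeping the match type next to each result makes the visual validation easier, since single-word matches are the most likely to be wrong and keywords with no match at all (such as "sofa" here) stay unclassified.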

You could obviously continue to build on this. Hope this is helpful!
