
Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.

Fuzzy Match + Group very messy dataset (800k+ rows) + UniqueID

caltang
17 - Castor

I've attached a sample dataset. 

 

My goal is to fuzzy match all the item descriptions and group them together. Thereafter, I intend to assign a unique ID to each group.

 

I have a few problems:

  1. Fuzzy Match + Grouping
  2. Data Cleanliness 
  3. Unique ID

For fuzzy match + grouping, I find that out of the 800k rows, only 500k+ come out in the final output. The accuracy is also rather questionable, as I can see similar rows ending up in different groups. How can I account for all 800k+ rows?
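To make the behaviour I'm after concrete, here is a rough Python sketch (not the Alteryx workflow itself) of grouping from matched pairs with union-find, where every record keeps a group even if the Fuzzy Match tool found nothing for it. The `pairs` input is just a stand-in for the tool's matched RecordID pairs.

```python
# Sketch only: group records from fuzzy-match pairs with union-find,
# so rows with no match above the threshold still end up as their own group.

def group_records(record_ids, pairs):
    parent = {rid: rid for rid in record_ids}  # every row starts as its own group

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path compression
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for a, b in pairs:                         # each matched pair merges two groups
        union(a, b)

    return {rid: find(rid) for rid in record_ids}  # RecordID -> group representative

# Example: record 3 has no match above the threshold, but it still gets a group.
groups = group_records([1, 2, 3], pairs=[(1, 2)])
print(groups)  # {1: 1, 2: 1, 3: 3}
```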

 

With regards to data cleanliness, the input of 800k+ rows is not perfect. Some item descriptions are just special characters, some are just dates, and some are the same word repeated several times in a row. I'm not sure how else to clean them besides removing the unwanted characters and uppercasing the remaining letters. My initial thought was that long strings of letters would match well, because I can set the Fuzzy Match threshold to about 20-30% and use a custom match style that compares words (Best of Jaro & Levenshtein). I'm not sure if I'm on the right track.
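For illustration, this is roughly the cleaning I have in mind, as a Python sketch; the exact rules here (letters only, uppercase, collapse repeated words, exclude rows that end up empty) are assumptions, and in the workflow they would be Data Cleansing / RegEx tools rather than this code.

```python
import re

# Assumed cleaning rules for item descriptions (a sketch, not the workflow):
# keep letters only, uppercase, collapse immediately repeated words, and flag
# descriptions that are empty after cleaning (dates, symbols, etc.).

def clean_description(text):
    text = re.sub(r"[^A-Za-z ]+", " ", text)          # drop digits/special characters
    text = re.sub(r"\s+", " ", text).strip().upper()  # normalize spacing and case
    # collapse runs of the same word, e.g. "APRON APRON APRON" -> "APRON"
    text = re.sub(r"\b(\w+)(?: \1\b)+", r"\1", text)
    return text

examples = ["Apron-Caste!!", "12/05/2023", "apron apron apron"]
for e in examples:
    cleaned = clean_description(e)
    print(repr(e), "->", repr(cleaned) if cleaned else "EXCLUDE (nothing left)")
```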

 

Lastly, I tried to generate a unique ID for each group, so that when I join the final output of the fuzzy match + grouping back to the original dataset by Record ID, I can see the original item description with its group column next to it. So far I have only used Formula + Tile to create the ID, and I have also tried UuidCreate(). I need something that is static and will not change after each run. It has to be unique to each group, and not manually created like my Formula + Tile approach.
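To show what I mean by static, here is a minimal Python sketch: if each group has a canonical key (for example its cleaned representative description, which is an assumption about the grouping output), hashing that key produces the same ID on every run, unlike UuidCreate(), which generates a new value each time.

```python
import hashlib

# Sketch of a run-stable group ID: hash the group's canonical key so the same
# group always receives the same ID, regardless of when the workflow runs.
# The choice of key (cleaned representative description) is an assumption.

def group_id(canonical_key: str) -> str:
    return hashlib.md5(canonical_key.encode("utf-8")).hexdigest()

print(group_id("APRON"))  # identical output on every run for the same group key
```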

 

My expected output is something like this:

 

Record ID | Item Description | Unique ID
1         | ApronCaste       | 21321321313
2         | ApronLARGE       | 21321321313
3         | APRONAPRONAPRON  | 21321321313

Calvin Tang
Alteryx ACE
https://www.linkedin.com/in/calvintangkw/
11 REPLIES
Felipe_Ribeir0
16 - Nebula

About this side question, I don't know, to be honest. I have never used this with more than 10,000 rows, so I have never had to make the process more optimized/efficient. 800k rows seems like a lot for this. Maybe, if a lot of the descriptions recur every time a user runs the process, you can process them once, keep the result stored somewhere, and reuse that result.
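A minimal Python sketch of that "process once, store, reuse" idea (the cache file name and structure are assumptions, not an Alteryx feature): keep the description-to-group mapping from earlier runs and only send unseen descriptions through the expensive fuzzy-match step.

```python
import json
import os

CACHE_PATH = "fuzzy_groups.json"  # hypothetical cache of description -> group ID

def load_cache():
    # Reuse results from earlier runs if the cache file exists.
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH) as f:
            return json.load(f)
    return {}

def split_known_unknown(descriptions, cache):
    # Only the `unknown` descriptions need to go through fuzzy matching;
    # afterwards, add their results to the cache for the next run.
    known = {d: cache[d] for d in descriptions if d in cache}
    unknown = [d for d in descriptions if d not in cache]
    return known, unknown
```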

 

Side question: Is it possible to fuzzy match a fuzzy matched group? Will that be process intensive / useless? 

Adamyde
6 - Meteoroid

👍
