Alteryx Designer Desktop Discussions


Fuzzy Match + Group very messy dataset (800k+ rows) + UniqueID

caltang
17 - Castor

I've attached a sample dataset. 

 

My goal is to fuzzy match all the item descriptions and group them together. Thereafter, I intend to assign a unique ID to each group.

 

I have a few problems:

  1. Fuzzy Match + Grouping
  2. Data Cleanliness 
  3. Unique ID

For fuzzy match + grouping, I find that out of the 800k rows, only 500k+ of them came out in the final output. The accuracy is also rather questionable, as I can see similar rows ending up in different groups. How can I account for all 800k+ rows?
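
To illustrate what I suspect is happening, here is a rough sketch in plain Python (not the Designer workflow; difflib is just a stand-in for the Fuzzy Match tool's scoring, and the sample rows are made up): a record that never clears the threshold against any other record produces no match pair, so it drops out of the match output entirely.

import difflib

records = {1: "APRON CASTE", 2: "APRON LARGE", 3: "ZX-9987 !!"}  # made-up sample rows
threshold = 0.3  # roughly the 20-30% threshold mentioned below

matched_ids = set()
items = list(records.items())
for i, (id_a, desc_a) in enumerate(items):
    for id_b, desc_b in items[i + 1:]:
        score = difflib.SequenceMatcher(None, desc_a, desc_b).ratio()
        if score >= threshold:
            matched_ids.update({id_a, id_b})

unmatched = set(records) - matched_ids  # these rows never appear in the match output
print(unmatched)                        # e.g. {3} - they would need to be unioned back in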

 

With regards to data cleanliness, the input of 800k+ rows is not perfect. Some item descriptions are just special characters, some are just dates, and some are just the same word repeated over and over. I'm not sure how else to clean them besides removing the unwanted characters and keeping only uppercased letters. My initial thought was that long strings of letters would make good match keys, because I can set the Fuzzy Match threshold to about 20-30% and use a custom setting that compares words (Best of Jaro & Levenshtein). I'm not sure if I'm on the right track.
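
This is roughly the cleaning I have in mind, sketched in Python for clarity (inside Designer this would be Formula / RegEx tools; the sample input is invented):

import re

def clean_description(raw: str) -> str:
    # Uppercase, keep letters and spaces only, then collapse immediately repeated words.
    text = re.sub(r"[^A-Z ]+", " ", raw.upper())
    words = text.split()
    deduped = [w for i, w in enumerate(words) if i == 0 or w != words[i - 1]]
    return " ".join(deduped)

print(clean_description("APRON  APRON apron - Large!! 2021-05-01"))  # -> "APRON LARGE"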

 

Lastly, I tried to generate a Unique ID for each group, so that when I join the final fuzzy match + grouping output back to the original dataset (with Record IDs), I can see the original item description with a group column next to it. So far I have only used Formula + Tile to create it, and I have also tried UuidCreate(). I need something static that will not change after each run. It has to be unique to each group, and not manually created like with my Formula + Tile approach.
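
Just to illustrate the kind of determinism I mean, here is a sketch in Python: hashing whatever key represents the group always yields the same ID for the same input (hashing is only one way to get this behaviour, not necessarily what I will end up using in Designer; the group key shown is a placeholder).

import hashlib

def group_id(group_key: str) -> str:
    # Deterministic: the same group key always hashes to the same ID, run after run.
    return hashlib.md5(group_key.encode("utf-8")).hexdigest()

print(group_id("APRON"))  # identical on every run, unlike UuidCreate()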

 

My expected output is something like this:

 

Record ID    Item Description     Unique ID
1            ApronCaste           21321321313
2            ApronLARGE           21321321313
3            APRONAPRONAPRON      21321321313

Calvin Tang
Alteryx ACE
https://www.linkedin.com/in/calvintangkw/
Felipe_Ribeir0
16 - Nebula

About this side question, I don't know, to be honest. I have never used this with more than 10,000 rows, so I have never had to make the process more optimized/efficient. 800k rows seems like a lot of rows for this. Maybe, if you have a lot of recurring descriptions every time a user runs the process, you can process them once, keep the result stored somewhere, and reuse that result.
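
Something along these lines, just as a sketch in Python (the file name and field names are made up, and this would live outside or alongside the Designer workflow): keep the description -> group ID pairs from earlier runs, and only push descriptions that were never seen before through the fuzzy match.

import csv
import os

CACHE_FILE = "matched_groups.csv"  # hypothetical store of previous results

# Load description -> group ID pairs saved from earlier runs, if any.
cache = {}
if os.path.exists(CACHE_FILE):
    with open(CACHE_FILE, newline="", encoding="utf-8") as f:
        cache = {row["description"]: row["group_id"] for row in csv.DictReader(f)}

incoming = ["APRON LARGE", "APRON CASTE", "BRAND NEW ITEM"]  # descriptions from the current run
new_descriptions = [d for d in incoming if d not in cache]   # only these need fuzzy matching
print(new_descriptions)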

 

Side question: Is it possible to fuzzy match a fuzzy matched group? Will that be process intensive / useless? 

Adamyde
6 - Meteoroid

👍
