Alteryx Designer Desktop Discussions

caltang · ‎01-06-2023

I've attached a sample dataset.

My goal is to fuzzy match all the item descriptions, and then group them together. Thereafter, I intend to make a unique ID for each one.

I have a few problems:

Fuzzy Match + Grouping
Data Cleanliness
Unique ID

For fuzzy match + grouping, only 500k+ of the 800k+ input rows come out in the final output. The accuracy is also questionable, as I can see similar rows that were assigned to different groups. How can I account for all 800k+ rows?
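One way to reason about the missing rows, sketched in Python rather than as Alteryx tools: assume the fuzzy match step only emits pairs of records that matched, so records with no match never reach the output. If that is the cause, starting from the full list of record IDs and merging the matched pairs keeps every row, with unmatched rows ending up as single-member groups. The function name `assign_groups` and the sample IDs are hypothetical, purely for illustration.

```python
def assign_groups(record_ids, match_pairs):
    """Give every record a group label, including records with no match.

    record_ids  : all record IDs from the original input (hypothetical data)
    match_pairs : (id_a, id_b) pairs that the fuzzy match reported as similar
    """
    # Every record starts as its own group, so unmatched rows are never lost.
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Follow parent links to the group root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in match_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller ID as the root so the label is deterministic.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return {rid: find(rid) for rid in record_ids}


groups = assign_groups([1, 2, 3, 4, 5], [(1, 2), (2, 3)])
print(groups)  # {1: 1, 2: 1, 3: 1, 4: 4, 5: 5}
```

Records 4 and 5 had no match but still appear, each in its own group. Merging pairs this way also puts 1 and 3 in the same group even though they were never matched to each other directly.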

With regard to data cleanliness, the input of 800k+ rows is not perfect. Some item descriptions are just special characters, some are just dates, and some are just the same word repeated in a row. I'm not sure how else to clean them besides removing the unwanted characters, keeping only letters, and uppercasing them. My initial thought was that letters (forming long strings) would make a good match key, because I can set the Fuzzy Match threshold to about 20-30% and use a custom setting that tracks words (Best of Jaro & Levenshtein). I'm not sure if I'm on the right track.
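The cleaning described above could look roughly like this in Python. This is a minimal sketch, not a tested recipe for the attached dataset: the function name `clean_description` and the rule that a repeated chunk must be at least three letters long are assumptions made for illustration.

```python
import re


def clean_description(text: str) -> str:
    """Keep letters only, uppercase, and collapse obvious repetition."""
    # Replace anything that is not a letter or whitespace with a space.
    cleaned = re.sub(r"[^A-Za-z\s]", " ", text).upper()
    # Collapse runs of whitespace into single spaces.
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # Collapse the same word repeated in a row: "APRON APRON" -> "APRON".
    cleaned = re.sub(r"\b(\w+)(?: \1\b)+", r"\1", cleaned)
    # Collapse a whole string that is one chunk repeated without spaces
    # (assumed minimum chunk length of 3): "APRONAPRONAPRON" -> "APRON".
    cleaned = re.sub(r"^(.{3,}?)\1+$", r"\1", cleaned)
    return cleaned


print(clean_description("APRONAPRONAPRON"))    # APRON
print(clean_description("apron apron large"))  # APRON LARGE
print(repr(clean_description("#$% 01-06-2023")))  # ''
```

Rows that come back empty (only special characters or dates) have nothing left to match on, so it may make sense to route them to a separate stream instead of sending them through the fuzzy match.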

Lastly, I tried to generate a Unique ID for each group, so that when I join the final output of the fuzzy match + grouping back to the original dataset with record IDs, I can see the original item description with a group column next to it. So far I have only used Formula + Tile to create it, and I have also tried Uuidcreate(). I need something that is static and will not change between runs. It has to be unique to each group, and not manually created as with my Formula + Tile approach.
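One possible way to get an ID that does not change between runs is to derive it from the group's content instead of generating it randomly. The Python sketch below hashes a representative description per group. Choosing the alphabetically smallest cleaned description as the representative, truncating to 16 hex characters, and the name `group_id` are all assumptions for illustration, not an Alteryx feature.

```python
import hashlib


def group_id(members: list[str]) -> str:
    """Derive a repeatable ID from the descriptions in one group."""
    # Hypothetical choice of representative: the smallest cleaned description.
    # It does not depend on row order, so reordering the input changes nothing.
    key = min(members)
    # Hashing the same key always yields the same digest, run after run.
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    # Truncated for readability; a shorter ID raises the chance of collisions.
    return digest[:16]


first_run = group_id(["APRON LARGE", "APRON", "APRON CASTE"])
second_run = group_id(["APRON CASTE", "APRON LARGE", "APRON"])
print(first_run == second_run)  # True
```

The ID is only as stable as the representative: if a later run adds a member that sorts before the current one, or the group itself changes, the ID changes too.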

My expected output is something like this:

Record ID	Item Description	Unique ID
1	ApronCaste	21321321313
2	ApronLARGE	21321321313
3	APRONAPRONAPRON	21321321313

Calvin Tang
Alteryx ACE
https://www.linkedin.com/in/calvintangkw/

Felipe_Ribeir0 · ‎01-08-2023

About the side question: to be honest, I don't know. I have never used this with more than 10,000 rows, so I haven't had to make the process more optimized/efficient and don't know how. 800k rows seems like a lot for this. If many of the descriptions recur every time a user runs the process, you could process them once, store the result somewhere, and reuse that result.

Side question: Is it possible to fuzzy match a fuzzy matched group? Will that be process intensive / useless?
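The "process once, store, and reuse" idea could be sketched like this in Python. It assumes the stored result is a simple lookup from cleaned description to group ID; the name `split_known_and_new` and the sample values are hypothetical.

```python
def split_known_and_new(descriptions, stored_groups):
    """Separate descriptions that already have a stored group from new ones.

    descriptions  : cleaned descriptions in the current run
    stored_groups : dict of description -> group ID saved from an earlier run
    """
    # Descriptions seen before reuse their stored group ID.
    known = {d: stored_groups[d] for d in descriptions if d in stored_groups}
    # Only descriptions never seen before still need fuzzy matching.
    new = [d for d in descriptions if d not in stored_groups]
    return known, new


stored = {"APRON": "g1", "GLOVES": "g2"}
known, new = split_known_and_new(["APRON", "HELMET", "GLOVES"], stored)
print(known)  # {'APRON': 'g1', 'GLOVES': 'g2'}
print(new)    # ['HELMET']
```

Only the `new` list would go through the expensive matching step, which is where the saving would come from if most descriptions recur.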

Adamyde · ‎01-08-2023

👍


Fuzzy Match + Group very messy dataset (800k+ rows) + UniqueID