I've attached a sample dataset.
My goal is to fuzzy match all the item descriptions, and then group them together. Thereafter, I intend to make a unique ID for each one.
I have a few problems:
For fuzzy match + grouping, I find that out of the 800k+ rows, only 500k+ come out in the final output. The accuracy is also rather questionable, as I can see similar rows that were assigned to different groups. How can I account for all 800k+ rows?
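One possible cause of the missing rows, assuming the fuzzy match step only outputs records that matched at least one other record, is that unmatched records are simply dropped. A minimal Python sketch of the idea of joining back to the full record list and giving every unmatched record its own single-member group (the names `assign_all_groups`, `matched_groups`, and the `SINGLE_` prefix are made up for illustration):

```python
def assign_all_groups(all_record_ids, matched_groups):
    """Return a group label for every record, matched or not.

    matched_groups maps record ID -> group label for the records that
    the fuzzy match step returned (hypothetical input shape).
    """
    result = {}
    for record_id in all_record_ids:
        if record_id in matched_groups:
            result[record_id] = matched_groups[record_id]
        else:
            # No match found: the record becomes a group of one.
            result[record_id] = f"SINGLE_{record_id}"
    return result


all_ids = [1, 2, 3, 4]
matched = {1: "G1", 2: "G1"}  # records 3 and 4 matched nothing
groups = assign_all_groups(all_ids, matched)
```

With this approach the output always has as many rows as the input, so a row count mismatch can be checked directly.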
With regard to data cleanliness, the input of 800k+ rows is not perfect. Some item descriptions are just special characters, some are just dates, and some are just the same word repeated in a row. I'm not sure how else to clean them besides removing the unwanted characters and uppercasing the remaining letters. My initial thought was that letters (forming long strings) would be good for matching, because I can set the Fuzzy Match threshold to about 20-30% and use a custom setting that tracks words (Best of Jaro & Levenshtein). I'm not sure if I'm on the right track.
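To make the cleaning step concrete, here is a rough Python sketch of what I mean, under my own assumptions: keep letters only, uppercase them, and collapse repeats. The function name `clean_description` and the minimum repeat length of 3 characters are arbitrary choices, not settings from any tool.

```python
import re


def clean_description(text):
    """Keep letters only, uppercase them, and collapse repeated words."""
    # Replace every run of non-letters (digits, punctuation, symbols)
    # with a space so that word boundaries survive.
    letters_only = re.sub(r"[^A-Za-z]+", " ", text).upper()
    words = []
    for word in letters_only.split():
        # Collapse a chunk of 3+ letters repeated back to back inside one
        # word, e.g. APRONAPRONAPRON -> APRON. Caveat: this also shortens
        # real words such as COUSCOUS.
        word = re.sub(r"(.{3,}?)\1+", r"\1", word)
        # Skip a word that just repeats the previous one.
        if not words or words[-1] != word:
            words.append(word)
    return " ".join(words)


print(clean_description("APRONAPRONAPRON"))  # APRON
print(clean_description("apron-large!!"))    # APRON LARGE
```

Descriptions that are only dates or special characters come out as an empty string, so they can be filtered out and reviewed separately instead of being sent to the fuzzy match.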
Lastly, I tried to generate a Unique ID for each group, so that when I join the final output of the fuzzy match + grouping to the original dataset with record IDs, I get to see the original item description and a group column next to it. So far I have only been using Formula + Tile to create it, and I have also tried Uuidcreate(). I need something that is static and will not change after each run. It has to be unique to each group, and not manually created as with my Formula + Tile approach.
My expected output is something like this:
Record ID | Item Description | Unique ID |
1 | ApronCaste | 21321321313 |
2 | ApronLARGE | 21321321313 |
3 | APRONAPRONAPRON | 21321321313 |
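For the static ID, one idea is to derive it from the group's content instead of generating it at run time. A minimal Python sketch, assuming the alphabetically smallest uppercased description is an acceptable representative for the group (the name `stable_group_id` and the 12-character length are my own choices):

```python
import hashlib


def stable_group_id(descriptions):
    """Derive a repeatable ID from a group's member descriptions."""
    # Pick a representative that does not depend on row order.
    representative = min(d.strip().upper() for d in descriptions)
    # SHA-256 gives the same digest on every run and every machine,
    # unlike a random UUID.
    digest = hashlib.sha256(representative.encode("utf-8")).hexdigest()
    # Truncated for readability; keep the full digest if collisions
    # are a concern.
    return digest[:12]


group = ["ApronCaste", "ApronLARGE", "APRONAPRONAPRON"]
group_id = stable_group_id(group)
```

The ID stays the same across runs as long as the representative stays the same. If a later run adds a member that sorts before the current representative, the ID changes, so this is only as stable as the grouping itself.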
About the side question: to be honest, I don't know. I have never used this with more than 10,000 rows, so I have never had to make it a more optimized/efficient process. 800k rows seems like a lot for this. If many of the descriptions recur every time a user runs the process, you could process them once, store the result somewhere, and reuse that result on later runs.
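The idea of storing and reusing results could look roughly like this in Python. This is only a sketch under assumptions: the cache is a JSON file, and `match_fn` is a hypothetical stand-in for whatever does the actual fuzzy grouping, taking a list of descriptions and returning a dictionary of description to group ID for every item it receives.

```python
import json
from pathlib import Path


def load_cache(path):
    """Read previously stored description -> group ID pairs, if any."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def group_with_cache(descriptions, cache_path, match_fn):
    """Group descriptions, calling match_fn only for unseen ones."""
    cache = load_cache(cache_path)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    new_items = [d for d in dict.fromkeys(descriptions) if d not in cache]
    if new_items:
        cache.update(match_fn(new_items))
        Path(cache_path).write_text(json.dumps(cache), encoding="utf-8")
    return {d: cache[d] for d in descriptions}
```

One limitation: in this simple form the new descriptions are only matched against each other, not against groups that are already cached, so a real process would probably need to pass the existing group representatives to the matching step as well.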
Side question: Is it possible to fuzzy match a fuzzy matched group? Will that be process intensive / useless?