I've attached a sample dataset.
My goal is to fuzzy match all the item descriptions, and then group them together. Thereafter, I intend to make a unique ID for each one.
I have a few problems:
- Fuzzy Match + Grouping
- Data Cleanliness
- Unique ID
For fuzzy match + grouping, I find that out of the 800k rows, only 500k+ of them came out in the final output. The accuracy of which is also rather questionable as I can see similar rows but having different groups. How can I account for all 800k+ rows?
With regards to data cleanliness, the input of 800k+ rows is not perfect. Some item descriptions are just special characters, some just dates, and some just repeated words in a row. I'm not sure how else to clean them besides removing the unwanted characters and uppercasing only letters. My initial thought was that letters (forming long strings) will be good as a match because I can set the Fuzzy Match threshold to about 20-30%, and have a custom setting that tracks words (Best of Jaro & Levenshtein). Not sure if I'm on the right track.
Lastly, I tried to generate a Unique ID for each group, so that when I join the final output of a fuzzy match + grouping to the original dataset with record IDs, I get to see the original item description and a group column next to it. But I have only been using Formula + Tile to create it, and I have tried Uuidcreate(). I need somethin that is static, and will not change after each run. It has to be unique to each group, and not manually created like with my formula + tile tool.
My expected output is something like this:
| Record ID | Item Description | Unique ID |
| 1 | ApronCaste | 21321321313 |
| 2 | ApronLARGE | 21321321313 |
| 3 | APRONAPRONAPRON | 21321321313 |