I've attached a sample dataset.
My goal is to fuzzy match all the item descriptions and group them together. Thereafter, I intend to make a unique ID for each group.
I have a few problems:
For fuzzy match + grouping, I find that out of the 800k rows, only 500k+ of them came out in the final output. The accuracy is also rather questionable, as I can see similar rows assigned to different groups. How can I account for all 800k+ rows?
With regards to data cleanliness, the input of 800k+ rows is not perfect. Some item descriptions are just special characters, some are just dates, and some are just the same word repeated in a row. I'm not sure how else to clean them besides removing the unwanted characters and keeping only uppercased letters. My initial thought was that long letter-only strings would match well, because I can set the Fuzzy Match threshold to about 20-30% and use a custom setting that tracks words (Best of Jaro & Levenshtein). I'm not sure if I'm on the right track.
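To illustrate what I mean, here is roughly the cleaning step in Python terms (just a sketch of the idea, not my actual workflow; the function name is made up):

```python
import re

def clean_description(text: str) -> str:
    """Keep letters only, uppercase, and collapse words repeated in a row."""
    text = re.sub(r"[^A-Za-z]+", " ", text)  # strip digits, dates, special characters
    words = text.upper().split()
    # drop consecutive repeats, e.g. "APRON APRON APRON" -> "APRON"
    deduped = [w for i, w in enumerate(words) if i == 0 or w != words[i - 1]]
    return " ".join(deduped)

# clean_description("apron #Apron 2023-01-01 APRON") -> "APRON"
```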
Lastly, I tried to generate a unique ID for each group, so that when I join the final output of the fuzzy match + grouping back to the original dataset (with record IDs), I can see the original item description with its group column next to it. So far I have only used Formula + Tile to create it, and I have also tried UuidCreate(). I need something static that will not change after each run. It has to be unique to each group, and not manually created like with my Formula + Tile approach.
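For context, the kind of static ID I'm after is deterministic, like hashing whatever key represents the group, so the same group always gets the same ID on every run (a sketch of the idea only):

```python
import hashlib

def group_id(group_key: str) -> str:
    """Deterministic ID: the same group key always hashes to the same
    value, unlike UuidCreate() or Tile, which change between runs."""
    return hashlib.md5(group_key.encode("utf-8")).hexdigest()[:12]

# group_id("APRON") -> the same 12-character hex string on every run,
# shared by every member of the "APRON" group
```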
My expected output is something like this:
| Record ID | Item Description | Unique ID   |
|-----------|------------------|-------------|
| 1         | ApronCaste       | 21321321313 |
| 2         | ApronLARGE       | 21321321313 |
| 3         | APRONAPRONAPRON  | 21321321313 |
Hi @caltang
See if the attached workflow works for you. At least considering your sample input, the groups of unique keys seem to make sense, and they will be static because the key is based on the Description field.
With the sample I gave you, it worked fine. However, with a larger dataset of about 1,000 rows (same types of data), the Fuzzy Match gives mixed results, and the final output has duplicates, which is not what is required at this point.
I tried it with an even larger dataset of about 400k rows, and the workflow just stopped at 50% loading (Fuzzy Match).
Perhaps I need to be clearer in my requirements... let me try again:
What I tried was: [workflow screenshot]
The results: [results screenshot]
This result assumes the Fuzzy Match loads and finishes (about 3 minutes per run). Sometimes, after more changes, the process gets stuck at 50%.
Not sure where to go from here.
@Felipe_Ribeir0 your help is much appreciated, by the way.
P.S.: Can I suggest that users clean up the dataset first? User inputs are messy, but I'm sure there are other columns that can be used for the job.
Hi @caltang
I believe it's going to be hard to work through your requirements without seeing/having the full dataset, or at least a big enough sample of it. The duplicates can be removed with a Unique tool, right?
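For reference, outside of Alteryx that de-dup is just dropping exact duplicates on the key columns, something like this (a pandas sketch; the file and column names are assumed):

```python
import pandas as pd

df = pd.read_csv("items.csv")  # assumed input with Record ID / Item Description
deduped = df.drop_duplicates(subset=["Record ID", "Item Description"])
```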
Anyway, let's see if someone else can help.
Hi @Felipe_Ribeir0,
Really sorry that I cannot provide you the full dataset due to privacy. The snippet I shared with you is also made up, but follows the style shown in the dataset.
Regarding the Unique tool, that depends - unique based on Record ID, Description - right?
In addition, the input had 25 rows, but the output shown in your image has 9 rows.
16 rows were not grouped or assigned a Unique ID...
Is there a way to assign all 25 rows using Fuzzy Match? By extension to the 800k+ rows as well?
Hi @caltang
Try this new version. The rows with a MatchScore had more than a 70% match; the rows with null values had less.
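In plain terms, the logic is: a row joins a group when it scores at or above the cutoff, and anything that never matches keeps its own group, so no row is dropped. A rough Python equivalent of that idea (using rapidfuzz; the 70 mirrors the MatchScore filter, everything else is illustrative):

```python
from rapidfuzz import fuzz

THRESHOLD = 70  # same cutoff as the MatchScore filter

def assign_groups(descriptions):
    """Greedy grouping: each row joins the first group whose representative
    scores >= THRESHOLD; otherwise it starts its own group, so every
    input row ends up with a group and none are dropped."""
    representatives = []  # one representative description per group
    groups = []           # group index assigned to each input row
    for desc in descriptions:
        for gid, rep in enumerate(representatives):
            if fuzz.ratio(desc, rep) >= THRESHOLD:
                groups.append(gid)
                break
        else:
            representatives.append(desc)
            groups.append(len(representatives) - 1)
    return groups

# assign_groups(["APRONCASTE", "APRONLARGE", "TABLE", "TABLES"])
# -> roughly [0, 0, 1, 1]; a row matching nothing gets a new group of its own
```

Note that this greedy pass compares every row against every existing group, so on 800k rows you would want to block/partition on a key first rather than compare everything against everything.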
Hi @Felipe_Ribeir0 !
Thanks for the prompt response. However, that's still 19 records, 6 short of the full 25.
Side question: is it possible to fuzzy match a fuzzy-matched group? Would that be process-intensive / useless?
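What I have in mind is merging groups transitively: if A matches B and B matches C, all three collapse into one group. A union-find sketch of that idea (purely illustrative):

```python
def merge_groups(pairs, n):
    """Union-find over n rows: matched pairs are merged transitively,
    which is what re-matching already-matched groups would achieve."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups fast
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)
    return [find(i) for i in range(n)]

# merge_groups([(0, 1), (1, 2)], 4) -> [2, 2, 2, 3]
# rows 0, 1, 2 end up in one group; row 3 stays on its own
```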
Thanks!
Fantastic! Thanks @Felipe_Ribeir0
I'll experiment further with the 800k+ rows, but for now I think this is the best solution.
I hope for more people to chime in!