Fuzzy Match + Group very messy dataset (800k+ rows) + UniqueID

I've attached a sample dataset.

My goal is to fuzzy match all the item descriptions, and then group them together. Thereafter, I intend to make a unique ID for each one.

I have a few problems:

Fuzzy Match + Grouping
Data Cleanliness
Unique ID

For fuzzy match + grouping, only 500k+ of the 800k rows came out in the final output. The accuracy is also questionable, as I can see similar rows that were assigned to different groups. How can I account for all 800k+ rows?

With regards to data cleanliness, the input of 800k+ rows is not perfect. Some item descriptions are just special characters, some just dates, and some just repeated words in a row. I'm not sure how else to clean them besides removing the unwanted characters and uppercasing only letters. My initial thought was that letters (forming long strings) will be good as a match because I can set the Fuzzy Match threshold to about 20-30%, and have a custom setting that tracks words (Best of Jaro & Levenshtein). Not sure if I'm on the right track.

Lastly, I tried to generate a Unique ID for each group, so that when I join the final output of the fuzzy match + grouping back to the original dataset with record IDs, I get to see the original item description with a group column next to it. So far I have only used Formula + Tile to create it, and I have tried Uuidcreate(). I need something that is static and will not change between runs. It has to be unique to each group, and not manually created like with my Formula + Tile approach.

My expected output is something like this:

Record ID	Item Description	Unique ID
1	ApronCaste	21321321313
2	ApronLARGE	21321321313
3	APRONAPRONAPRON	21321321313

SampleData_FM.xlsx

Felipe_Ribeir0

Hi @caltang

See if the attached workflow works for you. At least for your sample input, the groups of unique keys seem to make sense, and they will be static because they're based on the Description field.
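
To illustrate the idea of an ID that stays static because it is derived from the description, here is a minimal Python sketch. It is not what the attached workflow does; the function name `group_id`, the choice of SHA-1, and the 12-character length are all assumptions made for the example. It hashes one key per group, so the same key gives the same ID on every run.

```python
import hashlib


def group_id(group_key, length=12):
    """Return a deterministic alphanumeric ID for one group.

    group_key is assumed to be a single cleaned description chosen to
    represent the group (hypothetical convention, e.g. the alphabetically
    first member). The same key always yields the same ID.
    """
    digest = hashlib.sha1(group_key.encode("utf-8")).hexdigest()
    # Truncating keeps the ID short but raises the (small) collision risk.
    return digest[:length].upper()


print(group_id("APRON"))
```

The ID is only as stable as the key: if the representative description of a group changes between runs, the ID changes with it, so the key should be picked by a fixed rule.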

fuzzy.yxmd

caltang

Hi @Felipe_Ribeir0

With the sample I gave you, it worked fine. However, with a larger dataset of about 1,000 rows (same types of data), the Fuzzy Match gives mixed results, and the final output has duplicates, which is not what I need.

I tried it with an even larger dataset of about 400k rows, and the workflow just stopped at 50% loading (Fuzzy Match).

Perhaps I need to be clearer in my requirements... let me try again:

I have a dataset (800k rows) that contains 1 column (Descriptions), and the values are very messy. Some are dates only, some are special characters only, some are alphanumeric, some repeat the same words within a row (very long duplicates like: Example Given is Here Example Given is Here Example Given is Here.... x15), and the words are not all the same length.
I am trying to assign a unique ID (alphanumeric, standard, fixed each run) to each of the rows based on the group it belongs to.
Grouping is easy to do by eye, but doing that for over 800k rows is too time-consuming. That's why Fuzzy Match + Group was used. (Sidenote: Are there any ML tools besides these two that can cleanse + group them?)

What I tried was:

Cleanse the data by removing numbers, punctuation, whitespace, tabs etc. using Data Cleanse.
Filter out the blanks.
Fuzzy Match that output and then group them.
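
Outside Alteryx, the first two steps above can be sketched roughly like this in Python. This is only an illustration under assumptions: the names `clean_description` and `clean_and_filter` are made up, and the extra step that collapses a description consisting of one chunk repeated back to back (like APRONAPRONAPRON) is an addition, not something Data Cleanse does. Blank rows are kept in a separate list instead of being dropped, so they can still be accounted for later.

```python
import re

_NON_LETTERS = re.compile(r"[^A-Za-z]+")
_WHOLE_REPEAT = re.compile(r"(.+?)\1+")  # one chunk repeated end to end


def clean_description(text):
    """Keep letters only, uppercase them, and collapse whole-string repeats."""
    letters = _NON_LETTERS.sub("", text).upper()
    repeat = _WHOLE_REPEAT.fullmatch(letters)
    return repeat.group(1) if repeat else letters


def clean_and_filter(records):
    """Split (record_id, description) pairs into cleaned rows and blank rows."""
    kept, blanks = [], []
    for record_id, description in records:
        key = clean_description(description)
        (kept if key else blanks).append((record_id, key))
    return kept, blanks


kept, blanks = clean_and_filter([
    (1, "ApronCaste"),
    (2, "APRONAPRONAPRON"),
    (3, "12/05/2021"),
    (4, "***"),
])
```

One caveat with the collapse step: it also shortens legitimate words that happen to be a repeated chunk (for example MAMA becomes MA), so it may need a minimum chunk length in practice.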

The results:

The accuracy of the group is questionable since I did cross check with my eyes, and some groups were assigned wrongly.
Not all records were returned (I left it unchecked for Fuzzy Match). So, out of 800k, only 500k showed up in the final output.
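
One way to make sure every record comes out with a group is to start with each record in its own group and only merge groups when two descriptions score above a threshold. The Python sketch below shows that idea with a small union-find. It uses `difflib.SequenceMatcher` from the standard library as the similarity measure, which is a stand-in and not the Jaro/Levenshtein scoring of the Fuzzy Match tool; the function name `group_records` and the 0.6 threshold are assumptions to tune.

```python
from difflib import SequenceMatcher


def group_records(descriptions, threshold=0.6):
    """Return one group label per description; unmatched rows keep their own."""
    parent = list(range(len(descriptions)))  # every record starts as a group

    def find(i):
        # Follow parent links to the group's root, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(descriptions)):
        for j in range(i + 1, len(descriptions)):
            score = SequenceMatcher(None, descriptions[i], descriptions[j]).ratio()
            if score >= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    # Merge the two groups, keeping the smaller index as root.
                    parent[max(root_i, root_j)] = min(root_i, root_j)

    return [find(i) for i in range(len(descriptions))]


labels = group_records(["APRONCASTE", "APRONLARGE", "APRON", "ZIPPER"])
```

The output always has as many labels as input rows, so nothing is lost. Comparing every pair is quadratic, though, so this would not scale to 800k rows as written; it would need some form of blocking (for example, only comparing descriptions that share a first few letters).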

These results assume the fuzzy match loads and finishes (about 3 mins each run). Sometimes, after making more changes, the process gets stuck at 50%.

Not sure where to go from here.

@Felipe_Ribeir0 your help is much appreciated, by the way.

P.S: Can I suggest to users to clean up the dataset first? User inputs are messy, but I'm sure there are other columns that can be used for the job.

Felipe_Ribeir0

Hi @caltang

I believe it's going to be hard to work through your requirements without seeing the full dataset, or at least a big enough sample of it. The duplicates can be removed with a Unique tool, right?

Anyway, let's see if someone else can help.

fuzzy (1).yxmd

caltang

Hi @Felipe_Ribeir0,

Really sorry that I cannot provide you the full dataset due to privacy. The snippet I shared with you is also made up, but follows the style shown in the dataset.

Regarding the Unique tool, that depends - unique based on Record ID, Description - right?

caltang

In addition, the input had 25 rows, but the output shown in your image has 9 rows.

16 rows were not grouped and assigned a Unique ID...

Is there a way to assign all 25 rows using Fuzzy Match? By extension to the 800k+ rows as well?

Felipe_Ribeir0

Hi @caltang

Try this new version. The rows with a MatchScore matched at more than 70%; the rows with null values matched at less.

fuzzy (1).yxmd

caltang

Hi @Felipe_Ribeir0 !

Thanks for the prompt response. However, that's still only 19 records, 6 short.

Side question: Is it possible to fuzzy match a fuzzy matched group? Will that be process intensive / useless?

Thanks!

Felipe_Ribeir0

Oh, it's because of the first Unique tool. Please remove it!

fuzzy (1).yxmd

caltang

Fantastic! Thanks @Felipe_Ribeir0

I'll experiment further with the 800k+ rows, but for now I think this is the best solution.

I hope for more people to chime in!

Felipe_Ribeir0

About your side question: I don't know, to be honest. I have never used this with more than 10,000 rows, so I don't know how to make the process more efficient. 800k rows seems like a lot for this. If many of the same descriptions recur every time a user runs the process, maybe you can process them once, store the result somewhere, and reuse it on later runs.
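
The store-and-reuse idea could look something like the Python sketch below. It is only an outline under assumptions: `split_known_and_new` is a made-up name, and `known_groups` stands for whatever lookup of cleaned description to group ID was saved from an earlier run. It removes exact duplicates first and sends only descriptions that have never been seen to the expensive fuzzy step.

```python
def split_known_and_new(cleaned_keys, known_groups):
    """Separate descriptions already grouped in an earlier run from new ones.

    cleaned_keys: cleaned descriptions for the current run (may repeat).
    known_groups: dict of cleaned description -> group ID from a stored run.
    """
    assigned = {}
    to_match = []
    # dict.fromkeys drops exact duplicates while keeping first-seen order.
    for key in dict.fromkeys(cleaned_keys):
        if key in known_groups:
            assigned[key] = known_groups[key]
        else:
            to_match.append(key)
    return assigned, to_match


assigned, to_match = split_known_and_new(
    ["APRON", "APRON", "ZIPPER", "GLOVE"],
    {"APRON": "G1"},
)
```

Only `to_match` would then go through fuzzy matching, and its results would be added to the stored lookup for next time.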
