
Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.

Fuzzy Match + Group very messy dataset (800k+ rows) + UniqueID

caltang
17 - Castor

I've attached a sample dataset. 

 

My goal is to fuzzy match all the item descriptions, and then group them together. Thereafter, I intend to make a unique ID for each one.

 

I have a few problems:

  1. Fuzzy Match + Grouping
  2. Data Cleanliness 
  3. Unique ID

For fuzzy match + grouping, I find that out of the 800k rows, only 500k+ of them came out in the final output. The accuracy is also rather questionable, as I can see similar rows ending up in different groups. How can I account for all 800k+ rows?

 

With regards to data cleanliness, the input of 800k+ rows is not perfect. Some item descriptions are just special characters, some are just dates, and some are just the same words repeated within a row. I'm not sure how else to clean them besides removing the unwanted characters and keeping only uppercased letters. My initial thought was that long alphabetic strings would match well, because I can set the Fuzzy Match threshold to about 20-30% and use a custom setting that compares words (Best of Jaro & Levenshtein). Not sure if I'm on the right track.
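For what it's worth, the cleanse-then-compare idea can be sketched outside Designer to sanity-check a threshold on a sample. A rough Python sketch (the `normalize` and `similarity` names are mine, and this uses plain Levenshtein distance rather than Alteryx's Best-of-Jaro-and-Levenshtein option):

```python
import re

def normalize(desc: str) -> str:
    """Reduce a messy description to a comparable key: letters only,
    uppercased, with whole-phrase repetition collapsed
    (e.g. 'Example Given is Here' x3 -> 'EXAMPLE GIVEN IS HERE')."""
    words = re.findall(r"[A-Za-z]+", desc.upper())
    n = len(words)
    for period in range(1, n + 1):
        # smallest repeating unit: 'A B C A B C' has period 3
        if n % period == 0 and words[:period] * (n // period) == words:
            words = words[:period]
            break
    return " ".join(words)

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance, O(len(a) * len(b)) with two rows of memory."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """1.0 for identical strings, 0.0 for completely different ones."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))
```

Running a few hundred sampled pairs through `similarity` on their `normalize`d keys gives a feel for where a 20-30% threshold would draw the line before committing 800k rows to it.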

 

Lastly, I tried to generate a Unique ID for each group so that, when I join the final fuzzy match + grouping output back to the original dataset by Record ID, I can see the original item description with a group column next to it. So far I have only used Formula + Tile to create it, and I have also tried UuidCreate(). I need something static that will not change after each run. It has to be unique to each group, and not manually created like with my Formula + Tile approach.
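One way to get a static ID is to hash the group's key instead of generating a UUID: a hash is a pure function of its input, so the same key yields the same ID on every run. A hedged Python sketch of the idea (`group_id` is a made-up name; in Designer the same thing should be doable in a Formula tool with one of its MD5 functions, if memory serves):

```python
import hashlib

def group_id(group_key: str) -> str:
    """Deterministic alphanumeric ID: the same group key always hashes
    to the same value, so the ID survives re-runs (unlike UuidCreate(),
    which produces a fresh value every time).  Trimming and uppercasing
    first keeps cosmetic variants of the key on the same ID."""
    canonical = group_key.strip().upper()
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()[:12]
```

The trade-off is that the ID changes if the group's key string changes, so it should be computed from whatever canonical field the grouping is based on (here, the Description-derived key), not from row order.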

 

My expected output is something like this:

 

Record ID | Item Description | Unique ID
----------|------------------|------------
1         | ApronCaste       | 21321321313
2         | ApronLARGE       | 21321321313
3         | APRONAPRONAPRON  | 21321321313

Calvin Tang
Alteryx ACE
https://www.linkedin.com/in/calvintangkw/
11 Replies
Felipe_Ribeir0
16 - Nebula

Hi @caltang 

 

See if the attached workflow works for you. At least with your sample input, the groups of unique keys seem to make sense, and they will be static because they are based on the Description field.

 

[image: Felipe_Ribeir0_2-1673106010473.png]

 

caltang
17 - Castor

Hi @Felipe_Ribeir0 

 

With the sample I gave you, it worked fine. However, with a larger dataset of about 1,000 rows (same types of data), the Fuzzy Match gives mixed results, and the final output has duplicates, which is not what is required at this point in time.

 

I tried it with an even larger dataset of about 400k rows, and the workflow just stalled at 50% (on the Fuzzy Match tool).

 

Perhaps I need to be clearer in my requirements... let me try again:

  1. I have a dataset (800k rows) that contains 1 column (Descriptions), and they are very messy. Some have dates only, some are special characters only, some are alphanumeric, some repeat the same words within the same row (very long repeats like: Example Given is Here Example Given is Here Example Given is Here... x15), and the word lengths are not the same. 
  2. I am trying to assign a unique ID (alphanumeric, standard, fixed each run) to each of the rows based on the groups they belong to. 
  3. Grouping in this case is easy if we match by eye, but doing that over 800k rows is too time-consuming. That's why Fuzzy Match + Group was used. (Side note: are there any ML tools besides these two that can cleanse + group them?)
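For reference, turning pairwise fuzzy matches into groups is a transitive-closure problem (A~B and B~C should land in one group), which is exactly what a union-find structure computes. It also shows why unmatched rows need not vanish from the output: they simply become singleton groups. A rough Python sketch (`make_groups` is a hypothetical helper operating on record indices, with match pairs as input):

```python
def make_groups(n: int, pairs: list[tuple[int, int]]) -> list[int]:
    """Union-find over fuzzy-match pairs: A~B and B~C end up in one
    group, and records that matched nothing keep a singleton group,
    so all n records appear in the output (no rows silently dropped)."""
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees flat
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    # one representative index per record = its group label
    return [find(i) for i in range(n)]
```

The returned label per record can then be joined back to the original dataset by Record ID, with singleton labels covering the rows Fuzzy Match never paired.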

What I tried was:

  1. Cleanse the data by removing numbers, punctuations, whitespace, tabs etc using Data Cleanse. 
  2. Filter out the blanks. 
  3. Fuzzy Match that output and then group them. 

The results:

  1. The accuracy of the grouping is questionable: I cross-checked by eye, and some groups were assigned wrongly. 
  2. Not all records were returned (I left the option unchecked in Fuzzy Match). So, out of 800k, only 500k showed up in the final output. 

This result assumes the fuzzy match loads and finishes (about 3 mins each run). Sometimes, with more changes, the process gets stuck at 50%. 

 

Not sure where to go from here.

 

@Felipe_Ribeir0 your help is much appreciated, by the way.

 

P.S.: Can I suggest that users clean up the dataset first? User inputs are messy, but I'm sure there are other columns that can be used for the job.

Calvin Tang
Alteryx ACE
https://www.linkedin.com/in/calvintangkw/
Felipe_Ribeir0
16 - Nebula

Hi @caltang 

 

I believe it's going to be hard to work through your requirements without seeing/having the full dataset, or at least a big enough sample of it. The duplicates can be removed with a Unique tool, right?

 

[image: Felipe_Ribeir0_0-1673180091338.png]

 

 

Anyway, let's see if someone else can help as well.

caltang
17 - Castor

Hi @Felipe_Ribeir0

 

Really sorry that I cannot provide the full dataset due to privacy. The snippet I shared with you is also made up, but it follows the style of the real dataset. 

 

Regarding the Unique tool, that depends: unique based on Record ID and Description, right? 

Calvin Tang
Alteryx ACE
https://www.linkedin.com/in/calvintangkw/
caltang
17 - Castor

In addition, the input had 25 rows, but the output shown in your image has 9 rows. 

 

16 rows were not grouped and assigned a Unique ID... 

 

Is there a way to assign all 25 rows using Fuzzy Match? By extension to the 800k+ rows as well?

Calvin Tang
Alteryx ACE
https://www.linkedin.com/in/calvintangkw/
Felipe_Ribeir0
16 - Nebula

Hi @caltang 

 

Try this new version. The rows with a MatchScore matched at more than 70%; the rows with null values matched below that.

 

[image: Felipe_Ribeir0_1-1673180116801.png]

 

[image: Felipe_Ribeir0_2-1673180142252.png]

 

caltang
17 - Castor

Hi @Felipe_Ribeir0 !

 

Thanks for the prompt response. However, that's still 19 records, 6 short. 

 

Side question: is it possible to fuzzy match a fuzzy-matched group? Would that be process-intensive / useless? 
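One way such a second pass could stay cheap, at least in principle, is to fuzzy match one representative per group instead of every row, and then merge the groups whose representatives match. A rough sketch of the representative-picking step (`representatives` is a hypothetical helper; shortest member chosen as the heuristic):

```python
def representatives(groups: dict[int, list[str]]) -> dict[int, str]:
    """Pick one representative description per group (here: the shortest
    member), so a second fuzzy-match pass compares G representatives
    instead of re-comparing all N rows."""
    return {gid: min(members, key=len) for gid, members in groups.items()}
```

With ~800k rows collapsed into far fewer groups, matching representatives against each other should be a much smaller job than the first pass, though whether it helps accuracy would need testing.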

 

 

Thanks!

Calvin Tang
Alteryx ACE
https://www.linkedin.com/in/calvintangkw/
Felipe_Ribeir0
16 - Nebula

Ohh, it's because of the first Unique tool. Please remove it!

 

[image: Felipe_Ribeir0_0-1673180360652.png]

 

caltang
17 - Castor

Fantastic! Thanks @Felipe_Ribeir0 

 

I'll experiment further with the 800k+ rows, but for now I think this is the best solution. 

 

I hope for more people to chime in! 

Calvin Tang
Alteryx ACE
https://www.linkedin.com/in/calvintangkw/