
Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.

Fuzzy Match + Group very messy dataset (800k+ rows) + UniqueID

caltang
17 - Castor

I've attached a sample dataset. 

 

My goal is to fuzzy match all the item descriptions, and then group them together. Thereafter, I intend to make a unique ID for each one.

 

I have a few problems:

  1. Fuzzy Match + Grouping
  2. Data Cleanliness 
  3. Unique ID

For fuzzy match + grouping, I find that out of the 800k rows, only 500k+ of them came out in the final output. The accuracy is also rather questionable, as I can see similar rows ending up in different groups. How can I account for all 800k+ rows?

 

With regards to data cleanliness, the input of 800k+ rows is not perfect. Some item descriptions are just special characters, some are just dates, and some are just the same words repeated within a row. I'm not sure how else to clean them besides removing the unwanted characters and keeping only uppercased letters. My initial thought was that long alphabetic strings would match well, because I can set the Fuzzy Match threshold to about 20-30% and use a custom setting that compares words (Best of Jaro & Levenshtein). Not sure if I'm on the right track.
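For what it's worth, the cleanse-then-compare idea can be sketched outside Designer to sanity-check a threshold on a sample. A rough Python sketch (the `normalize` and `similarity` names are mine, and this uses plain Levenshtein distance rather than Alteryx's Best-of-Jaro-and-Levenshtein option):

```python
import re

def normalize(desc: str) -> str:
    """Reduce a messy description to a comparable key: letters only,
    uppercased, with whole-phrase repetition collapsed
    (e.g. 'Example Given is Here' x3 -> 'EXAMPLE GIVEN IS HERE')."""
    words = re.findall(r"[A-Za-z]+", desc.upper())
    n = len(words)
    for period in range(1, n + 1):
        # smallest repeating unit: 'A B C A B C' has period 3
        if n % period == 0 and words[:period] * (n // period) == words:
            words = words[:period]
            break
    return " ".join(words)

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance, O(len(a) * len(b)) with two rows of memory."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """1.0 for identical strings, 0.0 for completely different ones."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))
```

Running a few hundred sampled pairs through `similarity` on their `normalize`d keys gives a feel for where a 20-30% threshold would draw the line before committing 800k rows to it.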

 

Lastly, I tried to generate a Unique ID for each group so that, when I join the final fuzzy match + grouping output back to the original dataset by Record ID, I can see the original item description with a group column next to it. So far I have only used Formula + Tile to create it, and I have also tried UuidCreate(). I need something static that will not change after each run. It has to be unique to each group, and not manually created like with my Formula + Tile approach.
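One way to get a static ID is to hash the group's key instead of generating a UUID: a hash is a pure function of its input, so the same key yields the same ID on every run. A hedged Python sketch of the idea (`group_id` is a made-up name; in Designer the same thing should be doable in a Formula tool with one of its MD5 functions, if memory serves):

```python
import hashlib

def group_id(group_key: str) -> str:
    """Deterministic alphanumeric ID: the same group key always hashes
    to the same value, so the ID survives re-runs (unlike UuidCreate(),
    which produces a fresh value every time).  Trimming and uppercasing
    first keeps cosmetic variants of the key on the same ID."""
    canonical = group_key.strip().upper()
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()[:12]
```

The trade-off is that the ID changes if the group's key string changes, so it should be computed from whatever canonical field the grouping is based on (here, the Description-derived key), not from row order.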

 

My expected output is something like this:

 

Record ID | Item Description | Unique ID
----------|------------------|------------
1         | ApronCaste       | 21321321313
2         | ApronLARGE       | 21321321313
3         | APRONAPRONAPRON  | 21321321313

Calvin Tang
Alteryx ACE
https://www.linkedin.com/in/calvintangkw/
11 Replies
Felipe_Ribeir0
16 - Nebula

Hi @caltang 

 

See if the attached workflow works for you. At least with your sample input, the groups of unique keys seem to make sense, and they will be static because they are based on the Description field.

 

[image: Felipe_Ribeir0_2-1673106010473.png]

 

caltang
17 - Castor

Hi @Felipe_Ribeir0 

 

With the sample I gave you, it worked fine. However, with a larger dataset of about 1,000 rows (same types of data), the Fuzzy Match gives mixed results, and the final output has duplicates, which is not what is required at this point in time.

 

I tried it with an even larger dataset of about 400k rows, and the workflow just stalled at 50% (on the Fuzzy Match tool).

 

Perhaps I need to be clearer in my requirements... let me try again:

  1. I have a dataset (800k rows) that contains 1 column (Descriptions), and they are very messy. Some have dates only, some are special characters only, some are alphanumeric, some repeat the same words within the same row (very long repeats like: Example Given is Here Example Given is Here Example Given is Here... x15), and the word lengths are not the same. 
  2. I am trying to assign a unique ID (alphanumeric, standard, fixed each run) to each of the rows based on the groups they belong to. 
  3. Grouping in this case is easy if we match by eye, but doing that over 800k rows is too time-consuming. That's why Fuzzy Match + Group was used. (Side note: are there any ML tools besides these two that can cleanse + group them?)
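For reference, turning pairwise fuzzy matches into groups is a transitive-closure problem (A~B and B~C should land in one group), which is exactly what a union-find structure computes. It also shows why unmatched rows need not vanish from the output: they simply become singleton groups. A rough Python sketch (`make_groups` is a hypothetical helper operating on record indices, with match pairs as input):

```python
def make_groups(n: int, pairs: list[tuple[int, int]]) -> list[int]:
    """Union-find over fuzzy-match pairs: A~B and B~C end up in one
    group, and records that matched nothing keep a singleton group,
    so all n records appear in the output (no rows silently dropped)."""
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees flat
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    # one representative index per record = its group label
    return [find(i) for i in range(n)]
```

The returned label per record can then be joined back to the original dataset by Record ID, with singleton labels covering the rows Fuzzy Match never paired.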

What I tried was:

  1. Cleanse the data by removing numbers, punctuations, whitespace, tabs etc using Data Cleanse. 
  2. Filter out the blanks. 
  3. Fuzzy Match that output and then group them. 

The results:

  1. The accuracy of the grouping is questionable: I cross-checked by eye, and some groups were assigned wrongly. 
  2. Not all records were returned (I left the option unchecked in Fuzzy Match). So, out of 800k, only 500k showed up in the final output. 

This result assumes the fuzzy match loads and finishes (about 3 mins each run). Sometimes, with more changes, the process gets stuck at 50%. 

 

Not sure where to go from here.

 

@Felipe_Ribeir0 your help is much appreciated, by the way.

 

P.S.: Can I suggest that users clean up the dataset first? User inputs are messy, but I'm sure there are other columns that can be used for the job.

Calvin Tang
Alteryx ACE
https://www.linkedin.com/in/calvintangkw/
Felipe_Ribeir0
16 - Nebula

Hi @caltang 

 

I believe it's going to be hard to work through your requirements without seeing/having the full dataset, or at least a big enough sample of it. The duplicates can be removed with a Unique tool, right?

 

[image: Felipe_Ribeir0_0-1673180091338.png]

 

 

Anyway, let's see if someone else can help as well.

caltang
17 - Castor

Hi @Felipe_Ribeir0

 

Really sorry that I cannot provide the full dataset due to privacy. The snippet I shared with you is also made up, but it follows the style of the real dataset. 

 

Regarding the Unique tool, that depends: unique based on Record ID and Description, right? 

Calvin Tang
Alteryx ACE
https://www.linkedin.com/in/calvintangkw/
caltang
17 - Castor

In addition, the input had 25 rows, but the output shown in your image has 9 rows. 

 

16 rows were not grouped and assigned a Unique ID... 

 

Is there a way to assign all 25 rows using Fuzzy Match? By extension to the 800k+ rows as well?

Calvin Tang
Alteryx ACE
https://www.linkedin.com/in/calvintangkw/
Felipe_Ribeir0
16 - Nebula

Hi @caltang 

 

Try this new version. The rows with a MatchScore matched at more than 70%; the rows with null values matched below that.

 

[image: Felipe_Ribeir0_1-1673180116801.png]

 

[image: Felipe_Ribeir0_2-1673180142252.png]

 

caltang
17 - Castor

Hi @Felipe_Ribeir0 !

 

Thanks for the prompt response. However, that's still 19 records, 6 short. 

 

Side question: is it possible to fuzzy match a fuzzy-matched group? Would that be process-intensive / useless? 
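One way such a second pass could stay cheap, at least in principle, is to fuzzy match one representative per group instead of every row, and then merge the groups whose representatives match. A rough sketch of the representative-picking step (`representatives` is a hypothetical helper; shortest member chosen as the heuristic):

```python
def representatives(groups: dict[int, list[str]]) -> dict[int, str]:
    """Pick one representative description per group (here: the shortest
    member), so a second fuzzy-match pass compares G representatives
    instead of re-comparing all N rows."""
    return {gid: min(members, key=len) for gid, members in groups.items()}
```

With ~800k rows collapsed into far fewer groups, matching representatives against each other should be a much smaller job than the first pass, though whether it helps accuracy would need testing.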

 

 

Thanks!

Calvin Tang
Alteryx ACE
https://www.linkedin.com/in/calvintangkw/
Felipe_Ribeir0
16 - Nebula

Ohh, it's because of the first Unique tool. Please remove it!

 

[image: Felipe_Ribeir0_0-1673180360652.png]

 

caltang
17 - Castor

Fantastic! Thanks @Felipe_Ribeir0 

 

I'll experiment further with the 800k+ rows, but for now I think this is the best solution. 

 

I hope for more people to chime in! 

Calvin Tang
Alteryx ACE
https://www.linkedin.com/in/calvintangkw/