community
cancel
Showing results for 
Search instead for 
Did you mean: 

Alteryx designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.
#SANTALYTICS

The highly anticipated Alteryx Community tradition is back! We hope you'll join us!

Learn More
We will be upgrading the Gallery to our 2019.4 release this Saturday, December 7th beginning at 9:00am MT. We expect the outage to take last approx. 2.5 hours.

How to group UPCs based on Similarity and evaluate Anomalies

Meteor

Greetings,

 

In an attempt to cleanup our Brand information using UPC  info, I am interested in identifying duplicate brands that may be represented/spelled differently.  

 

This is basically what I would be interested in producing:

clipboard_image_0.png

 

Rather than use a similarity score on the Brand name itself, we are interested in grouping UPCs based on how similar they are (perhaps looking at the first 11 digits) and then evaluating the brands for that UPC group to determine if it should be reviewed or not to match the most common Brand found for that UPC group. For example, in the sample data I provided, we would want to correct Bee's Knees Corporation to the already existing brand Bee's Knees.

 

In the past I've been able to successfully group based on string similarity, but it's seems challenging to group actual numbers (in this case the first 11 digits of that number). Any thoughts on how to tackle this?

 

Thank you for any guidance!

 

User friendly version:

 

UPCUPC GroupManufacturerBrandAction
00000000002000000000000200Bee's KneesBee's Knees 
00000000002010000000000200Bee's KneesBee's Knees 
00000000002020000000000200Bee's KneesBee's Knees 
00000000002030000000000200Bee's KneesBee's Knees CorporationREVIEW
00000000002100000000000200Bee's KneesBee's Knees 
00000000005000000000000500MMMMMM 
00000000005100000000000500MMMMMM 
00000000005200000000000500MMMMMM 
00000000005400000000000500MMMMMM Inc.REVIEW
00000000005600000000000500MMMMulti Medium Media 
00000000012000000000001200Peppy ColaPeppy Cola 
00000000012100000000001200Peppy ColaPeppy Cola 
00000000012200000000001200Peppy ColaPeppy Cola 
00000000012300000000001200Peppy ColaPepsiREVIEW
00000044444440000004444444Rhonda'sRhonda's 
00000044444550000004444444Rhonda'sRhonda's 
00000777788800000077778880Killer DaveKiller Dave 
00000777788850000077778880Killer DaveKiller Dave 
00009999999900000999999990SamsonSamson 
00009999999970000999999990SamsonSamson 
Alteryx Certified Partner

Does something like this work for you?

 

clipboard_image_0.png

 

Here, both MMM Inc. and Multi Medium Media are flagged for review because they are both duplicates (with fewer occurrences) within the group 0000...1200

Meteor

Hi,

 

Thank you for your response. I forgot to mention that UPC Group would need to be derived somehow. How would I create the logic for grouping the UPCs? UPC Group would have to be created, as it is not an existing field

 

 

Thank you.

Quasar

A formula tool would allow you to do this with the formula below

 

 

padright(left([UPC],11),13,"0")

 

 

This removes the left 11 characters from UPC, and then pads the string with zeroes until you get to 13 digits long. 

Quasar

Hi @anthony123 ,

The  fuzzy match  on  brand provides same results .  

Hope this helps .

clipboard_image_0.png

Labels