We're excited to announce that we'll be partnering with Credly starting October 19th - see what this means and read the announcement blog here!

Alteryx Designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.

Fuzzy Match, Make Group

YLYONG
8 - Asteroid

Hi Alteryx Community,

 

I attempted using Fuzzy Match to extrapolate the nulls using information from the two tables however, it doesn't seem to produce the output that I was intending. 

 

Can't work out what's the missing piece.

 

I have chosen "Words: Best of Jaor & Levenshtein Distance"  for matching Item Description only, and 80% match threshold.

 

Thank you

 

YLYONG_0-1633009226059.png

 

 

ITEMITEM DESCRIPTIONPKSZIN_HGTIN_LENIN_WIDIN_WGTIN_VOLSOURCE
9310640000973GATORADE ARCTIC BLITZ 600ML[Null][Null][Null][Null][Null][Null]BBC
9310640000201GATORADE BLUE BOLT 1L[Null][Null][Null][Null][Null][Null]BBC
9310640620027GATORADE BLUE BOLT 600ML[Null][Null][Null][Null][Null][Null]BBC
9310640620119GATORADE FIERCE GRAPE 600ML12242922815,023BBC
9310640000706GATORADE FIERCE STRAWBERRY 600ML[Null][Null][Null][Null][Null][Null]BBC
9310640000102GATORADE LEMON LIME 1L6242919712,779BBC
9310640000195GATORADE ORANGE 1L6242919712,779BBC
9310640620010GATORADE ORANGE ICE 600ML[Null][Null][Null][Null][Null][Null]BBC
         
ITEMITEM DESCRIPTIONPKSZIN_HGTIN_LENIN_WIDIN_WGTIN_VOLSOURCE
2484928Gatorade sports drink 1l Lemon & lime6242919713,224TER
2484939Gatorade sports drink 1l Orange ice6242919713,224TER
2484940Gatorade sports drink 1l Blue bolt6242919713,224TER
7 REPLIES 7
YLYONG
8 - Asteroid

Found the issue, and completed the workflow. However, this workflow is too 'heavy' when deployed into actual data sets.

 

Need your help on suggestions or a better solution.  Thank you

 

YLYONG_0-1633066453206.png

 

YLYONG
8 - Asteroid

Hi Alteryx Community, wanting to know if anyone within this community have any ideas or suggestions??

Thank you

DawnDuong
12 - Quasar
12 - Quasar

hi @YLYONG 

Just to make sure I understand correctly, is the objective to fill the "null" values in the upper table with the lower table being the reference standard?

Dawn.

YLYONG
8 - Asteroid

@DawnDuong 

thank you for your response. The objective is to fill the "null" values taking both tables as reference.

DawnDuong
12 - Quasar
12 - Quasar

hi @YLYONG 

Since it is fuzzy match optimisation, it's important to understand the "expected" threshold/outcome. So i have a couple of clarifications to ensure correct understanding:

  1. What is the "expected" output of the sample data?
  2. Can you give an idea of how "heavy" it is and the size of the data that you are processing?
  3. On first look, you have 2 fuzzy match tools and 3 join tools and it seems strange that you are using the J output of the first Join Tool to do further fuzzy match. What's the objective that you try to achieve here?

 

Dawn.

YLYONG
8 - Asteroid

@DawnDuong 

 

1. The expected output of the sample data is

 

GATORADE ARCTIC BLITZ 600ML, GATORADE BLUE BOLT 600ML, GATORADE FIERCE STRAWBERRY 600ML, GATORADE ORANGE ICE 600ML to have the same item attributes as GATORADE FIERCE GRAPE 600ML

 

GATORADE BLUE BOLT 1L to have the same item attributes as GATORADE ORANGE 1L

 

2. The actual data size has 21,500 unique items including 1,000 "NULL" items.

3. The first fuzzy match wasn't able to extrapolate all the "NULL". I have to do a second round of fuzzy-matching to fill in all the "NULL"😐  I was asking for help if there's a better way to do this?

DawnDuong
12 - Quasar
12 - Quasar

hi @YLYONG 

Thanks for clarifying. I understand it better now.

In my opinion, probably Fuzzy Match is not the best choice for this use case. Fuzzy Match works best for cases where you have similar pronunciations (e.g. Stuart vs. Steward), known abbreviations (e.g. Co. vs. company) and other know patterns (e.g. mobile phone). 

For "GATORADE ARCTIC BLITZ 600ML" to be matched to "GATORADE FIERCE GRAPE 600ML" would require lowering the threshold a lot... because "Artic Blitz" and "Fierce Grape" sound very different.

Just from the 2 examples given above, it seems the real "keys" in classifications are:

1) Brand (e.g. Gatorade)

2) Size (e.g. 600ml vs. 1L)

And that the attributes columns pertain to the physical characteristics and packaging of different SKU's. (I'm making a guess here on what your use case is, let me know if i'm completely off...)

It may be more efficient for you to use a modified approach by having a) the list of all possible brands that you carry and b) list of all possible sizes that you carry and then combine it with fuzzy matching using "Merge" mode (or any other non-fuzzy matching approach).

(i'm guessing here of course -- In most FMCG retail you should have a readily available database of brands and sizes that you carry).

Hope this helps somewhat.

Dawn.

Labels