Identifying sales for a specific person using Fuzzy Match (or alternatives)?
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Hi folks,
I have a situation I'm trying to resolve using the Fuzzy Match tool, but I just can't get it to work. I'm hoping someone can help me with my Fuzzy Match configuration, or maybe an alternative to using the tool.
I have two datasets, one has the sales for all of the sales people in my company. The other has the sales for a specific person in my company, let's call them Sam.
Goal: I want to run some summary stats on the sales data for everyone in the company, but I want to remove the records pertaining to Sam using the data set I have for their sales.
Fields Available:
In both data sets, I have the following fields:
* Sales Date/Time - The date and time when a sale occurred
* Item Category - the general category that an item sold belongs to. I have two possibilities, A and B
* SKU - the unique ID for items that were sold on a given day. More than one item could have been sold on a given day by a given salesperson, so this is NOT an Order ID.
* Price - The price that an individual item was sold for
* Quantity - the amount of an individual item that I sold
Notes:
* In my case, the time matters. The time of a sale should be the exact same in both data sets, but it could be within a minute or two.
* More than one item could have been sold on a given day by a salesperson, but I do NOT have an Order ID
* I also do not have a trader ID, so I am relying on the combination of date/time, SKU, Department, Price, and Quantity to find the records that most likely belong to Sam
* SKU ID needs to be exact, and Item Category needs to be exact. The fuzzy logic comes into play for the Date/Time, price, and probably quantity. Price could vary by a few cents, quantity could vary by a unit or two, and Date/Time could vary by a minute or two.
Example Data Set:
Dataset 1 - Sam's sales
SKU | Sales Date/Time | Item Category | Price | Quantity |
1234 | 2023-10-15 10:52:01 | A | 20.02 | 100 |
5678 | 2023-10-15 13:45:05 | A | 12.03 | 20 |
1234 | 2023-10-14 16:34:03 | B | 9.04 | 3 |
Dataset 2 - Sales for all employees
SKU | Sales Date/Time | Item Category | Price | Quantity |
1234 | 2023-10-15 11:05:23 | A | 15.03 | 100 |
1234 | 2023-10-15 10:53:01 | A | 20.02 | 100 |
9231 | 2023-10-13 13:23:33 | B | 9.04 | 3 |
5678 | 2023-10-15 13:45:05 | A | 12.03 | 20 |
5678 | 2023-10-15 13:45:05 | B | 12.03 | 20 |
1234 | 2023-10-14 16:34:03 | B | 9.02 | 3 |
In the example above, Sam's three trades would most likely be the three records shown in blue in the full employee dataset. They don't match exactly to Sam's sales data, but they are close enough that I could reasonably say they belong to Sam.
What I've tried so far
I've mostly focus on trying to configure the Fuzzy Match tool as a merge with a field that shows where each record came from, and a unique RecordID field. I've set the Match Style to "Exact" for SKU ID and "Item Category."
For the DateTime, price, and quantity fields, I've set a Custom Match style that generates keys for Digits Only and a Match Function that is Character (No Spaces): Best of Jaro and Levenshtein Distance. I've tried different match thresholds ranging from 20% - 80%.
What I'm getting
No matter what threshold I set, I keep getting an output that is only able to match ~100 of Sam's sales out of ~16,000. The Match Score for all fields always outputs "100," even though I've changed the threshold to different values.
Any ideas on what I'm doing wrong with the fuzzy match tool, or if I might be able to do this in a different way without Fuzzy Matching?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
@NaiLo Since you set the Match Style for SKU and Item Category, you can start by joining on those fields to remove anything that has a 100% match to those fields from the dataset. You can also theoretically solve this issue by creating parameters by which you can filter. In this solution, I created a price parameter of 99 cents and a test to see if the difference between Sam's price and the dataset price is <= the absolute value of 0.99. I also created a quantity parameter of 1 and a DateTime parameter of 1 and tests for each. You can change the parameters to what you feel is best for your dataset. This is by no means the only solution, but it'll get you started in that direction.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
@Prometheus This worked well for my use case, thank you so much! The only thing I changed was where you wrapped the Absolute Value function. In your workbook it's wrapped around the parameter rather than the difference calculation.
I'm a bit bummed I still haven't found a legitimate use case for the Fuzzy Match tool. One day!
