I have a large dataset of over 10M rows. The source data contains Product ID, Product Description and Brand, and I also have a 'clean' Product ID lookup table to map to.
The issue is that the source Product ID is often unreliable: it can be missing entirely (sometimes it's buried in the Product Description instead), or the Product ID field can contain extra text alongside the ID. Also, I only want to map a subset of the product data (e.g. particular items with perhaps only 200 unique lookup values), BUT these may be similar to other product types, in which case fuzzy matching may throw up unwanted matches.
So... can anyone suggest a suitable hierarchy/workflow for this?
I've thought of a possible workflow:
1. Match directly on Product ID in the first instance.
2. For these matches, get the unique Brands and use them to create a subset of the records that didn't match in Step 1 (this reduces the likelihood of trying to match on Product IDs that merely resemble those in the lookup).
3. Take this subset and carry out a Find & Replace on the Product ID.
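The steps above can be sketched in pandas; the table contents, column names and the clean-up regex here are all hypothetical stand-ins for the real data:

```python
import pandas as pd

# Hypothetical sample data standing in for the real source and lookup tables.
source = pd.DataFrame({
    "product_id": ["A100", "a100x", "B200", "junk-C300", "A100"],
    "description": ["Widget A100", "Widget A100 blue", "Gadget B200", "Gizmo C300", "Widget A100"],
    "brand": ["Acme", "Acme", "Bolt", "Acme", "Acme"],
})
lookup = pd.DataFrame({"clean_id": ["A100", "B200", "C300"]})

# Step 1: exact match on Product ID.
exact_hit = source["product_id"].isin(lookup["clean_id"])
matched = source[exact_hit]

# Step 2: use the Brands seen in exact matches to subset the unmatched rows,
# reducing the risk of later matching against unrelated product types.
brands = set(matched["brand"])
candidates = source[~exact_hit & source["brand"].isin(brands)].copy()

# Step 3: a simple Find & Replace / clean-up on the Product ID before retrying.
# The pattern "letter + three digits" is an assumed ID shape for illustration.
candidates["product_id_clean"] = (
    candidates["product_id"].str.upper().str.extract(r"([A-Z]\d{3})", expand=False)
)
rescued = candidates[candidates["product_id_clean"].isin(lookup["clean_id"])]
print(len(matched), len(rescued))
```

Keeping Steps 1–3 as cheap vectorised operations means the expensive fuzzy step only ever sees the small brand-filtered remainder.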
Would a Fuzzy Match on the Product ID be the next step? With hundreds of thousands of records to compare, could this be time-consuming or even crash the PC/server?
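On the cost question: fuzzy matching is roughly O(rows × lookup size), so it is the brand-filtered subset, not the full 10M rows, that should ever reach this step; with ~200 lookup values that stays tractable. A minimal sketch using the standard library's `difflib` (the IDs and cutoff are illustrative assumptions):

```python
from difflib import get_close_matches

# Hypothetical small lookup (~200 unique clean IDs in the real case).
lookup_ids = ["A100", "A200", "B200", "C300"]

def fuzzy_map(raw_id, cutoff=0.8):
    """Return the closest lookup ID scoring above `cutoff`, else None.

    A high cutoff guards against the unwanted matches mentioned above,
    where similar product types could otherwise map to the wrong ID.
    """
    hits = get_close_matches(raw_id.upper(), lookup_ids, n=1, cutoff=cutoff)
    return hits[0] if hits else None

print(fuzzy_map("a100x"))  # close enough to A100 to map
print(fuzzy_map("zzzz"))   # nothing acceptable -> None
```

Dedicated libraries (e.g. rapidfuzz) are much faster than `difflib` at scale, but the structural point is the same: shrink the candidate set first, then fuzzy-match, and tune the cutoff rather than accepting the single nearest neighbour.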