I'm trying to do a fuzzy match on a rather large dataset of more that two million rows.
Here's a small example of the kind of data that I'm using. My data has a similar structure to what's below.
BRAND MODEL DESC
SONY PS Play Station Game Console
NINTENDO Wii Green Wii Sport Edition
MICROSOFT XB XBOX 250G hard drive
NINTENDO Wii12345 Wii model onetwothreefourfive
MICROSOFT XBONE XBOX One
MICROSOFT XBOXONE XBOX one Dark Souls Edition
SONY PS5 PlayStation 5 Special
...(And two million more rows)
As I noted, my true data has over two million rows.
What I want to do is to be able to group the rows by their BRAND value (in the example there are 3 distinct values. In my data there are over 400) and then run a fuzzy match inside that smaller set.
In the example above there exist 2 SONY rows, 2 NINTENDO rows, and 3 MICROSOFT rows.
When I run a fuzzy match on MODEL I want to be able to compare any particular SONY row only to other SONY rows,a NINTENDO row only to other NINTENDO rows, und so weiter.
This division of the search is important because fuzzy matching two million rows to each other would take a lot of time.
Note that using a bunch of filter tools isn't feasible because of the number of categories. Also, I considered using a Python tool and writing my own code to do this, but the python module will only output at most five datasets. That's a far cry from the number I need.
Is there any reasonable and/or simple way to do this?
hi @TH
You can go with Batch Macro and control parameter would be "BRAND". If you are not familiar with Batch macro, kindly provide some dataset in excel or csv so I'll prepare sample WF.
I thought I knew how batch macros worked. Looks like I may not have been as knowledgeable as I thought I was.
As you might expect my real data is sensitive.
I mocked up some data in Excel that should serve to illustrate the kind of thing I'm looking at.
I gave it five different BRAND values and a bunch of plausible MODEL values.
I look forward to your idea.
Please find the attached WF for sample of Batch macro. As the given dataset contains totally distinct and unsimilar values in field "MODEL", no records are match to each other, so please configure it by yourself to fit to your case. (For this reason, I don't output Fuzzy Match result but instead output Count Records as just an example of Batch macro)
Good luck!