Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.

Fuzzy Match with a large dataset

TH
8 - Asteroid

I'm trying to do a fuzzy match on a rather large dataset of more that two million rows.

 

Here's a small example of the kind of data that I'm using. My data has a similar structure to what's below.

 

BRAND              MODEL      DESC

SONY                PS              Play Station Game Console

NINTENDO        Wii              Green Wii Sport Edition

MICROSOFT     XB              XBOX 250G hard drive

NINTENDO       Wii12345     Wii model onetwothreefourfive

MICROSOFT    XBONE       XBOX One

MICROSOFT    XBOXONE  XBOX one Dark Souls Edition

SONY               PS5            PlayStation 5 Special

...(And two million more rows)

 

As I noted, my true data has over two million rows.

What I want to do is to be able to group the rows by their BRAND value (in the example there are 3 distinct values. In my data there are over 400) and then run a fuzzy match inside that smaller set.

In the example above there exist 2 SONY rows, 2 NINTENDO rows, and 3 MICROSOFT rows.

When I run a fuzzy match on MODEL I want to be able to compare any particular SONY row only to other SONY rows,a NINTENDO row only to other NINTENDO rows, und so weiter.

This division of the search is important because fuzzy matching two million rows to each other would take a lot of time.

 

Note that using a bunch of filter tools isn't feasible because of the number of categories. Also, I considered using a Python tool and writing my own code to do this, but the python module will only output at most five datasets. That's a far cry from the number I need.

 

Is there any reasonable and/or simple way to do this?

3 REPLIES 3
gawa
15 - Aurora
15 - Aurora

hi @TH 

 

You can go with Batch Macro and control parameter would be "BRAND".  If you are not familiar with Batch macro, kindly provide some dataset in excel or csv so I'll prepare sample WF.

TH
8 - Asteroid

I thought I knew how batch macros worked. Looks like I may not have been as knowledgeable as I thought I was.

As you might expect my real data is sensitive.

I mocked up some data in Excel that should serve to illustrate the kind of thing I'm looking at.

I gave it five different BRAND values and a bunch of plausible MODEL values.

 

I look forward to your idea.

 

gawa
15 - Aurora
15 - Aurora

@TH 

 

Please find the attached WF for sample of Batch macro. As the given dataset contains totally distinct and unsimilar values in field "MODEL", no records are match to each other, so please configure it by yourself to fit to your case. (For this reason, I don't output Fuzzy Match result but instead output Count Records as just an example of Batch macro)

 

Good luck!

image.png

Labels