Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Joining name fields - Do I need a macro? or is there another way

EnglishmaninNY
8 - Asteroid

I'm trying to create a macro which will perform a join on person name fields. I'm struggling with one particular part, which is causing the workflow to be too slow.

 

To narrow in on the specific issue I've prepared a workflow that focuses on the relevant section. The right input has been limited to certain IDs, and the workflow processes in 10 seconds on my machine. If you remove the filter you'll see that it's simply too slow.

 

I decided that I needed to put the different name fields on separate rows, and the input datasets already reflect this. The datasets include Last Name, First Name, Middle Name, Prefix and Suffix. Some values are missing or blank (which I'm assuming means ambiguous, though I hope to provide the option to include them again later), some are initials, and some are full names.

 

This is not a fuzzy matching problem. I am assuming the strings are correct; they might just be missing or ambiguous (initials only).
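
To make that rule concrete, here is a rough Python sketch of what I mean by a match (illustrative only; the actual workflow uses Alteryx tools, and none of these names come from it):

def is_initial(value: str) -> bool:
    # A single letter, optionally followed by a period, counts as an initial.
    v = value.rstrip(".")
    return len(v) == 1 and v.isalpha()

def compatible(a: str, b: str) -> bool:
    # True when two name-field values could belong to the same person.
    a, b = a.strip().upper(), b.strip().upper()
    if not a or not b:
        return True   # missing/blank is ambiguous, not a definite non-match
    if a == b:
        return True   # exact match
    if is_initial(a):
        return b.startswith(a[0])   # 'J.' is compatible with 'JOHN'
    if is_initial(b):
        return a.startswith(b[0])
    return False      # two different full values: definitely not a match

So compatible("J.", "John") is True, while compatible("John", "James") is False and can be discarded.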

 

I know that matching only on the available fields produces some matches which are unlikely, and I do have a plan to filter those out later. For this part of the workflow I just need to discard what is definitely not a match.

 

This is probably the first of a series of related questions. Even if you don't have advice on this particular problem, feel free to reply; I'd be happy to mention you on future posts so you can follow the development.

 

UPDATE - Reposted the package without the rogue tools

4 REPLIES
Prometheus
12 - Quasar

@EnglishmaninNY I didn't see a question in your post, but I took a quick look at your workflow to see what you were talking about. The reason it's slow is that your join is producing a many-to-many relationship, effectively a Cartesian join. If you want it to run more quickly, you need to find unique values that the two data streams share so you can join on those. I filtered the top data stream so only RecordID 1499 passes, and the output of the Join tool is 9,773 records. Unless you're going to then group on a single field or pass the output through the Unique tool to get unique records, you're going to be creating thousands of near-duplicate records. What's the intended outcome?
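
To see the arithmetic, here is a quick pandas illustration of the blow-up (just the principle, not your workflow): every left row with a given key pairs with every right row sharing that key, so m rows against n rows yields m * n output rows per key.

import pandas as pd

left = pd.DataFrame({"key": ["SMITH"] * 100, "left_id": range(100)})
right = pd.DataFrame({"key": ["SMITH"] * 100, "right_id": range(100)})

joined = left.merge(right, on="key")
print(len(joined))  # 10000 rows from two 100-row inputs: the Cartesian blow-up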

EnglishmaninNY
8 - Asteroid

Yes, the question is how to make this faster.

 

The reason I need the first join is so that I know, between any two IDs, which name fields are available to match on. Admittedly I only need to know where there is at least one field that matches (by initial) the other ID, but in the workflow I only know that once I've done the next step.


I have tried it the other way around: doing the name or initial matching first, then figuring out which combinations didn't match for the IDs, but that has its own complexities.

patrick_digan
17 - Castor

@EnglishmaninNY See if the attached helps! I changed your join to include the initials, which vastly reduced the number of records, and I then had to rework the Total Count due to this change. It seemed to produce the same results as what you had, and when I let all of the data flow through it processed in about a minute. It still produced 60M records at one point, which was a bit steep.
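
Roughly, the change is equivalent to this Python sketch (the workflow itself uses Alteryx tools, including ReplaceChar() in a Formula tool; the sample data and column names here are made up):

import pandas as pd

# One row per (id, name field, value), mirroring the layout in the post.
left = pd.DataFrame({
    "left_id": [1, 1, 2],
    "field": ["First", "Last", "First"],
    "value": ["J.", "Smith", "Mary"],
})
right = pd.DataFrame({
    "right_id": [10, 10, 11],
    "field": ["First", "Last", "First"],
    "value": ["John", "Smith", "Martin"],
})

def add_initial(df):
    df = df.copy()
    df["initial"] = df["value"].str.strip().str.upper().str[0]
    return df

# Joining on (field, initial) instead of field alone means 'J.' only pairs
# with names starting with J. It is still a coarse filter (Mary/Martin pair
# up here and must be rejected downstream), but the fan-out shrinks sharply.
pairs = add_initial(left).merge(add_initial(right), on=["field", "initial"])
print(pairs[["left_id", "right_id", "field"]])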

EnglishmaninNY
8 - Asteroid

@patrick_digan Thank you. This certainly works. Actually, it's genius. I was unfamiliar with the ReplaceChar() function, thank you.

 

My next step is to add a score which indicates how unique the match is. The idea is to filter out matches where the data available to match on is simply too ambiguous (this would be a parameter in the final macro). For example, matches where we just have 'Smith' and 'A' are probably too ambiguous to call a good match. I need to do a little head scratching to see whether that's easy to do with your solution, but I'll do another post if I get stuck.
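
Something like this rough sketch is what I have in mind; the weights and threshold are placeholders I'd expose as macro parameters, not anything from the workflow:

def match_score(matched_fields):
    # matched_fields: list of (field, kind) pairs, kind is 'full' or 'initial'.
    weights = {"full": 2, "initial": 1}   # assumed weights, to be tuned
    return sum(weights[kind] for _, kind in matched_fields)

print(match_score([("Last", "full"), ("First", "initial")]))  # 3: 'Smith' + 'A'
print(match_score([("Last", "full"), ("First", "full"), ("Middle", "initial")]))  # 5

A threshold of, say, 4 would drop the ambiguous 'Smith' + 'A' pair while keeping the stronger one.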

 

Thanks again.
