Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.

looking for duplicates across multiple rows and values in multiple columns

stock
5 - Atom

I am in need of finding total duplicates in a CSV file where there is multiple criteria for what is considered a duplicate. 

This is what I need to check against using a CSV that has millions of records. 

 

 

 

IF
(!IsEmpty([FIRSTNAME]) AND [FIRSTNAME]=[FIRSTNAME] AND
    !IsEmpty([LASTNAME]) AND [LASTNAME]=[LASTNAME] AND
    !IsEmpty([BILLINGADDRESS1]) AND [BILLINGADDRESS1]=[BILLINGADDRESS1] AND
    !IsEmpty([BILLINGZIPCODE]) AND [BILLINGZIPCODE]=[BILLINGZIPCODE]
) 
OR
(!IsEmpty([FIRSTNAME]) AND [FIRSTNAME]=[FIRSTNAME] AND
    !IsEmpty([LASTNAME]) AND [LASTNAME]=[LASTNAME] AND
    !IsEmpty([BILLINGCITY]) AND [BILLINGCITY]=[BILLINGCITY] AND
    !IsEmpty([BILLINGZIPCODE]) AND [BILLINGZIPCODE]=[BILLINGZIPCODE]
) 
OR
(!IsEmpty([FIRSTNAME]) AND [FIRSTNAME]=[FIRSTNAME] AND
    !IsEmpty([LASTNAME]) AND [LASTNAME]=[LASTNAME] AND
    !IsEmpty([BILLINGZIPCODE]) AND [BILLINGZIPCODE]=[BILLINGZIPCODE]
)
OR
(!IsEmpty([FIRSTNAME]) AND [FIRSTNAME]=[FIRSTNAME] AND
    !IsEmpty([LASTNAME]) AND [LASTNAME]=[LASTNAME] AND
    !IsEmpty([PHONE]) AND [PHONE]=[PHONE]
)
OR
(!IsEmpty([FIRSTNAME]) AND [FIRSTNAME]=[FIRSTNAME] AND
    !IsEmpty([LASTNAME]) AND [LASTNAME]=[LASTNAME] AND
    !IsEmpty([EMAIL]) AND [EMAIL]=[EMAIL]
)
THEN 1  ELSE -1 ENDIF

 

 

 

Based on the above criteria the below would be True or False, if I did not miss anything in my checking. 

 

 

6 REPLIES 6
TheOC
15 - Aurora
15 - Aurora

hey @stock 

Sorry if im misunderstanding - but you may have some joy with the unique tool - this can be used to filter out duplicates. 
https://help.alteryx.com/designer-cloud/unique-tool

You can then join this data with your previous data, to filter out any duplicate results (or flag them in this case).

It may be easier to use than a long formula.

Hope this helps!
TheOC


Bulien
stock
5 - Atom

@theoc

 

Thanks for the comment. I have looked at and used the unique tool for one simple comparison, but in the situation I posted, there is the possibility of finding duplicates based on 5 different scenarios (data sets) where a duplicate can be identified, as I indicated with my code sample.

 

Maybe I am over-complicating it. How would you go about identifying duplicates based on the possibility of 5 different combinations? I don't see such an option to do this kind of comparison using the unique tool, and that is why I went with the formula tool. One hang up is that unique tool only allows me to select certain columns, but I don't see any way to check and ensure that the values are not blank. That is the first retirement - I do not want to compare if the value(s) are blank.

 

Being a newbie, I would love some help and explanation. I don't know the tool very well, so please don't assume I know what certain things do. I prefer that you state all that you are thinking and referring to so that I am not left questioning.

 

Thanks. 

gabrielvilella
14 - Magnetar

Hi @stock, you need one filter and one unique tool for each criteria. If you provide a sample dataset we will be able to provide a sample workflow. 

stock
5 - Atom

Here is the same sample data that I included in the screenshot. 

 

FIRSTNAMELASTNAMEBILLINGADDRESS1BILLINGZIPCODEBILLINGCITYPHONEEMAIL
JohnDean123 main Street29210anywhere5554444444john@email.com
JohnDean123 main Street29210anywhere5554444444john@email.com
JamesSmith444 west ave12345somewhere2223334444james@email.com
BobBird555 w 1000 s55548 6665552222bob@email.com
BobBird555 w 1000 s hometown6665552222bob@email.com
MHanks222 18th ave66589nowhere6665552222mrhanks@email.com
MaryHanks222 18th ave66589nowhere6665552222maryh@email.com
BaryJones510 laurel street55878outback5558887777bary@email.com
BaryJones 55878outback5558887777bary@email.com
DarrelWilliams555 another street87898lower fourty5558887777bary@email.com
DarlinePuts   3335468888darline@email.com
DarlinePuts   3335468888darline@email.com
HenryBarks245 west winding way54898none5789789988henry@email.com
RalphieFoo   6589658965 
RalphieFoo   6589658965 
Watermark
12 - Quasar
12 - Quasar

Even without a data set, looking at what you have above, I'd be more inclined to leverage at least a filter to start off with.  Everyone of the above have the commonality of: 

 

!IsEmpty([FIRSTNAME]) AND [FIRSTNAME]=[FIRSTNAME]

AND

!IsEmpty([LASTNAME]) AND [LASTNAME]=[LASTNAME] AND

 

Filter that T/F and if it doesn't meet the criteria then apply -1. 

 

From there your criteria simplifies greatly for your 4 others. Just drop the first part out if it doesn't meet the name conditions. 

 

Watermark
12 - Quasar
12 - Quasar

From there you can screen a piece at a time. Sometimes it's better to have a few tools and be clear at each step, instead of trying to impose a complex set of rules all wrapped together into 1. 

Labels