Bring your best ideas to the AI Use Case Contest! Enter to win 40 hours of expert engineering support and bring your vision to life using the powerful combination of Alteryx + AI. Learn more now, or go straight to the submission form.
Start Free Trial

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.

looking for duplicates across multiple rows and values in multiple columns

stock
5 - Atom

I am in need of finding total duplicates in a CSV file where there is multiple criteria for what is considered a duplicate. 

This is what I need to check against using a CSV that has millions of records. 

 

 

 

IF
(!IsEmpty([FIRSTNAME]) AND [FIRSTNAME]=[FIRSTNAME] AND
    !IsEmpty([LASTNAME]) AND [LASTNAME]=[LASTNAME] AND
    !IsEmpty([BILLINGADDRESS1]) AND [BILLINGADDRESS1]=[BILLINGADDRESS1] AND
    !IsEmpty([BILLINGZIPCODE]) AND [BILLINGZIPCODE]=[BILLINGZIPCODE]
) 
OR
(!IsEmpty([FIRSTNAME]) AND [FIRSTNAME]=[FIRSTNAME] AND
    !IsEmpty([LASTNAME]) AND [LASTNAME]=[LASTNAME] AND
    !IsEmpty([BILLINGCITY]) AND [BILLINGCITY]=[BILLINGCITY] AND
    !IsEmpty([BILLINGZIPCODE]) AND [BILLINGZIPCODE]=[BILLINGZIPCODE]
) 
OR
(!IsEmpty([FIRSTNAME]) AND [FIRSTNAME]=[FIRSTNAME] AND
    !IsEmpty([LASTNAME]) AND [LASTNAME]=[LASTNAME] AND
    !IsEmpty([BILLINGZIPCODE]) AND [BILLINGZIPCODE]=[BILLINGZIPCODE]
)
OR
(!IsEmpty([FIRSTNAME]) AND [FIRSTNAME]=[FIRSTNAME] AND
    !IsEmpty([LASTNAME]) AND [LASTNAME]=[LASTNAME] AND
    !IsEmpty([PHONE]) AND [PHONE]=[PHONE]
)
OR
(!IsEmpty([FIRSTNAME]) AND [FIRSTNAME]=[FIRSTNAME] AND
    !IsEmpty([LASTNAME]) AND [LASTNAME]=[LASTNAME] AND
    !IsEmpty([EMAIL]) AND [EMAIL]=[EMAIL]
)
THEN 1  ELSE -1 ENDIF

 

 

 

Based on the above criteria the below would be True or False, if I did not miss anything in my checking. 

 

 

6 REPLIES 6
TheOC
16 - Nebula
16 - Nebula

hey @stock 

Sorry if im misunderstanding - but you may have some joy with the unique tool - this can be used to filter out duplicates. 
https://help.alteryx.com/designer-cloud/unique-tool

You can then join this data with your previous data, to filter out any duplicate results (or flag them in this case).

It may be easier to use than a long formula.

Hope this helps!
TheOC

Cheers,
TheOC
Connect with me:
LinkedIn Bulien
stock
5 - Atom

@theoc

 

Thanks for the comment. I have looked at and used the unique tool for one simple comparison, but in the situation I posted, there is the possibility of finding duplicates based on 5 different scenarios (data sets) where a duplicate can be identified, as I indicated with my code sample.

 

Maybe I am over-complicating it. How would you go about identifying duplicates based on the possibility of 5 different combinations? I don't see such an option to do this kind of comparison using the unique tool, and that is why I went with the formula tool. One hang up is that unique tool only allows me to select certain columns, but I don't see any way to check and ensure that the values are not blank. That is the first retirement - I do not want to compare if the value(s) are blank.

 

Being a newbie, I would love some help and explanation. I don't know the tool very well, so please don't assume I know what certain things do. I prefer that you state all that you are thinking and referring to so that I am not left questioning.

 

Thanks. 

gabrielvilella
14 - Magnetar

Hi @stock, you need one filter and one unique tool for each criteria. If you provide a sample dataset we will be able to provide a sample workflow. 

stock
5 - Atom

Here is the same sample data that I included in the screenshot. 

 

FIRSTNAMELASTNAMEBILLINGADDRESS1BILLINGZIPCODEBILLINGCITYPHONEEMAIL
JohnDean123 main Street29210anywhere5554444444john@email.com
JohnDean123 main Street29210anywhere5554444444john@email.com
JamesSmith444 west ave12345somewhere2223334444james@email.com
BobBird555 w 1000 s55548 6665552222bob@email.com
BobBird555 w 1000 s hometown6665552222bob@email.com
MHanks222 18th ave66589nowhere6665552222mrhanks@email.com
MaryHanks222 18th ave66589nowhere6665552222maryh@email.com
BaryJones510 laurel street55878outback5558887777bary@email.com
BaryJones 55878outback5558887777bary@email.com
DarrelWilliams555 another street87898lower fourty5558887777bary@email.com
DarlinePuts   3335468888darline@email.com
DarlinePuts   3335468888darline@email.com
HenryBarks245 west winding way54898none5789789988henry@email.com
RalphieFoo   6589658965 
RalphieFoo   6589658965 
Watermark
12 - Quasar
12 - Quasar

Even without a data set, looking at what you have above, I'd be more inclined to leverage at least a filter to start off with.  Everyone of the above have the commonality of: 

 

!IsEmpty([FIRSTNAME]) AND [FIRSTNAME]=[FIRSTNAME]

AND

!IsEmpty([LASTNAME]) AND [LASTNAME]=[LASTNAME] AND

 

Filter that T/F and if it doesn't meet the criteria then apply -1. 

 

From there your criteria simplifies greatly for your 4 others. Just drop the first part out if it doesn't meet the name conditions. 

 

Watermark
12 - Quasar
12 - Quasar

From there you can screen a piece at a time. Sometimes it's better to have a few tools and be clear at each step, instead of trying to impose a complex set of rules all wrapped together into 1. 

Labels
Top Solution Authors