I am in need of finding total duplicates in a CSV file where there is multiple criteria for what is considered a duplicate.
This is what I need to check against using a CSV that has millions of records.
IF
(!IsEmpty([FIRSTNAME]) AND [FIRSTNAME]=[FIRSTNAME] AND
!IsEmpty([LASTNAME]) AND [LASTNAME]=[LASTNAME] AND
!IsEmpty([BILLINGADDRESS1]) AND [BILLINGADDRESS1]=[BILLINGADDRESS1] AND
!IsEmpty([BILLINGZIPCODE]) AND [BILLINGZIPCODE]=[BILLINGZIPCODE]
)
OR
(!IsEmpty([FIRSTNAME]) AND [FIRSTNAME]=[FIRSTNAME] AND
!IsEmpty([LASTNAME]) AND [LASTNAME]=[LASTNAME] AND
!IsEmpty([BILLINGCITY]) AND [BILLINGCITY]=[BILLINGCITY] AND
!IsEmpty([BILLINGZIPCODE]) AND [BILLINGZIPCODE]=[BILLINGZIPCODE]
)
OR
(!IsEmpty([FIRSTNAME]) AND [FIRSTNAME]=[FIRSTNAME] AND
!IsEmpty([LASTNAME]) AND [LASTNAME]=[LASTNAME] AND
!IsEmpty([BILLINGZIPCODE]) AND [BILLINGZIPCODE]=[BILLINGZIPCODE]
)
OR
(!IsEmpty([FIRSTNAME]) AND [FIRSTNAME]=[FIRSTNAME] AND
!IsEmpty([LASTNAME]) AND [LASTNAME]=[LASTNAME] AND
!IsEmpty([PHONE]) AND [PHONE]=[PHONE]
)
OR
(!IsEmpty([FIRSTNAME]) AND [FIRSTNAME]=[FIRSTNAME] AND
!IsEmpty([LASTNAME]) AND [LASTNAME]=[LASTNAME] AND
!IsEmpty([EMAIL]) AND [EMAIL]=[EMAIL]
)
THEN 1 ELSE -1 ENDIF
Based on the above criteria the below would be True or False, if I did not miss anything in my checking.
hey @stock
Sorry if im misunderstanding - but you may have some joy with the unique tool - this can be used to filter out duplicates.
https://help.alteryx.com/designer-cloud/unique-tool
You can then join this data with your previous data, to filter out any duplicate results (or flag them in this case).
It may be easier to use than a long formula.
Hope this helps!
TheOC
@theoc
Thanks for the comment. I have looked at and used the unique tool for one simple comparison, but in the situation I posted, there is the possibility of finding duplicates based on 5 different scenarios (data sets) where a duplicate can be identified, as I indicated with my code sample.
Maybe I am over-complicating it. How would you go about identifying duplicates based on the possibility of 5 different combinations? I don't see such an option to do this kind of comparison using the unique tool, and that is why I went with the formula tool. One hang up is that unique tool only allows me to select certain columns, but I don't see any way to check and ensure that the values are not blank. That is the first retirement - I do not want to compare if the value(s) are blank.
Being a newbie, I would love some help and explanation. I don't know the tool very well, so please don't assume I know what certain things do. I prefer that you state all that you are thinking and referring to so that I am not left questioning.
Thanks.
Hi @stock, you need one filter and one unique tool for each criteria. If you provide a sample dataset we will be able to provide a sample workflow.
Here is the same sample data that I included in the screenshot.
FIRSTNAME | LASTNAME | BILLINGADDRESS1 | BILLINGZIPCODE | BILLINGCITY | PHONE | |
John | Dean | 123 main Street | 29210 | anywhere | 5554444444 | john@email.com |
John | Dean | 123 main Street | 29210 | anywhere | 5554444444 | john@email.com |
James | Smith | 444 west ave | 12345 | somewhere | 2223334444 | james@email.com |
Bob | Bird | 555 w 1000 s | 55548 | 6665552222 | bob@email.com | |
Bob | Bird | 555 w 1000 s | hometown | 6665552222 | bob@email.com | |
M | Hanks | 222 18th ave | 66589 | nowhere | 6665552222 | mrhanks@email.com |
Mary | Hanks | 222 18th ave | 66589 | nowhere | 6665552222 | maryh@email.com |
Bary | Jones | 510 laurel street | 55878 | outback | 5558887777 | bary@email.com |
Bary | Jones | 55878 | outback | 5558887777 | bary@email.com | |
Darrel | Williams | 555 another street | 87898 | lower fourty | 5558887777 | bary@email.com |
Darline | Puts | 3335468888 | darline@email.com | |||
Darline | Puts | 3335468888 | darline@email.com | |||
Henry | Barks | 245 west winding way | 54898 | none | 5789789988 | henry@email.com |
Ralphie | Foo | 6589658965 | ||||
Ralphie | Foo | 6589658965 |
Even without a data set, looking at what you have above, I'd be more inclined to leverage at least a filter to start off with. Everyone of the above have the commonality of:
!IsEmpty([FIRSTNAME]) AND [FIRSTNAME]=[FIRSTNAME]
AND
!IsEmpty([LASTNAME]) AND [LASTNAME]=[LASTNAME] AND
Filter that T/F and if it doesn't meet the criteria then apply -1.
From there your criteria simplifies greatly for your 4 others. Just drop the first part out if it doesn't meet the name conditions.
From there you can screen a piece at a time. Sometimes it's better to have a few tools and be clear at each step, instead of trying to impose a complex set of rules all wrapped together into 1.