Bring your best ideas to the AI Use Case Contest! Enter to win 40 hours of expert engineering support and bring your vision to life using the powerful combination of Alteryx + AI. Learn more now, or go straight to the submission form.
Start Free Trial

Alteryx Designer Cloud Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Cloud.
SOLVED

Recipe for deleting duplicate rows based on 1 column (PrimaryKey)

TrifactaUsers
10 - Fireball

We have a dataset with a PrimaryKey, called AddressUuid, see http://prntscr.com/kdcxzo. A job runs every night with new statusses of the same AddressUuid. I want to dedouplicate based on AddressUuid and a date field. I found some documentation but was not able to https://cloud.google.com/dataprep/docs/html/Deduplicate-Data_57344590. Can you give me the correct recipe? 

 

Row 1

AddressUuid xxx

Date 8-1-2018

 

Row 2

AddressUuid xxx

Date 7-31-2018

 

I only want to keep row 1, row 2 should be deleted..

2 REPLIES 2
Trifacta_Alumni
Alteryx Alumni (Retired)

Hi John,

 

it would take two steps to do this. The first is to create a ranking column, the second is to delete values where ranking column not equal to 1. This basically indicates your primary addressUuid (based on most recent date).

 

1) create a new column indicating the rank of each row for each AddressUuid. We will use the rownumber() function for this.

derive value: ROWNUMBER() group: AddressUuid order: -Date

the -Date ensures the column is descending, so more recent date first.

 

2) filter out the duplicates (values greater than 1)

filter type: greaterThan col: Accepts_CreditCards greaterThan: 1 action: Delete

 

Hope this helps!

Trifacta_Alumni
Alteryx Alumni (Retired)

Updated with the following how-to article based on this question: https://community.trifacta.com/s/article/Remove-duplicate-rows-based-on-a-primary-key