We have a dataset with a PrimaryKey, called AddressUuid, see http://prntscr.com/kdcxzo. A job runs every night with new statusses of the same AddressUuid. I want to dedouplicate based on AddressUuid and a date field. I found some documentation but was not able to https://cloud.google.com/dataprep/docs/html/Deduplicate-Data_57344590. Can you give me the correct recipe?
Row 1
AddressUuid xxx
Date 8-1-2018
Row 2
AddressUuid xxx
Date 7-31-2018
I only want to keep row 1, row 2 should be deleted..
Solved! Go to Solution.
Hi John,
it would take two steps to do this. The first is to create a ranking column, the second is to delete values where ranking column not equal to 1. This basically indicates your primary addressUuid (based on most recent date).
1) create a new column indicating the rank of each row for each AddressUuid. We will use the rownumber() function for this.
derive value: ROWNUMBER() group: AddressUuid order: -Date
the -Date ensures the column is descending, so more recent date first.
2) filter out the duplicates (values greater than 1)
filter type: greaterThan col: Accepts_CreditCards greaterThan: 1 action: Delete
Hope this helps!
Updated with the following how-to article based on this question: https://community.trifacta.com/s/article/Remove-duplicate-rows-based-on-a-primary-key