Duplicate identification based on conditions

Question

Good Day Everyone !

I have a dataset where I need to identify duplicates based on some conditions and use the oldest date for Tie breakers. Here is the Data

From a set of invoice numbers, I need to identify only AB/*. For all records with invoice numbers starting from AB we need to eliminate the duplicates while not touching all other formats of invoice numbers.

If there are multiple duplicates identified then i need to select the record that has the last modified date.

Here is the sample data

Invoice number Modified Date CA/1200123423.03.2023CA/1200123423.03.2023BD/12000123423.03.2023AP/12000123523.02.2023AB/12000123523.03.2023AB/1234567823.03.2023AB/1234567823.02.2023AB/1234567823.03.2023AB/1234567823.03.2023AB/708000075623.02.2023AB/708000075623.03.2023

In the above dataset, invoices starting with CA,BD,AP do not need to be touched. For the AP ones, we need to identify the duplicates and for multiple duplicates the record with the oldest date needs to come through. In the above example, the below records need to come through.

AB/1234567823.02.2023AB/708000075623.02.2023

Thanks for your suggestions.

ShankerV · Accepted Answer

Hi @AbhijeetChib

One way of doing this.

Many thanks

Shanker V

DataNath · Accepted Answer

Hey @AbhijeetChib, here's how I'd tackle this:

- Isolate AB records

- Parse the date into ISO format for Alteryx to be able to work with it

- Summarize to Group By [Invoice Number] and pull out the minimum (oldest) date

- Reformat the date into your original display

- Union the handled AB data back to the rest of your data

Raj · Accepted Answer

sol