Good Day Everyone !
I have a dataset where I need to identify duplicates based on some conditions and use the oldest date for Tie breakers. Here is the Data
From a set of invoice numbers, I need to identify only AB/*. For all records with invoice numbers starting from AB we need to eliminate the duplicates while not touching all other formats of invoice numbers.
If there are multiple duplicates identified then i need to select the record that has the last modified date.
Here is the sample data
| Invoice number | Modified Date |
| CA/12001234 | 23.03.2023 |
| CA/12001234 | 23.03.2023 |
| BD/120001234 | 23.03.2023 |
| AP/120001235 | 23.02.2023 |
| AB/120001235 | 23.03.2023 |
| AB/12345678 | 23.03.2023 |
| AB/12345678 | 23.02.2023 |
| AB/12345678 | 23.03.2023 |
| AB/12345678 | 23.03.2023 |
| AB/7080000756 | 23.02.2023 |
| AB/7080000756 | 23.03.2023 |
In the above dataset, invoices starting with CA,BD,AP do not need to be touched. For the AP ones, we need to identify the duplicates and for multiple duplicates the record with the oldest date needs to come through. In the above example, the below records need to come through.
| AB/12345678 | 23.02.2023 |
| AB/7080000756 | 23.02.2023 |
Thanks for your suggestions.