Hello community,
Maybe this is very simple, but I've looked at it for hours and I'm slowly losing my mind, so I need to ask for support.
What I need is to remove duplicates based on the reference so that only one record remains per reference, but I need to keep the line that has a comment, not just remove duplicates and keep the first one. If there are multiple commented lines for the same reference, I'd keep the commented line with the highest Age in the other column.
If multiple lines exist and none has a comment, I'd keep the one with the highest Age, but again there could be duplicates there too, so I need to keep only one.
The comments can be generated slightly differently each time, which also creates problems there.
Anyone fancy having a crack at it? :)
References | Counterparty Name | Latest Comment Text | Age |
55207225 | Jack Sparrow | | 0 |
55207225 | Jack Sparrow | | 4 |
55207225 | Jack Sparrow | | 5 |
55207324 | Jack Sparrow | | 5 |
55207324 | Jack Sparrow | | 5 |
55207324 | Jack Sparrow | | 0 |
89819654 | Jack Sparrow | | 192 |
90024782 | Persona J | | 7 |
90024782 | Persona J | | 12 |
91453297 | John Smith LTD | | 6 |
91453297 | John Smith LTD | | 14 |
91453297 | John Smith LTD | Client signed Please verify. | 39 |
91472861 | Music House LTD | Client signed Please verify….. | 6 |
91474819 | John Smith LTD | | 6 |
91474819 | John Smith LTD | Client signed Please verify. | 34 |
91485774 | John Smith LTD | | 6 |
91485774 | John Smith LTD | | 11 |
91485774 | John Smith LTD | | 14 |
91485774 | John Smith LTD | Client signed Please verify. | 45 |
91485774 | John Smith LTD | Client signed Please verify…... | 45 |
91485774 | John Smith LTD | Client signed Please verif---- | 34 |
91521718 | Jane Doe Inc. | No comment | 101 |
91539944 | ABC Corp | To be seen | 684 |
91544027 | XYZ, SA | | 0 |
91544027 | ABC Corp | Client signed Please verify. | 45 |
91544027 | ABC Corp | Client signed Please verify. | 13 |
91544027 | ABC Corp | NCR | 181 |
91544027 | ABC Corp | | 242 |
91544027 | ABC Corp | | 669 |
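Outside of Alteryx, the keep-rules described above can be sketched in plain Python. This is only an illustrative sketch, not the workflow itself, and the variable names are my own assumptions:

```python
# Rule: per reference, a commented row beats an uncommented one;
# ties are broken by the higher Age.
rows = [
    # (reference, counterparty, comment, age) -- a subset of the sample data
    ("55207225", "Jack Sparrow", "", 0),
    ("55207225", "Jack Sparrow", "", 4),
    ("55207225", "Jack Sparrow", "", 5),
    ("91453297", "John Smith LTD", "", 6),
    ("91453297", "John Smith LTD", "", 14),
    ("91453297", "John Smith LTD", "Client signed Please verify.", 39),
]

best = {}
for ref, name, comment, age in rows:
    # rank tuple: has-comment first, then age; tuple comparison does the rest
    rank = (comment != "", age)
    if ref not in best or rank > best[ref][0]:
        best[ref] = (rank, (ref, name, comment, age))

deduped = [record for _, record in best.values()]
```

With this ranking, `55207225` keeps the age-5 row (no comments exist for it) and `91453297` keeps the commented age-39 row.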
Thanks,
Marcos
Hi @abacon ,
Not sure this works. Looking at the outcome of your solution, ABC Corp for example was kept with age 669, when the right one to keep would've been age 181, the "NCR" comment.
I also don't really see the logic of the "Latest Comment is not null" filter, because its False output is empty for some reason yet is still connected to the Join tool.
Can you check?
@MarcosC_ Good call, turn off the replace nulls option in the Data Cleansing tool. I added the tool to remove punctuation; some of the lines were duplicates apart from the extra periods in them.
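For reference, the punctuation clean-up that makes near-identical comments (e.g. "Client signed Please verify." vs "Client signed Please verify…..") compare as equal could be sketched like this in Python. The `normalize` helper is hypothetical, not an Alteryx function:

```python
import string

def normalize(comment):
    # Strip ASCII punctuation plus the Unicode ellipsis, collapse whitespace,
    # and lowercase, so near-identical comments compare equal
    # (this mirrors the Data Cleansing "remove punctuation" step).
    cleaned = comment.translate(str.maketrans("", "", string.punctuation + "…"))
    return " ".join(cleaned.split()).lower()
```

Deduplicating on `normalize(comment)` instead of the raw text collapses rows that differ only in trailing periods.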
Hi @abacon ,
Removing that part in the data cleansing worked; however, I'm running into another issue with my actual dataset.
In some cases I have four identical references, same client, all lines commented, and the age is exactly the same, so all four lines still appear.
Also, the actual file has a lot more columns. I was trying to join those after your workflow example, but obviously that gets messed up.
How can I fix this scenario?
@MarcosC_ There may be a better way, but does a Unique tool with all the identical fields selected work? That will remove duplicates where the selected columns are equal.
Or you could try a Summarize tool, grouping on the fields you need; that will summarize all the duplicates together.
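As a rough illustration, the Unique-tool behaviour (keep the first row for each combination of the selected fields) can be sketched in Python; `unique_on` and the field names are my own illustrative choices:

```python
def unique_on(rows, key_fields):
    # Keep the first row for each combination of the selected fields,
    # like the Unique tool: later rows with an identical key are dropped.
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [
    {"ref": "91544027", "comment": "NCR", "age": 181},
    {"ref": "91544027", "comment": "NCR", "age": 181},  # exact duplicate
    {"ref": "91544027", "comment": "NCR", "age": 242},
]
deduped = unique_on(rows, ["ref", "comment", "age"])
```

Here the exact-duplicate second row is dropped, while the age-242 row survives because its key differs.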
Bacon
Yeah, sorting it through the Unique worked. After joining both datasets, running a Unique on reference + latest comment worked and cleared most duplicates.
Thanks
Hi @abacon ,
Still running into trouble. I tried multiple Unions and Uniques to remove them, but once the data gets joined, duplicates remain.
I've attached the files with masked data. I need to remove the duplicates while keeping the latest line that has a comment. If the age is the same and both the reference and the comment are identical or similar, I should keep just one, perhaps based on the system entry date (take the latest): that will be a unique date and hour, so it probably can't be mirrored the way these keep being.
I'm attaching the data after some cleaning, just before the duplicate-removal attempt, and then again after running it through your flow.
I need to keep all these columns.
Any ideas?
I would look at adding a Summarize tool, grouping on your relevant fields (the ones you want to keep), then taking a Max on the date field you want to use. This will remove duplicates and keep only one record; if multiple records come out, they aren't duplicates.
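The group-and-take-latest idea can be sketched in Python as well. This is only a sketch under the assumption that the system entry date is a unique timestamp per record; `latest_per_group` and the field names are hypothetical:

```python
from datetime import datetime

def latest_per_group(rows, group_fields, date_field):
    # Group on the chosen fields and keep the record with the max date,
    # mirroring a Summarize tool that groups and takes Max on the date.
    best = {}
    for row in rows:
        key = tuple(row[f] for f in group_fields)
        if key not in best or row[date_field] > best[key][date_field]:
            best[key] = row
    return list(best.values())

rows = [
    {"ref": "91544027", "comment": "NCR", "entry": datetime(2023, 5, 1, 9, 30)},
    {"ref": "91544027", "comment": "NCR", "entry": datetime(2023, 5, 2, 14, 5)},
]
kept = latest_per_group(rows, ["ref", "comment"], "entry")
```

Because the entry timestamp includes the hour, ties like the identical-age rows above resolve to a single latest record.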
Bacon