A database containing sales info needs to be cleaned where:
- exact duplicates need to be removed;
- duplicates where a sales type is different but sale price, year, and quarter are the same;
- duplicates where a sales type is different and quarter is different, but price is the same;
- duplicates where year, quarter, sale type are the same, but sale price increases. The entry with the higher value should remain.
Properties that are sold in a different year at a different price should remain. Properties that are sold in the same year at different quarters and are both S11 should remain. S11 sale type will be kept over other sale types where a duplicate exists. There are many adjoining columns containing additional info that need to kept intact. It would also be handy to view which entries have been removed.
Current Data
Sale Date | Sale Year | Sale Quarter | Sale Type | Sale Price | Address | Additional Info |
01/12/2021 | 2021 | Q4 | S11 | $30 | 30 Andrew Avenue | x |
01/12/2021 | 2021 | Q4 | S11 | $30 | 30 Andrew Avenue | x |
01/03/2014 | 2014 | Q1 | S13 | $70 | 123 Stella Drive | x |
01/03/2014 | 2014 | Q1 | S11 | $70 | 123 Stella Drive | x |
01/05/2010 | 2010 | Q2 | S13 | $55 | 99 Problems Road | x |
01/08/2010 | 2010 | Q3 | S11 | $55 | 99 Problems Road | x |
01/02/2005 | 2005 | Q1 | S12 | $32 | 99 Problems Road | x |
01/11/2019 | 2019 | Q4 | S11 | $64 | 1 Big Crescent | x |
01/11/2019 | 2019 | Q4 | S11 | $68 | 1 Big Crescent | x |
Desired Output
Sale Date | Sale Year | Sale Quarter | Sale Type | Sale Price Final | Address | Additional Info |
01/12/2021 | 2021 | Q4 | S11 | $30 | 30 Andrew Avenue | x |
01/03/2014 | 2014 | Q1 | S11 | $70 | 123 Stella Drive | x |
01/05/2010 | 2010 | Q3 | S11 | $55 | 99 Problems Road | x |
01/02/2005 | 2005 | Q1 | S12 | $32 | 99 Problems Road | x |
01/11/2019 | 2019 | Q4 | S11 | $68 | 1 Big Crescent | x |
Would really appreciate some help with this - even a partial solution / method of action for some of the cleaning would be great! Many thanks.
Solved! Go to Solution.
Hi @FilipO
Here's one option but I'm curious if anyone can come up with a more clever solution. Since the unique tool will keep the first unique record it finds, you can play around with sorting to apply your logic (i.e. sort the numbers high to low so the highest price is kept).
@FilipO I made an additional feature of bucketing the output and I put only unique tool so you could see records that are being compared before deduping. There seem to be more conditional cases that should be considered. Please provide the whole data set to anticipate the conditions. For now, feel pre to alter the workflow.
Hi @FilipO, here is my solution to your problem not sure how this will act on a bigger dataset but able to resolve the problem for the provided sample data and there are many conditions applicable on the dataset.
I hope this helps!
Thanks!