Hi all,
Many of our data sets carry pairs of date-time stamps denoting each record's validity window - so for each key (e.g. person ID) you have multiple updates distinguished by those dates (much like a slowly changing dimension table).
So to find the latest version of a record you could:
- Summarize to find the max date for each person ID, and then join this back to the main data set; or
- Sort by the key ascending and by the validity-from date descending, and then use a Unique tool to filter out the duplicates (the thinking being that, sorted this way, the Unique tool keeps the first instance of each person ID and throws away the rest).
However, it seems that the second approach doesn't guarantee the maximum, because records could start flowing through the Unique tool before the sort has finished. I'm not sure whether it would be guaranteed to work if we added a Block Until Done between the Sort and the Unique, so that the Unique is guaranteed to receive the fully sorted set.
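For anyone who thinks better in code, here is a minimal pandas sketch of the two approaches (the column names person_id and valid_from, and the toy data, are my assumptions, not our real schema). Note that pandas fully materialises the sort before drop_duplicates runs, so the streaming concern above can't arise there - my question is whether the same holds for the tools in the engine.

```python
import pandas as pd

# Toy data: two updates for person 1, one for person 2 (made-up values).
df = pd.DataFrame({
    "person_id": [1, 1, 2],
    "valid_from": pd.to_datetime(["2023-01-01", "2023-06-01", "2023-03-01"]),
    "value": ["old", "new", "only"],
})

# Approach 1: summarize the max validity date per key, then join back.
max_dates = df.groupby("person_id", as_index=False)["valid_from"].max()
latest_via_join = df.merge(max_dates, on=["person_id", "valid_from"])

# Approach 2: sort key ascending / date descending, keep first row per key.
latest_via_sort = (
    df.sort_values(["person_id", "valid_from"], ascending=[True, False])
      .drop_duplicates(subset="person_id", keep="first")
)

# Both should yield one row per person, carrying the latest valid_from.
print(latest_via_join)
print(latest_via_sort)
```

(One design note: if two updates for the same person share the same valid_from, approach 1 keeps both tied rows while approach 2 keeps only one - the same difference you'd see between the tool-based versions.)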
Have any of you experienced this, and done any testing to confirm whether the sort/unique method guarantees the same outcome?
At a gut level the sort/unique route feels more computationally efficient (in a big-O sense, though that depends on how it's handled under the hood) and it looks neater too - but if it isn't reliable then it's back to the old summarize and join... :-)
Thank you for the reply - I've just run some tests on a 180M-row set and confirmed what you said (the same result with and without a Block Until Done, and exactly the same result as a Summarize with a join back to the set, but about 4x faster using the sort and unique).
The way that you generated dates in a loop was also helpful - I've not seen that done before, but it makes sense (and it has opened up my thinking about what the row generator tool can do).
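For others who land here later: the loop-based date generation can be sketched in plain Python too. This is my own analogue of the init/loop/condition pattern the row generator tool uses, not the exact workflow from the reply - the function name, date range, and step size are all assumptions for illustration.

```python
from datetime import date, timedelta

# Hypothetical analogue of generating one row per date in a loop:
# an initialisation expression, a loop condition, and a step expression.
def generate_dates(start: date, end: date, step_days: int = 1):
    current = start                            # initialisation
    while current <= end:                      # loop condition
        yield current
        current += timedelta(days=step_days)   # step expression

# Example: one row per day for January 2024.
for d in generate_dates(date(2024, 1, 1), date(2024, 1, 31)):
    print(d.isoformat())
```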
Appreciate it, thank you
Sean