Hi Folks !
I am working on a project where i have one file with a unique ID within different date ranges. The ask is to group the IDs basis the date ranges and count the unique sets. The complexity is that if start date of two sets with similar ID is within 3 days, then the sets can be counted as one, similarly if the end date of two sets with same IDs is within 2 days then they can be counted as one.
Example:
Sample Data:
Set # | ID | Start Date | End Date | Original count |
1 | 1 | 2023-11-23 | 2023-11-29 | 1 |
2 | 1 | 2023-10-13 | 2023-10-15 | 1 |
3 | 1 | 2023-11-25 | 2023-11-30 | 1 |
4 | 1 | 2023-11-26 | 2023-11-31 | 1 |
5 | 1 | 2023-11-26 | 2023-12-04 | 1 |
Expected Output:
Set# | ID | Start Date | End Date | Final Count | Inference |
1 | 1 | 2023-11-23 | 2023-11-29 | 2 | Here 2 sets of ID '1' are clubbed i.e. Set 1 and Set 3 hence count increased to 2. The reason they are clubbed is because the start date is within a 3 day range of start date of set 1 and does not exceed the end date by 2 days of set 1 |
2 | 1 | 2023-10-13 | 2023-10-15 | 1 | No date range match |
4 | 1 | 2023-11-26 | 2023-11-31 | 1 | Start date outside 3 days range |
5 | 1 | 2023-11-26 | 2023-12-04 | 1 | End date outside 2 days range |
Note: The earliest date from a set of similar ID is taken as start date and the last date of that same set is considered.
Please share your thoughts on how to accomplish this. I have huge files of such data sets
User | Count |
---|---|
52 | |
27 | |
25 | |
24 | |
21 |