This site uses different types of cookies, including analytics and functional cookies (its own and from other sites). To change your cookie settings or find out more, click here. If you continue browsing our website, you accept these cookies.
A frequent use case we get asked about is how to join two data sets on a date range where one data set has a start date field and an end date field and the other dataset has a just a date field. While at first this may seem like a tough problem since the join tool only blends the data together if the records on the fields you’re joining on match. But really, all we need to do is find a way to generate the missing dates in the range so that we can get a match from both data sets. Luckily there is a tool for that called Generate Rows.
Below is an example I created in 10.6 that illustrates two ways to use generate rows to perform a join on a date range.
In this workflow we are joining a set of customer transaction data and we want to join on a second dataset that tells us when and how long a particular sale went on.
First dataset Second dataset
The first way we can solve this problem is use generate rows directly on our second dataset with this configuration:
To make sense of this to those who are new to generate rows:
We are creating a new field in the second dataset called “Generate Date”.
Our initial value will start when the sale started so we will use “Sale start” as our Initialization Expression. We will create new rows for every record until “Generate Date” equals the end of the sale or “Sale End”, thus our condition expression is [Generated Date]
To show you what this looks like, for our first record since the sale went from 02/01/2015-02/07/2015, we now have 7 duplicate records where the only thing that changes between them is the generated date for which we now have every date in the range in its own separate record. If you’re still having trouble understanding how this was done there is an excellent example that is almost the exact same in the sample workflow for Generate Rows. To access that simple right click on Generate Rows in the tool palette and click “Open Example”.
Now that we have every date between the ranges, we can do a join with “Generated Date” and the “Transaction Date”.
And voila, we now have successfully joined the two datasets.
A potential issue of this is if you are working on a very large dataset with a lot of duplicate ranges this can put a huge damper on your performance and may cause errors due to limitations on your machine. In the example above 5 records became 37 after generate rows happened. Obviously depending on your ranges this can really cause your dataset to increase exponentially. If you’re working with an abnormally large dataset I would suggest doing the method below as it won’t increase the size of any of your data sets.
This method involves just two more steps.
After the dataset with the ranges we perform a summarize and groupby by both “Sale start” and “Sale end”
This will give you every unique combination of ranges in case you have multiple records with the same records. We then use generate rows with the same exact configuration as above and we should now have our “Generated Date field”. Now that we have this field, we can join it back to our customer dataset with the transaction dates.
From here though we have to perform one more join since we are missing the other information that was dropped out from summarizing the dates. This time we join on the Sale start and Sale End fields as we want to exact matches to those records.