Hi Nikos,
In this case, the Generate Rows tool would limit the amount of data. However, I don't typically use that tool as a first pass.
Each region's range has 20 Postal Areas, so generating rows for each region multiplies that data by 20. I multiplied by 5 instead. Although Generate Rows applies its 20x to only 5 regions while I apply 5x to a larger data set, my solution scales much better in real-world applications.
Imagine a case in which you have a date, cost, and item. The dataset is 30MM+ records. You want to sum the costs per item over a rolling period of 12 months. If you create a date 12 months back for each row, use the Generate Rows tool to duplicate each row 365 times, and then self-join, the join will be monstrous, use all of your machine's memory, and crash 10 hours later. You are multiplying 30MM+ records by 365 to get almost 11 billion records, and then self-joining them. Alternatively, you can join the data to itself on item only, which creates a Cartesian product within each item, and then filter out the records that fall outside the rolling 12-month period. Now you can finally sum.
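For anyone who wants to see the join-then-filter idea outside of Alteryx, here is a minimal sketch in plain Python. It is only an illustration, not the production workflow: the function names, the `(item, date, cost)` tuple layout, and the sample figures are all made up for this example, and I'm assuming "12 months back" means the same calendar day one year earlier, with the start of the window excluded.

```python
from collections import defaultdict
from datetime import date


def twelve_months_before(d):
    """Return the same calendar day one year earlier (Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year - 1)
    except ValueError:
        return d.replace(year=d.year - 1, day=28)


def rolling_12_month_cost(records):
    """Sum cost per item over the trailing 12 months ending at each record's date.

    records: iterable of (item, date, cost) tuples (hypothetical layout).
    Returns a dict mapping (item, date) to the summed cost.
    """
    # Group by item first. This plays the role of joining on item only:
    # each row is paired with rows for the same item, never across items.
    by_item = defaultdict(list)
    for item, day, cost in records:
        by_item[item].append((day, cost))

    totals = {}
    for item, rows in by_item.items():
        for day, _ in rows:
            start = twelve_months_before(day)
            # Filter the per-item pairs down to the rolling window, then sum.
            totals[(item, day)] = sum(c for d, c in rows if start < d <= day)
    return totals


# Made-up sample data, purely for illustration.
sales = [
    ("A", date(2022, 1, 15), 10),
    ("A", date(2022, 6, 15), 20),
    ("A", date(2023, 3, 15), 30),
    ("B", date(2023, 1, 1), 5),
]
result = rolling_12_month_cost(sales)
```

The intermediate size here depends on how many rows each item has, not on the 365 days in the window, which is the point of joining on item and filtering afterwards.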
I experimented with both solutions and put the second one into production some time ago. While Generate Rows would work on this artificially small and contrived dataset, it wouldn't scale well. Finally, it wouldn't be much fun if I posted the same old solution as everyone else.
Think this is a bit easier now