Hi, I have two very large data sets (~1.5M rows, 70 columns) of YTD (year-to-date) data that I need to join in order to derive each month's standalone results. I created simplified dummy tables of the data structure below to illustrate without getting into the actual context.
Table 1
| Country | City | Fruit | Jan YTD Sales |
|---|---|---|---|
| United States | Houston | Apple | 1 |
| United States | Los Angeles | Apple | 2 |
| Canada | Calgary | Apple | 3 |
| Mexico | Mexico City | Orange | 4 |

Table 2
| Country | City | Fruit | Feb YTD Sales |
|---|---|---|---|
| United States | Houston | Apple | 1 |
| United States | Los Angeles | Apple | 3 |
| Canada | Calgary | Apple | 4 |
| Mexico | Mexico City | Orange | 5 |
| Costa Rica | San Jose | Mango | 2 |
What I want to do is join the two data sets so that, for records that already existed in January (e.g. United States / Los Angeles / Apple), I can take the difference between Feb YTD and Jan YTD sales, and also bring in records that first appeared in February (e.g. Costa Rica / San Jose / Mango). The result should be a data set containing only February's sales. My idea is to concatenate the fields into a key like "CountryCityFruit" as the unique identifier between the two tables, then take Feb YTD minus Jan YTD for the matched (join) records, and carry the unmatched February-only (left) records through as-is.
Output
| Country | City | Fruit | Sales | Month |
|---|---|---|---|---|
| United States | Houston | Apple | 1 | January |
| United States | Los Angeles | Apple | 2 | January |
| Canada | Calgary | Apple | 3 | January |
| Mexico | Mexico City | Orange | 4 | January |
| United States | Los Angeles | Apple | 1 | February |
| Canada | Calgary | Apple | 1 | February |
| Mexico | Mexico City | Orange | 1 | February |
| Costa Rica | San Jose | Mango | 2 | February |
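You didn't mention which tool you're using, so here is a sketch of the approach in pandas, assuming the column names from the dummy tables above. A left join from February (the superset) back to January gives every February record; new records get a missing January value, which is treated as 0 before taking the difference:

```python
import pandas as pd

# Dummy data mirroring the tables above (column names assumed).
jan = pd.DataFrame({
    "Country": ["United States", "United States", "Canada", "Mexico"],
    "City": ["Houston", "Los Angeles", "Calgary", "Mexico City"],
    "Fruit": ["Apple", "Apple", "Apple", "Orange"],
    "Jan YTD Sales": [1, 2, 3, 4],
})
feb = pd.DataFrame({
    "Country": ["United States", "United States", "Canada", "Mexico", "Costa Rica"],
    "City": ["Houston", "Los Angeles", "Calgary", "Mexico City", "San Jose"],
    "Fruit": ["Apple", "Apple", "Apple", "Orange", "Mango"],
    "Feb YTD Sales": [1, 3, 4, 5, 2],
})

key = ["Country", "City", "Fruit"]

# Left-join February back to January; records new in February get NaN
# for the January YTD, which counts as 0 when differencing.
merged = feb.merge(jan, on=key, how="left")
merged["Sales"] = merged["Feb YTD Sales"] - merged["Jan YTD Sales"].fillna(0)
merged["Month"] = "February"

# The sample output omits zero-change rows (Houston), so filter them out.
feb_only = merged.loc[merged["Sales"] != 0, key + ["Sales", "Month"]]

# Stack January's rows on top to reproduce the Output table.
jan_out = jan.rename(columns={"Jan YTD Sales": "Sales"}).assign(Month="January")
result = pd.concat([jan_out, feb_only], ignore_index=True)
print(result)
```

To replicate this for every month, the same merge can run in a loop over consecutive month pairs, always joining the later month back to the earlier one.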
One question I have: is it more efficient to concatenate as many fields as possible into the key, or is it better to use fewer fields? The current join takes hours to process, and I wonder whether there's a less process-intensive way to do this, since I'll need to replicate it for every month this year.
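On the key question: you generally want the *fewest* fields that still uniquely identify a row, and in many tools you don't need a concatenated key at all, because you can join on several columns directly. Building a concatenated string key forces an extra pass that allocates ~1.5M new strings per table before the join even starts. A small pandas sketch (frame and column names assumed) showing that both approaches match the same rows:

```python
import pandas as pd

# Tiny frames standing in for the ~1.5M-row tables.
jan = pd.DataFrame({
    "Country": ["United States", "Canada"],
    "City": ["Houston", "Calgary"],
    "Fruit": ["Apple", "Apple"],
    "Jan YTD Sales": [1, 3],
})
feb = pd.DataFrame({
    "Country": ["United States", "Canada", "Costa Rica"],
    "City": ["Houston", "Calgary", "San Jose"],
    "Fruit": ["Apple", "Apple", "Mango"],
    "Feb YTD Sales": [1, 4, 2],
})
key_cols = ["Country", "City", "Fruit"]

# Option A: build a concatenated string key first. This allocates a new
# string per row (millions of them at full scale) before joining.
jan_a = jan.assign(Key=jan[key_cols].agg("|".join, axis=1))
feb_a = feb.assign(Key=feb[key_cols].agg("|".join, axis=1))
via_string = feb_a.merge(jan_a.drop(columns=key_cols), on="Key", how="left")

# Option B: merge directly on the column list -- no derived key needed.
via_columns = feb.merge(jan, on=key_cols, how="left")

# Both produce the same matches; B skips the string-building pass.
assert via_string["Jan YTD Sales"].equals(via_columns["Jan YTD Sales"])
```

At full scale, casting repetitive text columns (Country, Fruit) to a categorical/dictionary-encoded type before joining also tends to cut memory and speed things up, though whether that applies depends on the tool you're actually using.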
Thanks in advance!