Hi, I have two very large data sets (~1.5M rows, 70 columns) of YTD (year-to-date) data that I need to join in order to derive each month's standalone results. I created simplified dummy tables of the data structure below to illustrate without getting into the actual context.
Table 1
| Country | City | Fruit | Jan YTD Sales |
|---|---|---|---|
| United States | Houston | Apple | 1 |
| United States | Los Angeles | Apple | 2 |
| Canada | Calgary | Apple | 3 |
| Mexico | Mexico City | Orange | 4 |

Table 2
| Country | City | Fruit | Feb YTD Sales |
|---|---|---|---|
| United States | Houston | Apple | 1 |
| United States | Los Angeles | Apple | 3 |
| Canada | Calgary | Apple | 4 |
| Mexico | Mexico City | Orange | 5 |
| Costa Rica | San Jose | Mango | 2 |
What I want to do is join the two data sets so that, for records that already existed in January (e.g. United States / Los Angeles / Apple), I can take the difference between Feb YTD and Jan YTD sales, and also bring in records that first appeared in February (e.g. Costa Rica / San Jose / Mango). The result should be a data set containing only February's sales. My idea is to concatenate the fields into a key like "CountryCityFruit" as the unique identifier between the two tables, then take Feb YTD minus Jan YTD for the matched (join) records, and carry the unmatched February-only (left) records through as-is.
Output
| Country | City | Fruit | Sales | Month |
|---|---|---|---|---|
| United States | Houston | Apple | 1 | January |
| United States | Los Angeles | Apple | 2 | January |
| Canada | Calgary | Apple | 3 | January |
| Mexico | Mexico City | Orange | 4 | January |
| United States | Los Angeles | Apple | 1 | February |
| Canada | Calgary | Apple | 1 | February |
| Mexico | Mexico City | Orange | 1 | February |
| Costa Rica | San Jose | Mango | 2 | February |
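You didn't mention which tool you're using, so here is a sketch of the approach in pandas, assuming the column names from the dummy tables above. A left join from February (the superset) back to January gives every February record; new records get a missing January value, which is treated as 0 before taking the difference:

```python
import pandas as pd

# Dummy data mirroring the tables above (column names assumed).
jan = pd.DataFrame({
    "Country": ["United States", "United States", "Canada", "Mexico"],
    "City": ["Houston", "Los Angeles", "Calgary", "Mexico City"],
    "Fruit": ["Apple", "Apple", "Apple", "Orange"],
    "Jan YTD Sales": [1, 2, 3, 4],
})
feb = pd.DataFrame({
    "Country": ["United States", "United States", "Canada", "Mexico", "Costa Rica"],
    "City": ["Houston", "Los Angeles", "Calgary", "Mexico City", "San Jose"],
    "Fruit": ["Apple", "Apple", "Apple", "Orange", "Mango"],
    "Feb YTD Sales": [1, 3, 4, 5, 2],
})

key = ["Country", "City", "Fruit"]

# Left-join February back to January; records new in February get NaN
# for the January YTD, which counts as 0 when differencing.
merged = feb.merge(jan, on=key, how="left")
merged["Sales"] = merged["Feb YTD Sales"] - merged["Jan YTD Sales"].fillna(0)
merged["Month"] = "February"

# The sample output omits zero-change rows (Houston), so filter them out.
feb_only = merged.loc[merged["Sales"] != 0, key + ["Sales", "Month"]]

# Stack January's rows on top to reproduce the Output table.
jan_out = jan.rename(columns={"Jan YTD Sales": "Sales"}).assign(Month="January")
result = pd.concat([jan_out, feb_only], ignore_index=True)
print(result)
```

To replicate this for every month, the same merge can run in a loop over consecutive month pairs, always joining the later month back to the earlier one.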
One question I have: is it more efficient to concatenate as many fields as possible into the key, or is it better to use fewer fields? The current join takes hours to process, and I wonder whether there's a less process-intensive way to do this, since I'll need to replicate it for every month this year.
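On the key question: you generally want the *fewest* fields that still uniquely identify a row, and in many tools you don't need a concatenated key at all, because you can join on several columns directly. Building a concatenated string key forces an extra pass that allocates ~1.5M new strings per table before the join even starts. A small pandas sketch (frame and column names assumed) showing that both approaches match the same rows:

```python
import pandas as pd

# Tiny frames standing in for the ~1.5M-row tables.
jan = pd.DataFrame({
    "Country": ["United States", "Canada"],
    "City": ["Houston", "Calgary"],
    "Fruit": ["Apple", "Apple"],
    "Jan YTD Sales": [1, 3],
})
feb = pd.DataFrame({
    "Country": ["United States", "Canada", "Costa Rica"],
    "City": ["Houston", "Calgary", "San Jose"],
    "Fruit": ["Apple", "Apple", "Mango"],
    "Feb YTD Sales": [1, 4, 2],
})
key_cols = ["Country", "City", "Fruit"]

# Option A: build a concatenated string key first. This allocates a new
# string per row (millions of them at full scale) before joining.
jan_a = jan.assign(Key=jan[key_cols].agg("|".join, axis=1))
feb_a = feb.assign(Key=feb[key_cols].agg("|".join, axis=1))
via_string = feb_a.merge(jan_a.drop(columns=key_cols), on="Key", how="left")

# Option B: merge directly on the column list -- no derived key needed.
via_columns = feb.merge(jan, on=key_cols, how="left")

# Both produce the same matches; B skips the string-building pass.
assert via_string["Jan YTD Sales"].equals(via_columns["Jan YTD Sales"])
```

At full scale, casting repetitive text columns (Country, Fruit) to a categorical/dictionary-encoded type before joining also tends to cut memory and speed things up, though whether that applies depends on the tool you're actually using.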
Thanks in advance!