Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.

Joining large data sets - efficiency tips?

jrhess003
5 - Atom

Hi, I have two very large data sets (~1.5M rows, 70 columns) of YTD data that I need to join to be able to derive each subsequent month's results. I created a simplified dummy table of the data structure below to try to illustrate without getting into the actual context.

 

Table 1

CountryCityFruitJan YTD Sales
United StatesHoustonApple1
United StatesLos AngelesApple2
CanadaCalgaryApple3
MexicoMexico CityOrange4
    

 

Table 2

CountryCityFruitFeb YTD Sales
United StatesHoustonApple1
United StatesLos AngelesApple3
CanadaCalgaryApple4
MexicoMexico CityOrange5
Costa RicaSan JoseMango2

 

What I want to do is join the two data sets so I can take the differences of Feb YTD and Jan YTD sales for records that existed in January (i.e. United States Los Angeles Apple) but also add new records that first appeared in February (i.e. Costa Rica San Jose Mango) to create a data set with only February sales. My idea is to create a key like "CountryCityFruit" as the unique identifier between the two and take the difference between Feb YTD and Jan YTD sales for the Join records + Left records as Feb only.

 

Output

CountryCityFruitSalesMonth
United StatesHoustonApple1January
United StatesLos AngelesApple2January
CanadaCalgaryApple3January
MexicoMexico CityOrange4January
United StatesLos AngelesApple1February
CanadaCalgaryApple1February
MexicoMexico CityOrange1February
Costa RicaSan JoseMango2February

 

One question I have: is it more efficient to concatenate as many fields as possible as the key, or is it better to use fewer fields? The current join takes hours to process, and I wonder if there's a more efficient/less-process-intensive way to do this given I'll need to replicate this for every month this year.

 

Thanks in advance!

1 REPLY 1
Bren_Spill
12 - Quasar

Hi @jrhess003 - please see attached a workflow that should help. You don't need to create a unique identifier, instead just put the three fields you want to join on into the Join tool configuration under Join by Specific Fields. And then union in the data that dropped out for Feb.

 

As for creating efficiencies, before doing the join, you can summarize based on the same fields you are planning to join on which should decrease the number of records running through the join tool.

 

Any questions please let me know!

 

(Note: Updated attachment to match your output)

Labels