Alteryx Designer Desktop Discussions

jrhess003 · ‎02-29-2024

Hi, I have two very large data sets (~1.5M rows, 70 columns) of YTD data that I need to join to be able to derive each subsequent month's results. I created a simplified dummy table of the data structure below to try to illustrate without getting into the actual context.

Table 1

Country	City	Fruit	Jan YTD Sales
United States	Houston	Apple	1
United States	Los Angeles	Apple	2
Canada	Calgary	Apple	3
Mexico	Mexico City	Orange	4

Table 2

Country	City	Fruit	Feb YTD Sales
United States	Houston	Apple	1
United States	Los Angeles	Apple	3
Canada	Calgary	Apple	4
Mexico	Mexico City	Orange	5
Costa Rica	San Jose	Mango	2

What I want to do is join the two data sets so I can take the differences of Feb YTD and Jan YTD sales for records that existed in January (i.e. United States Los Angeles Apple) but also add new records that first appeared in February (i.e. Costa Rica San Jose Mango) to create a data set with only February sales. My idea is to create a key like "CountryCityFruit" as the unique identifier between the two and take the difference between Feb YTD and Jan YTD sales for the Join records + Left records as Feb only.

Output

Country	City	Fruit	Sales	Month
United States	Houston	Apple	1	January
United States	Los Angeles	Apple	2	January
Canada	Calgary	Apple	3	January
Mexico	Mexico City	Orange	4	January
United States	Los Angeles	Apple	1	February
Canada	Calgary	Apple	1	February
Mexico	Mexico City	Orange	1	February
Costa Rica	San Jose	Mango	2	February

One question I have: is it more efficient to concatenate as many fields as possible as the key, or is it better to use fewer fields? The current join takes hours to process, and I wonder if there's a more efficient/less-process-intensive way to do this given I'll need to replicate this for every month this year.

Thanks in advance!

Bren_Spill · ‎02-29-2024

Hi @jrhess003 - please see attached a workflow that should help. You don't need to create a unique identifier, instead just put the three fields you want to join on into the Join tool configuration under Join by Specific Fields. And then union in the data that dropped out for Feb.

As for creating efficiencies, before doing the join, you can summarize based on the same fields you are planning to join on which should decrease the number of records running through the join tool.

Any questions please let me know!

(Note: Updated attachment to match your output)

Alteryx Designer Desktop Discussions

Joining large data sets - efficiency tips?