
Alteryx Designer Desktop Discussions


Using Advanced Join Macro for Large Data Set

GauravRawal

Hi,

 

I am trying to join two tables, say T1 and T2. Table T1 has around 8 million records and T2 has 20k records. T1 has two fields, entrance_date and exit_date, and T2 has a field movement_time. All of these fields are in DateTime format.

 

I need to join the two tables on the below condition:

 

[entrance_date] <= [movement_time] && [exit_date] >= [movement_time]

 

I tried using the Advanced Join macro from the Analytics Gallery to join on the above condition, and the setup looks like the below:

 

Advanced Join.PNG

 

 

The join condition has been set up as shown below:

 

Join Condition.PNG

 

When I ran the workflow, it turned out that the join was extremely slow (possibly because of the cross join of a huge number of records) and was taking hours to complete.

 

Is there a way to improve the performance for the above problem?

 

Thanks

Inactive User

A few things you can try to improve performance:

 

  • Use the In-DB tools. This is what I would personally recommend if doing a big join on database tables.
  • If these are files, try writing them to YXDB first, then performing the join.
    • Or you can try using the Calgary format as well to help speed up the join.
  • Convert the dates to keys (numerical surrogate keys) and then perform the join; joining on a numeric key is faster than joining on a date field.

You could also try simplifying the join: join without the condition and put a Filter tool after the join that applies your condition and removes records that do not match it.
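For illustration, here is a minimal pandas sketch of that last pattern - an unconditional join followed by a filter - combined with the numeric surrogate key idea from the bullets above. The table and column names (visit_id, movement_id, and so on) are hypothetical stand-ins for T1 and T2, and this is only a mental model of what the Join + Filter tools would be doing, not an Alteryx-specific recipe:

```python
import pandas as pd

# Hypothetical stand-ins for T1 (intervals) and T2 (movements).
t1 = pd.DataFrame({
    "visit_id": [1, 2],
    "entrance_date": pd.to_datetime(["2017-05-01 08:00", "2017-05-02 09:00"]),
    "exit_date": pd.to_datetime(["2017-05-01 17:00", "2017-05-02 18:00"]),
})
t2 = pd.DataFrame({
    "movement_id": [10, 11, 12],
    "movement_time": pd.to_datetime(["2017-05-01 09:30", "2017-05-02 10:15", "2017-05-03 12:00"]),
})

# Convert the DateTime fields to integer surrogate keys (epoch seconds), so the
# range comparison happens on plain integers rather than DateTime values.
for df, cols in ((t1, ["entrance_date", "exit_date"]), (t2, ["movement_time"])):
    for c in cols:
        df[c + "_key"] = df[c].astype("int64") // 10**9

# Unconditional (cross) join, then a Filter-style step that applies the range
# condition and drops the records that do not match it.
joined = t1.merge(t2, how="cross")
result = joined[(joined["entrance_date_key"] <= joined["movement_time_key"])
                & (joined["exit_date_key"] >= joined["movement_time_key"])]
print(result[["visit_id", "movement_id", "movement_time"]])
```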

 

SeanAdams

Hey @GauravRawal,

As @Inactive User mentioned, doing this In-DB would certainly be quicker - especially because this gives you the option to apply indices to the tables, which is the single biggest speed uplift on any join.

 

If these tables are not both on the same server, you can stream the 20k one into the other server using the InDB tools.

 

If you need to keep this on the desktop for some reason (e.g. no access to write to the database), you can also look at a few other things:

- With a cross join like this you're creating 8M x 20k combinations - 160Bn row combinations, which is a lot of work for any machine to do by brute force, so you need to find ways to reduce this. How big is your expected result set? If it's 160Bn rows, then you may need to think through ways of clustering this query or forcing it into purpose-built tech - but if out of these 160Bn rows you're only expecting to return 1,000 or 2,000 rows, then your best solution is filtering the data before the join.

- Are you able to reduce the size of these two row sets by filtering out unneeded rows? For example, if these are events and you're only looking for events in May 2017, that should reduce the row set quite a lot.

- Is there any other key that you can use that will not explode into a cross join? Again, for example, can you enrich the T2 table with an event ID of some kind to make the join tighter? (There's a rough sketch of one way to build such a key just below.)
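To make that last point concrete, here is a rough pandas sketch of one way to manufacture a tighter key when there is no natural event ID: expand each T1 interval into one row per calendar day it covers, equi-join to T2 on that day bucket, and only then apply the exact range condition. This assumes the entrance-to-exit intervals each span only a handful of days; the names are hypothetical:

```python
import pandas as pd

t1 = pd.DataFrame({
    "visit_id": [1, 2],
    "entrance_date": pd.to_datetime(["2017-05-01 22:00", "2017-05-03 09:00"]),
    "exit_date": pd.to_datetime(["2017-05-02 06:00", "2017-05-03 18:00"]),
})
t2 = pd.DataFrame({
    "movement_id": [10, 11],
    "movement_time": pd.to_datetime(["2017-05-02 01:30", "2017-05-03 12:15"]),
})

# Expand each T1 interval into one row per calendar day it touches.
t1["day"] = [pd.date_range(s.normalize(), e.normalize(), freq="D").tolist()
             for s, e in zip(t1["entrance_date"], t1["exit_date"])]
t1 = t1.explode("day")
t1["day"] = pd.to_datetime(t1["day"])

# Bucket each T2 row by the same calendar day and equi-join on that key,
# so each movement only meets the intervals that touch its day.
t2["day"] = t2["movement_time"].dt.normalize()
candidates = t1.merge(t2, on="day")

# Apply the exact range condition to the much smaller candidate set.
result = candidates[(candidates["entrance_date"] <= candidates["movement_time"])
                    & (candidates["exit_date"] >= candidates["movement_time"])]
print(result[["visit_id", "movement_id"]])
```

In Alteryx terms this is roughly a Generate Rows tool to expand the intervals into days, a standard Join on the day field, and then a Filter for the exact condition - the point being that the join key keeps the candidate set far smaller than the full 160Bn-row cross join.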

 

Additionally - instead of bringing back all of the columns for the 8M and 20k rows from each set - it's always good to bring back only the primary key and the dates if those are the fields needed initially; you can always enrich with the other fields further down the workflow. This stops your 160Bn row cross join from having to carry dozens of other fields.
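As a small illustration of that column-trimming idea (again in pandas, with hypothetical field names), you would carry only the key and the two dates through the join and merge the wider fields back on at the end:

```python
import pandas as pd

t1_full = pd.DataFrame({
    "visit_id": [1, 2],
    "entrance_date": pd.to_datetime(["2017-05-01 08:00", "2017-05-02 09:00"]),
    "exit_date": pd.to_datetime(["2017-05-01 17:00", "2017-05-02 18:00"]),
    "site": ["North", "South"],             # wide descriptive fields ...
    "operator": ["A. Smith", "B. Jones"],   # ... that the join itself does not need
})

# Carry only the primary key and the two dates into the expensive join step.
t1_slim = t1_full[["visit_id", "entrance_date", "exit_date"]]

# ... the range join (as in the sketches above) runs on t1_slim and produces a
# narrow result; a stand-in result is used here just to show the re-join step.
result = pd.DataFrame({"visit_id": [1], "movement_id": [10]})

# Enrich with the remaining fields after the heavy lifting is done.
enriched = result.merge(t1_full.drop(columns=["entrance_date", "exit_date"]),
                        on="visit_id", how="left")
print(enriched)
```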

 

If either this or Ryan's ideas get you to a solution, would you mind marking this as solved (with the green "Solution" button)? If you are still having challenges, then feel free to reply and we'll iterate.

 

Cheers

Sean
