Hi,
I am trying to join two tables, say T1 and T2. Table T1 has around 8 million records and T2 has 20k records. T1 has two fields, entrance_date and exit_date, and T2 has a field movement_time. All of the fields are in DateTime format.
I need to join the two tables on the condition below:
[entrance_date] <= [movement_time] && [exit_date] >= [movement_time]
I tried using the Advanced Join macro from the Analytics Gallery, with the join condition configured as above.
When I ran the workflow, the join turned out to be extremely slow (possibly because of the cross join over a huge number of records) and was taking hours to complete.
Is there a way to improve the performance for this problem?
Thanks
A few things you can try to improve performance:
- Move the join into the database with the In-DB tools, so the heavy lifting happens on the database server rather than in the workflow.
- Simplify the join: join without the condition, then put a Filter tool after the join that applies your condition and removes records that do not match it. A sketch of that logic is below.
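To illustrate the logic outside of Alteryx, here is a minimal pandas sketch of "join unconditionally, then filter"; the frame names and sample rows are made up for the example:

```python
import pandas as pd

# Tiny stand-ins for T1 (8M rows) and T2 (20k rows); column names follow the post.
t1 = pd.DataFrame({
    "entrance_date": pd.to_datetime(["2017-05-01 08:00", "2017-05-02 09:00"]),
    "exit_date":     pd.to_datetime(["2017-05-01 17:00", "2017-05-02 18:00"]),
})
t2 = pd.DataFrame({
    "movement_time": pd.to_datetime(["2017-05-01 12:00", "2017-05-03 12:00"]),
})

# Unconditional (cross) join first, then keep only the rows that satisfy
# entrance_date <= movement_time <= exit_date.
joined = t1.merge(t2, how="cross")
matched = joined[
    (joined["entrance_date"] <= joined["movement_time"])
    & (joined["exit_date"] >= joined["movement_time"])
]
print(matched)
```

Note that this still materializes the full cross product before filtering, so on 8M x 20k rows the main win is simplicity; cutting the row counts down before the join is what really helps speed.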
Hey @GauravRawal,
As @Inactive User mentioned, doing this InDB would certainly be quicker - especially because it gives you the option to apply indices to the tables, which is often the single biggest speed uplift on a join like this.
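For illustration only, here's a rough sketch of the kind of indexed range join the database would run, using Python's sqlite3 as a stand-in for your actual server (the table, column, and index names are assumptions based on the post):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in for the real database server
con.executescript("""
    CREATE TABLE t1 (t1_id INTEGER, entrance_date TEXT, exit_date TEXT);
    CREATE TABLE t2 (t2_id INTEGER, movement_time TEXT);
    -- The indices are the point: they let the engine seek into the date
    -- columns rather than brute-forcing every T1 x T2 combination.
    CREATE INDEX ix_t1_dates ON t1 (entrance_date, exit_date);
    CREATE INDEX ix_t2_time  ON t2 (movement_time);
""")

rows = con.execute("""
    SELECT t1.t1_id, t2.t2_id, t2.movement_time
    FROM t2
    JOIN t1
      ON t2.movement_time BETWEEN t1.entrance_date AND t1.exit_date
""").fetchall()
```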
If these tables are not both on the same server, you can stream the 20k one into the other server using the InDB tools.
If you need to keep this on the desktop for some reason (e.g. no access to write to the database), you can also look at a few other things:
- With a cross join like this you're creating 8M x 20k combinations - 160Bn row combinations - which is a lot of work for any machine to do by brute force, so you need to find ways to reduce this. How big is your expected result set? If it really is 160Bn rows, you may need to think through ways of clustering this query or forcing it into purpose-built tech; but if out of those 160Bn rows you only expect to return 1,000 or 2,000, then your best solution is filtering the data before the join.
- Are you able to reduce the size of these two row sets by filtering out unneeded rows? For example, if these are events and you're only looking at events in May 2017, that should shrink the row sets quite a lot (see the sketch after this list).
- Is there any other key you can use that will not explode into a cross join? For example, can you enrich the T2 table with an event ID of some kind to make the join tighter?
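A minimal pandas sketch of the filter-before-join idea, with made-up sample rows and a hypothetical May 2017 window of interest:

```python
import pandas as pd

t1 = pd.DataFrame({
    "entrance_date": pd.to_datetime(["2017-05-01 08:00", "2016-11-03 09:00"]),
    "exit_date":     pd.to_datetime(["2017-05-01 17:00", "2016-11-03 18:00"]),
})
t2 = pd.DataFrame({
    "movement_time": pd.to_datetime(["2017-05-01 12:00", "2016-12-25 12:00"]),
})

# Hypothetical window of interest: May 2017 only.
start, end = pd.Timestamp("2017-05-01"), pd.Timestamp("2017-06-01")

# Rows whose interval cannot overlap the window can never match,
# so drop them from both sides before the expensive cross join.
t1_small = t1[(t1["exit_date"] >= start) & (t1["entrance_date"] < end)]
t2_small = t2[t2["movement_time"].between(start, end)]

# The cross join now runs over far fewer combinations.
matched = t1_small.merge(t2_small, how="cross")
matched = matched[
    matched["movement_time"].between(
        matched["entrance_date"], matched["exit_date"]
    )
]
```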
Additionally - instead of bringing back every field of the full 8M and 20k rows from each set - it's always good to bring back only the primary key and the dates if those are the fields needed initially; you can then enrich with the other fields further down the workflow. This stops your 160Bn-row cross join from having to carry dozens of other fields. A sketch of that is below.
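Again as a rough pandas sketch (the t1_id key and the payload column are hypothetical):

```python
import pandas as pd

t1 = pd.DataFrame({
    "t1_id": [1, 2],  # hypothetical primary key
    "entrance_date": pd.to_datetime(["2017-05-01 08:00", "2017-05-02 09:00"]),
    "exit_date":     pd.to_datetime(["2017-05-01 17:00", "2017-05-02 18:00"]),
    "payload": ["dozens", "of other fields"],  # stands in for the wide columns
})
t2 = pd.DataFrame({
    "t2_id": [10, 11],
    "movement_time": pd.to_datetime(["2017-05-01 12:00", "2017-05-03 12:00"]),
})

# Cross join only the key + date columns, so each candidate row stays
# as narrow as possible during the expensive step.
narrow = t1[["t1_id", "entrance_date", "exit_date"]].merge(t2, how="cross")
matched = narrow[
    narrow["movement_time"].between(
        narrow["entrance_date"], narrow["exit_date"]
    )
]

# Enrich the (much smaller) matched set with the wide columns afterwards.
result = matched.merge(
    t1.drop(columns=["entrance_date", "exit_date"]), on="t1_id"
)
```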
If either this or Ryan's ideas get you to a solution, would you mind marking the thread as solved (with the green "Solution" button)? If you're still having challenges, feel free to reply and we'll iterate.
Cheers
Sean