nav[aria-label="Primary Navigation"] { padding: 0; & ul { list-style: none; width: 100%; display: flex; flex-direction: row; justify-content: start; align-items: start; gap: 30px; padding: 0; & li { margin: 0; } & ul li { list-style: none; } } }

In Database unions/joins - performance bottleneck

RyanNewsome

Hi, I'm using the in database tools and coming across a performance issue, wondering if anyone has experienced similar?

This isn't the exact scenario, but shows a close enough example.

There are 2 tables in an Amazon AWS Redshift server:

ProductRecords

--------------------

personid | product type | startdate | enddate

ActionRecords

--------------------

personid | action type | action date

I need to find out what product type was active at the time of action records.

So I need to join the actions to the product records, by personid and also ensure that action date falls between ProductRecords.startdate and ProductRecords.enddate.

If there is no corresponding Product record for the action, I still want to keep the action record.

Alteryx doesn't yet allow conditional joins, so I do the following:

1. Use In DB join tool (Left outer) to join the 2 tables by person id, left side being the action records so the action is still kept if there are no product record for the personid

2. Use a filter tool on the results from step 1, in order to send the records with NULL product type value to T output

3. For the records sent to F output in step 2, they are sent to another filter tool so only records where actiondate falls between startdate and end date are kept

4. Union the T output from step 2, with the results from filter in step 3 It works fast up until the union step in step

Even using a small subset of data (less than 10,000 rows), it takes atleast 5 minutes for the union tool to show as complete. It must be the way the nested sql statements are being created in the background?

I've also tried doing something similar by using multiple joins and came across same bottleneck.

I'm also getting the performance bottlenecks when using in DB unions later on in the workflow

Any ideas? It would be great if it was possible to see what SQL statements are being generated in the background which could assist in figuring out where the problem is.

in DB bottleneck on union tool.PNG

In Database

In-Database

Accepted answers

RyanNewsome

The solution ended up being getting the settings of our redshift user changed so the first schema in the priority order was a schema where we have CREATE TABLE access.

This was required as trying to push data to a temp table using the "create temp table" option in the "Write Data InDB" tool, it is actualy creating a permanant table in Redshift. This has been emailed to Alteryx client services.

Thanks to Kane Glendenning for point us in the right direction regarding the in db tool using the redshift bulk loader for write actions, which appears as if CREATE TABLE access is required on the schema (which we didn't have on the schema that used to be 1st priority for our user)

Also info on being able to see the SQL generated in the background - http://community.alteryx.com/t5/Data-Preparation-Blending/View-SQL-Generated-by-In-DB-Tools/m-p/5682/highlight/true#M867

All comments

RyanNewsome

*Correction, above would lose records if a action record was joined to a product record but then was discarded due to failing the date check in the filter tool.

I tried doing it this way, but got the same performance issue on the second join tool

1. Use In DB join tool (INNER) to join the 2 tables by personid

2. Use a filter tool on the results from step 1, so only records where actiondate falls between startdate and end date are kept

3. Use In DB join tool (LEFT OUTER) join by actionid, with left side being the original list of action records, so actionids that couldnt be joined to products by personid + date range, are still kept

in DB bottleneck on second join tool.PNG

RyanNewsome

update:

Firslty pushing the results of the filter tool, to a "Write In DB" tool prior to the second join solves the performance bottleneck issue on the second join

This isn't ideal though, because when the workflow is to be run normally, there will be hundreds of millions of rows going to it.

write in db.PNG

Quick Links

Popular Tags

This months top contributors

atcodedog05 19598

Qiu 15910

binu_acs 15783

MarqueeCrew 13710

apathetichell 13703