Alteryx Designer Desktop Discussions

SOLVED

Advanced Join Question

JohnJPS
15 - Aurora

I ran into a situation while translating some code from SAS/SQL to Alteryx: it left-joined tables on three conditions. The first was a simple join, TableA.x1 = TableB.y1; the second required that TableA.x2 be between, say, TableB.min2 and TableB.max2; and the third was TableA.x3 >= TableB.y3. (Naturally you could imagine even more nefarious scenarios, but this will suffice for now.)

 

The Advanced Join macro could almost solve this; the catch was that I needed to join back to the original TableA in order to ensure the outer join still included everything from TableA. However, my real-life TableA has over 100 columns, and the join back would need all of them as keys, which is more pointing and clicking than is reasonable.

 

 

I also tried using the R tool (my favorite) with sqldf, and that works; however, in addition to the 100-plus columns I also had over a million rows, so it not only took a while to complete (a few minutes, not that bad really) but also gave me an error that left me a little nervous, even if it's perhaps nothing to worry about: the same error as seen here. So it worked, but it's not ideal.
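
For concreteness, the sqldf call was something along these lines (a sketch using the simplified column names from the attached workflow, not the real 100-plus columns):

```r
# Sketch of the sqldf approach, using the simplified column names
# (TableA: x1, x2, x3; TableB: y1, min2, max2, y3).
library(sqldf)

result <- sqldf("
  SELECT a.*, b.y3
  FROM TableA a
  LEFT JOIN TableB b
    ON  a.x1 = b.y1
    AND a.x2 BETWEEN b.min2 AND b.max2
    AND a.x3 >= b.y3
")
```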

 

I simplified and generalized all this in the attached workflow, which looks like:

[Workflow screenshot: Capture.PNG]

 

This doesn't have the sheer number of columns or rows, but illustrates the complexity of the join conditions.

 

So the question is, is there an easier way to do left joins of arbitrary complexity... that also scales well to both many columns and many rows?

 

9 REPLIES
MarqueeCrew
20 - Arcturus

@JohnJPS,

 

I looked at your module and found that NULL values might be inflating your results. You've got 3 records with NULLs for min2, max2, and y3 that might not qualify under your rule set.

 

My K.I.S.S. approach to this would be to JOIN on the EQUALITY of TableA.x1 == TableB.y1 and then filter for the "Between" condition and the "Greater" condition being TRUE.
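
In R terms, that logic is just an equality merge followed by a subset (a sketch, assuming data frames named after the example tables):

```r
# The equality join (the Join tool), then the two remaining
# conditions applied as a filter (the Filter tool).
joined <- merge(TableA, TableB, by.x = "x1", by.y = "y1")
kept   <- subset(joined, x2 >= min2 & x2 <= max2 & x3 >= y3)
```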

 

Just a thought,

 

Mark

Alteryx ACE & Top Community Contributor

Chaos reigns within. Repent, reflect and restart. Order shall return.
Please subscribe to my YouTube channel.
JohnJPS
15 - Aurora

@MarqueeCrew (nice current icon, by the way),

I believe they want to retain all rows from the first table, so the NULLs are expected; I probably should have added another column in TableB that was "some data we want to have if it's there, based on the join conditions."

 

Although... that does rephrase the problem: essentially these are just lookup tables with complex lookup conditions. Perhaps I can rewrite the whole approach based on that. Thanks for your response!

 

jdunkerley79
ACE Emeritus

This feels similar to the VLookup question that comes up in the community all the time, which I think @JohnJPS and I had a go at using the R tool.

 

Looking inside the Advanced Join macro, it does a Cartesian join, so there's no way it can be performant on a massive dataset.
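
In R terms, the macro's approach amounts to something like this sketch:

```r
# A cross join of every TableA row against every TableB row, with the join
# conditions applied afterwards as a filter; the intermediate result has
# nrow(TableA) * nrow(TableB) rows, which is what kills performance.
cross  <- merge(TableA, TableB, by = NULL)  # Cartesian product in base R
result <- subset(cross, x1 == y1 & x2 >= min2 & x2 <= max2 & x3 >= y3)
```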

 

This is one I have been meaning to try within the .NET API, but even there it will have difficulties due to needing to cache large datasets in memory to do the join. Will hopefully get back to trying it this weekend.

 

Looking at the specific case, could you generate rows for all the x2 values, do a standard join on x1 and x2, and then use a filter for the x3 clause?
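
In R terms, the idea would be something like this (a sketch assuming x2 is integer-valued and that NULL ranges in TableB are dropped first):

```r
# Expand each TableB row into one row per integer x2 value in [min2, max2],
# so the Between condition becomes a plain equality join on (x1, x2).
TableB2 <- TableB[!is.na(TableB$min2) & !is.na(TableB$max2), ]
expanded <- do.call(rbind, lapply(seq_len(nrow(TableB2)), function(i) {
  b <- TableB2[i, ]
  data.frame(y1 = b$y1, x2 = seq(b$min2, b$max2), y3 = b$y3)
}))

joined <- merge(TableA, expanded, by.x = c("x1", "x2"), by.y = c("y1", "x2"))
result <- subset(joined, x3 >= y3)  # the remaining x3 clause as a filter
```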

 

Again, just some quick thoughts. It feels worth generating a representative-scale dataset; then we can all see who can make the best solution :)

JohnJPS
15 - Aurora

@jdunkerley79

This data set (downloaded from Kaggle's Homesite Quote Conversion competition) isn't quite big enough to break R, but it did break the Advanced Join macro (or I did something stupid, which is entirely possible). I first converted it to a .yxdb and am using that in the attached workflow. Even though it doesn't break R, it should be large enough to at least compare alternative solutions. My "TableB" is still just dummy, cooked-up data.

jdunkerley79
ACE Emeritus

A simple Generate Rows example takes 10.9s to run for the Kaggle set from CSV.

I'm sure it will be quicker if I save it as a YXDB.

8s with a YXDB source on my Surface.

JohnJPS
15 - Aurora

@jdunkerley79

I think we just nailed it! Your version lacks the "left outer join" aspect, in that it doesn't retain all rows of the original file. However (and I'm not sure how I missed this previously) all we need to do is generate a record ID on the file (unique for every row) and join back on that key after our complex conditions are met. This has been done in the attached module.
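
Just to illustrate the logic (in the workflow it's a RecordID tool, the join on the complex conditions, and a Join tool back on RecordID; the sqldf here is only for brevity):

```r
library(sqldf)

# Tag every TableA row with a unique key, run the complex join to pick up
# just that key plus the TableB data we want, then left-join back on the
# single key to restore all rows and all 100+ columns.
TableA$RecordID <- seq_len(nrow(TableA))

matches <- sqldf("
  SELECT a.RecordID, b.y3
  FROM TableA a
  JOIN TableB b
    ON  a.x1 = b.y1
    AND a.x2 BETWEEN b.min2 AND b.max2
    AND a.x3 >= b.y3
")

result <- merge(TableA, matches, by = "RecordID", all.x = TRUE)
```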

 

Pretty exciting: this runs really fast and fully accomplishes a left outer join using arbitrarily complex conditions.

 

(Aside: the Auto Field tool will shrink the .yxdb considerably and get everything running lightning quick, too.)

 

jdunkerley79
ACE Emeritus

One slight tuning to get that little bit more...

 

Use a Union to get back to the full record set; then there's no need for a record ID.
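
Roughly, in R terms (a sketch; the catch with rows that match on x1 but fail the filter is picked up in the next replies):

```r
# Union the rows that survive the filter with the TableA rows that never
# matched on x1 at all. Rows that matched on x1 but failed the
# Between/Greater filter are dropped here, which is the catch noted below.
joined   <- merge(TableA, TableB, by.x = "x1", by.y = "y1")
matched  <- subset(joined, x2 >= min2 & x2 <= max2 & x3 >= y3)
no_match <- TableA[!(TableA$x1 %in% TableB$y1), ]
result   <- rbind(matched[, names(TableA)], no_match)
```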

 

With the auto-fielded YXDB it shaves about 10% off (3s vs. 3.3s on mine).

 

I still want to build a decent advanced join tool... this trick won't work on a large domain of values.

JohnJPS
15 - Aurora

Using the union also required a distinct tool to pare down the rows it returned (unless I missed a setting); in the end, for me, the RecordID approach was slightly faster (3.7s worst case vs. 4.2s best case with the union, over 5 runs of each, on a ho-hum laptop with no SSD).

jdunkerley79
ACE Emeritus
ACE Emeritus

I don't think you'd get dupes, but you will be losing the rows in the filter, so yep, it won't quite work.
