Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Cartesian Product Nature of Inner Join

JohnJPS
15 - Aurora

Here is a simple illustration of how the inner portion of a join will actually turn into a cartesion join (cross product) on any selected fields not taken care of in the join conditions. The example has two small data sources of 3 fields and 4 rows each; these are joined on two of the fields, and all outputs are selected. The "Left" and "Right" portions are as expected, but the "inner section" turns into a mess of numerous rows.

 

This may be obvious to anyone who has written a lot of joins (e.g. in SQL maybe), but the point is that these concerns remain very real even when using a tool that manages so many other concerns.

 

In short - make sure that row counts coming out of a join are exactly as expected.  If they aren't, there is probably something of this nature taking place, and it is probably not good.  In this example, it's trivial to see, but this example is based upon review of a "real" workflow containing a similar situation that would have been missed had the row counts gone unquestioned.

 

joinExample.png

4 REPLIES 4
JohnJPS
15 - Aurora

(More of a comment than a question... sorry.)

RodL
Alteryx Alumni (Retired)

John,

You raise a good underlying point in that whatever tool you are using to analyze your data, it is important to actually have an understanding of your data. In your example case, you have two tables you are joining with multiple records in each table for your primary key (in your case F1 & F2 make up the primary key), essentially a many-to-many relationship which will provide the results you show as expected.

 

It's easy to see this in your example, but can become problematic when you are dealing with large data sets. What's good about Alteryx is it provides the analyst some easy ways to understand their data (even in multi-million record sets). For example in this case, you could use either the Unique or Summarize tools to determine whether you have duplicate records represented by the primary key (and make adjustments to your process to account for this).

 

Thanks for raising the point that if you aren't paying attention to the data you are working with, you can get yourself into a lot of trouble. 

Rod

dataMack
12 - Quasar

One of my favorite things about Alteryx is how you can have multiple outputs from a given node.  This allows you to build checks and validations into your workflow.  In your example above, one might use a summarize tool or record count tool before and after the join and then compare those values to ensure you are getting the results you expect.

 

Always a good practice to make sure you understand your data and have some understanding of the results you expect to see.  Again, this is where Alteryx shines in comparison to other tools becuase I can view my data at every step along the way (thanks again for browse everywhere in 10.0!)

Phil_L57
7 - Meteor

Hi John,

 

I really like your post about the issue with Cartesian joins. I've just made a macro that can be used on a dataset to decipher which fields are causing the duplication. It's not always obvious why the data has duplicated and I'm hoping this tool will be useful for people struggling with this issue. The link to the Macro is here:

https://gallery.alteryx.com/#!app/One-to-Many-Field-Locator/573f22c53df7da0ed85985a9

Labels