Why Your Join Is Getting More Records than Expected

Alteryx
Created

While the Join tool is easily one of the most used tools in Alteryx, it can also be one of the most misunderstood. This is even more likely true if a new user hasn’t previously used joins in any other data-manipulating platform or they are joining big tables where they might not be keeping track of the records inside the fields they are joining on.

Thus I will go over three different cases of joining two tables on one common field that you may encounter and what to expect in your results.

The workflow I will go over (attached in this article) is a workflow I made in 10.5 and contains the same visuals that are below.

Case #1: Joining by fields that both have unique records

This is the simplest case and the easiest to understand. In this case we are joining by the fields named “Fruit” in both inputs. The join will align the records of both tables if the records in the “Fruit” fields match.

So the expected joined table (J Output) should look like this:

Fairly straightforward

Case #2: Joining by fields where only one has unique records

For this case we have two records for “Banana” in our Left input. This means the record with “Banana” in the Right input will match up twice and get appended to both records from the Left input. Our Output should look like this:

Case #3: Joining by fields that both have non-unique records

This one can be the most challenging to understand but hopefully after looking at this visually and mathematically you will understand why this makes sense and what you can do to avoid unwanted records.

In this case we have two records in “Fruit” that both have “Banana” in both tables. The same thing happens here just like the case above except now it happens twice because we have two records from the Right input that match up with two records from the Left input. So our output will have a total of 4 records that have the word “Banana” in it.  The output should look like this:

To look at this mathematically for both Case #3 and Case #2, you will know how many records you will get for each record instance by multiplying the amount of records which appear in both tables. In the example above since we are joining on “Fruit” and we see the record banana twice in both tables you can think of it as 2 bananas x 2 bananas=4 bananas.

If we added another row on our Right input like so:

We now have 3 x 2 so we should expect 6 combinations of records with “Banana” in the “Fruit” field

If you have joined two tables and you are getting more records then you expected then this is the most likely culprit of your problem. If one doesn't understand the cases presented above this could certainly drive anyone bananas (pun intended). For best practices, I would follow these two steps before joining any tables:

1st: Check your tables for any duplicate records.

For the case above since we have three duplicates from our Right input we can just use the unique tool and get rid of these duplicates.

The output will be the same exact output as the output from Case #2 since there will only be one record of “Banana” after the unique tool.

2nd: See if joining on multiple fields is the more appropriate option.

Example: We have Table 1 with two records containing ‘Banana’ for “Fruit” and two records containing ‘X’ for “Store ID” with “FruitID” corresponding with “Fruit”.  Table 2 has both the same “Fruit” and “Store ID” fields except now we have a “Store employees” field which pairs up with the “Store ID”.

Table 1                                                                                  Table 2

We want all of this data together so that it all pairs up nicely.  Ideally we should only have four records come out of our join as we want to get the ”Store Employees” and “FruitID” fields in one table. However, if we were to only join on one common field whether it’s “Store ID” or “Fruit”, we would get combinations and the data wouldn’t be aligned properly. See below.

Joining on “Fruit”

In this case we get two records where our Store IDs don’t match up.

Joining on “StoreID”

In this case we get two records where our Fruits won’t match up.

To correct this, we will configure our join to match on both Fruit fields and Store ID fields.

After running this, the joined table should look like this:

When joining with multiple fields we make sure that records from both of these fields match before joining.

An easy way to think about this is imagine that ”Fruit” and “Store ID” are being merged together. Now all of a sudden we have unique records because we would now have Banana X and Banana Y as two unique entities and our join will now become a 1 record x 1 record join.

9 - Comet
Hi, Thanks for this post. It helps to understand the Join Tool better. If you can also help provide the solution to one more variation to Case #3: Joining by fields that both have non-unique records. I have got two tables containing Fruit related information as below - Table 1 Table 2 Fruit Unique_Id Fruit Unique_Id Apple X Apple A Banana K Banana B Banana K Banana D Cantaloupe Z Cantaloupe C On joining both the tables on Fruit Column the output would look like - Fruit Unique_Id Right_Fruit Right_Unique_Id Apple X Apple A Banana K Banana B Banana K Banana D Banana Y Banana B Banana Y Banana D Cantaloupe Z Cantaloupe C Talking of Unique Ids, the requirement is to get either K,B and Y,D together or K,D and Y,B together. i.e. since the source data itself is duplicated, it is OK in output to have duplicates in output. There is no way to know actually whether K is linked to D or B and same goes with Y. Unfortunately there are no other join columns apart from Fruit, any other combination for K,Y,B and D would be deemed incorrect. For e.g. if the Output is K,B and K,D it would imply one record from Table 1 got joined to two records from Table 2. Again K,B and Y,B would imply same as above with the Tables being swapped. Idea is to get 1-1 mapping between the two tables, exact linking of records is not a concern. Thanks, Rohit Bajaj
5 - Atom

Hi OP,

Thanks for walking through the join tool in detail! I'm running into a weird problem still, basically I have 2 files that has the same number of records, they have the same identifiers, 1 file just has a couple more columns I want to bring in and add to the other file. I used the join tool with multiple fields of identifiers to be as specific as possible, but the total resulted in the join (no left or right exceptions) were more than double the amount of records in the 2 files. How is this possible? Can someone give me some pointers?

Alteryx

Hi @amieding, it is probably because the identifiers you are using are not unique. As explained in the post above, if you join on non-unique identifiers the join tool will multiply your records because of the number of possible joins between those non-unique records. I would add a unique tool before both left and right input connection to the join tool and do a unique on ALL the identifiers. If you see duplicates after the "D" output, this will confirm the likely cause.

Hope this helps!

Ozzie

5 - Atom

Thank you Ozzie! I get what you're staying but the 2 files originally came from the same file and the only difference is the extra columns, that's why I'm confused as to why it created a significantly larger record amount. I will try using a unique tool and see if that helps. Thanks again!

6 - Meteoroid

Hi, hoping you can help.  I'm using the join tool and getting exactly the problem you have described here.  I have unique records in my first file, and should have unique records in my second file, but that isn't always the case.  When there are duplicate records the additional field has multiple values (so it would be like your example above where Banana is linked to both store X and store Y).  Where that happens I need to strip out one of the values so that the files both contain unique records, and ideally I would use the Unique tool, but the value which needs to be stripped out depends on information which isn't contained in the file, so I can't see that the Unique tool would work for me.  Is there any way I can get the workflow to show me somehow when the number of records from my first output is different from the number of records in the post join output?

I have tried putting a count on both outputs, putting those through a join tool and attaching Expect Zero Records tools to both the left and right outputs from that, but it feels very clumsy, and I'm sure there must be a better way of picking it up.

Any ideas would be gratefully received.

7 - Meteor

Joining on 2 fields is like doing a Concatenate formula in Excel and then performing a Vlookup

6 - Meteoroid

Hi,

This article is a good one and provides inputs to the working of Join tool.. I am a newcomer to the world of data and am learning a tool for the first time but these articles are making life bit easier for me. Thanks again..