This site uses different types of cookies, including analytics and functional cookies (its own and from other sites). To change your cookie settings or find out more, click here. If you continue browsing our website, you accept these cookies.
05-26-2016 01:59 PM - edited 08-03-2021 01:11 PM
While the Join tool is easily one of the most used tools in Alteryx, it can also be one of the most misunderstood. This is even more likely true if a new user hasn’t previously used joins in any other data-manipulating platform or they are joining big tables where they might not be keeping track of the records inside the fields they are joining on.
Thus I will go over three different cases of joining two tables on one common field that you may encounter and what to expect in your results.
The workflow I will go over (attached in this article) is a workflow I made in 10.5 and contains the same visuals that are below.
Case #1: Joining by fields that both have unique records
This is the simplest case and the easiest to understand. In this case we are joining by the fields named “Fruit” in both inputs. The join will align the records of both tables if the records in the “Fruit” fields match.
So the expected joined table (J Output) should look like this:
Fairly straightforward
Case #2: Joining by fields where only one has unique records
For this case we have two records for “Banana” in our Left input. This means the record with “Banana” in the Right input will match up twice and get appended to both records from the Left input. Our Output should look like this:
Case #3: Joining by fields that both have non-unique records
This one can be the most challenging to understand but hopefully after looking at this visually and mathematically you will understand why this makes sense and what you can do to avoid unwanted records.
In this case we have two records in “Fruit” that both have “Banana” in both tables. The same thing happens here just like the case above except now it happens twice because we have two records from the Right input that match up with two records from the Left input. So our output will have a total of 4 records that have the word “Banana” in it. The output should look like this:
To look at this mathematically for both Case #3 and Case #2, you will know how many records you will get for each record instance by multiplying the amount of records which appear in both tables. In the example above since we are joining on “Fruit” and we see the record banana twice in both tables you can think of it as 2 bananas x 2 bananas=4 bananas.
If we added another row on our Right input like so:
We now have 3 x 2 so we should expect 6 combinations of records with “Banana” in the “Fruit” field
If you have joined two tables and you are getting more records then you expected then this is the most likely culprit of your problem. If one doesn't understand the cases presented abovethis could certainly drive anyone bananas (pun intended). For best practices, I would follow these two steps before joining any tables:
1st: Check your tables for any duplicate records.
For the case above since we have three duplicates from our Right input we can just use the unique tool and get rid of these duplicates.
The output will be the same exact output as the output from Case #2 since there will only be one record of “Banana” after the unique tool.
2nd: See if joining on multiple fields is the more appropriate option.
Example: We have Table 1 with two records containing ‘Banana’ for “Fruit” and two records containing ‘X’ for “Store ID” with “FruitID” corresponding with “Fruit”. Table 2 has both the same “Fruit” and “Store ID” fields except now we have a “Store employees” field which pairs up with the “Store ID”.
Table 1 Table 2
We want all of this data together so that it all pairs up nicely. Ideally we should only have four records come out of our join as we want to get the ”Store Employees” and “FruitID” fields in one table. However, if we were to only join on one common field whether it’s “Store ID” or “Fruit”, we would get combinations and the data wouldn’t be aligned properly. See below.
Joining on “Fruit”
In this case we get two records where our Store IDs don’t match up.
Joining on “StoreID”
In this case we get two records where our Fruits won’t match up.
To correct this, we will configure our join to match on both Fruit fields and Store ID fields.
After running this, the joined table should look like this:
When joining with multiple fields we make sure that records from both of these fields match before joining.
An easy way to think about this is imagine that ”Fruit” and “Store ID” are being merged together. Now all of a sudden we have unique records because we would now have Banana X and Banana Y as two unique entities and our join will now become a 1 record x 1 record join.
Hi OP,
Thanks for walking through the join tool in detail! I'm running into a weird problem still, basically I have 2 files that has the same number of records, they have the same identifiers, 1 file just has a couple more columns I want to bring in and add to the other file. I used the join tool with multiple fields of identifiers to be as specific as possible, but the total resulted in the join (no left or right exceptions) were more than double the amount of records in the 2 files. How is this possible? Can someone give me some pointers?
Thanks in advance!
Hi @amieding, it is probably because the identifiers you are using are not unique. As explained in the post above, if you join on non-unique identifiers the join tool will multiply your records because of the number of possible joins between those non-unique records. I would add a unique tool before both left and right input connection to the join tool and do a unique on ALL the identifiers. If you see duplicates after the "D" output, this will confirm the likely cause.
Hope this helps!
Ozzie
Thank you Ozzie! I get what you're staying but the 2 files originally came from the same file and the only difference is the extra columns, that's why I'm confused as to why it created a significantly larger record amount. I will try using a unique tool and see if that helps. Thanks again!
Hi, hoping you can help. I'm using the join tool and getting exactly the problem you have described here. I have unique records in my first file, and should have unique records in my second file, but that isn't always the case. When there are duplicate records the additional field has multiple values (so it would be like your example above where Banana is linked to both store X and store Y). Where that happens I need to strip out one of the values so that the files both contain unique records, and ideally I would use the Unique tool, but the value which needs to be stripped out depends on information which isn't contained in the file, so I can't see that the Unique tool would work for me. Is there any way I can get the workflow to show me somehow when the number of records from my first output is different from the number of records in the post join output?
I have tried putting a count on both outputs, putting those through a join tool and attaching Expect Zero Records tools to both the left and right outputs from that, but it feels very clumsy, and I'm sure there must be a better way of picking it up.
Any ideas would be gratefully received.
Joining on 2 fields is like doing a Concatenate formula in Excel and then performing a Vlookup
Hi,
This article is a good one and provides inputs to the working of Join tool.. I am a newcomer to the world of data and am learning a tool for the first time but these articles are making life bit easier for me. Thanks again..
@OzzieThis is a great summary....
Just one question, I've been using other tools and traditionally the fields I've joined on need to be the same size (ex 10 character string). Does Alteryx handle this differently? Can I join a 10 Character string with a 12 character string or should I be performing more cleanup. In this example I also know that there are no strings longer that 10 characters.
Thanks! George
I have a strange situation where i am trying to join a true and false filter output against different criteria and the sum of my left, middle and right output is less than the total incoming values from the filter input - what could be causing that - any clues?
@ck2019
Sorry, I'm not great at visualizing what you're doing based on that description. Would you be able to provide a screenshot or better yet a workflow with data that shows what you're seeing?