This site uses different types of cookies, including analytics and functional cookies (its own and from other sites). To change your cookie settings or find out more, click here. If you continue browsing our website, you accept these cookies.
In life, there are few things black and white. There are gray areas everywhere and the lines that separate can be a little fuzzy. The same holds true for data – especially when it’s human entered. That’s why we have the Fuzzy Match Tool – if your data isn’t clear as day, you can still get value out of your records by matching them to something a little more standardized.
Understanding Join output anchors and when to add a Union tool: The Join tool anchors are separate subsets of data. You can combine them with a Union tool. The Select functionality of the Join tool applies only to the J anchor, not the R or L anchored data.
The Join Tool is the quintessential tool for data blending within Alteryx. As such, it is also one of the most widely used tools. The Join Tool allows you to join data together from two different sources in two different ways: by record position and by specific fields. Selecting by record position will attach the two datasets together where it will match up each record by the position it is in. Thus record 1 of the left dataset will be in the same row as record 1 on the right in the J output and so on. If one dataset from either side has more records than the other those records will not be joined and they will be placed in there corresponding right or left output (L or R). Joining by specific field will match records up based on a specific field or multiple fields. This article goes into how that option works in more depth and detail. I highly recommend it as a read, as it covers frequent behaviors of the tool that you might run into.
Occasionally you may see one of these errors from the Join Multiple tool. It is a result of Cartesian joins.
A Cartesian join is when you join every row of one table to every row of another table. You can also get one by joining every row of a table to every row of itself. A Cartesian join is very CPU intensive.
For example, if you have four files and each file has the same ID in it twice, that means it will join 2*2*2*2 times on the ID (the field on which you're joining is the key referenced in the error; in this example, it's Field1, and the duplicated value is 12345). The same can be caused by multiple nulls in each file.
After your data prep and investigation, and when you know your data are correct, your choices on how to handle Cartesian joins include:
Allow multidimensional joins: The multidimensional join will occur with no error or warning reported.
Warn on multidimensional joins of more than 16 records: A warning will be reported in the Results window that a multidimensional join has occurred.
Error on multidimensional joins of more than 16 records: An error will be reported in the Results window that a multidimensional join has occurred and downstream processing will stop.
How To: Make Left, Right, and Full outer joins
I noticed that a lot of customers in the Alteryx forums were confused about how to make left, right and full outer joins in Alteryx Designer. This KB entry goal is to make it easy and clear.
This is an aggregation in one place of information spread out in multiple sources: Designer help, various discussions on our forums, and public gallery.
First - what does the Join Tool do?
For now, the join tool does a simple inner join with an equal sign. That's it!
In particular: • R output anchor is NOT the result of a right outer join. I know the R letter can make you think this but it is not. • Similarly: L output anchor is NOT a left outer join. I know that got me at first too! See the table below that shows you what each does for the “Join” tool:
Look at the online help to learn more: https://help.alteryx.com/current/Join.htm Second - Can I do outer joins with the Join Tool then?
Yes! You have to combine it with a Union Tool:
- Join Tool + Union of L&J output anchors = Left outer Join,
- Join Tool + Union of R&J output anchors = Right outer Join,
- Join Tool + Union of R+L+J output anchors = Full Join.
Look at the online help to learn more: https://help.alteryx.com/current/Join.htm Third - Download "Advanced Join" tool from Public Gallery
Now, Alteryx is very flexible as you can create your own tools or macros. You can find a lot of these on the public gallery.
There is a tool called "Advanced Join" that Neil Ryan created that works well for more complex joins (including left, right, full/cross etc). You can also apply a filtering condition.
You can find it here: https://gallery.alteryx.com/#!app/Advanced-Join/547f8df96ac90f0f2ca5e439 Download it, follow instructions to install it. It will show up first in the category "Join" in your Designer toolbar.
Fourth - Use "Multiple Join" for Full outer join
In designer click on "open example" link that appears once you click on the "Multiple Join" tool for detailed examples.
As Alteryx improves the "Join Tool", we might see an easier way to do outer joins. But for now these are the few options you have.
You monitor the mileage of multiple trucks as they deliver shipments over the course of a week and record additional information regarding each truck in a file (Truck Metrics). Each truck’s cumulative mileage per day is recorded in a separate file (Truck Mileage). Your goal is to update the mileage-related fields in Truck Metrics with the values recorded in Truck Mileage. Note: today is Tuesday so only fields for Monday and Tuesday will be updated in in the Truck Metrics file.
Manually Select Fields to Update
Whether the data is uniquely identified by a Truck Number (or Record ID) or identically ordered in both files, data from Truck Metrics (the table to be updated) and Truck Mileage (the data used for updating) can be Joined together. Then, using the Select functionality within the Join tool, I can manually choose the fields from Truck Mileage that I want to use to replace the fields that need to be updated (Figure 1).
Figure 1: Fields from Truck Mileage (Yellow) replace fields from Truck Metrics (Blue). Note that fields in yellow are selected while fields in blue have been deselected. Fields that need to be included from Truck Metrics (Red) remain selected.
Fantastic! A simple, straightforward way to update fields! But, as any analyst knows, working with data is rarely simple or straightforward. What if you’re dealing with 20 fields that need to be updated, not just 2? In that case, manually selecting fields to update is not only tedious but also error-prone. For these types of situations, I recommend a process that allows for a more dynamic approach.
'Verticalize' the Data to Dynamically Update Fields
Transposing, or ‘verticalizing’ data, allows for a more dynamic workflow when you have unknowns in your processing. In a scenario such as this one, you may have an unknown or changing number of fields that will need be updated in Truck Metrics. Using this approach, we’ll first Transpose both data sets to configure the Field Name and its associated value in a single row (Figure 2).
Figure 2: The data is transposed from Truck Mileage and Truck Metrics. The highlighted fields in Truck Metics indicates which fields will be updated, as the same field exists in Truck Mileage.
Then, we’ll Join our datasets based on two fields: Truck Number AND Name. This ensures that fields in Truck Mileage will match to the correct fields in Truck Metrics, assuming the fields names in both tables are named in the same way. The only selecting we’ll have to do is to make sure all fields from the Right data source (in this case, Truck Metrics) are deselected (Figure 3). This allows that, in the situation of matched Truck Numbers and field names, the updated values will be used.
Figure 3: The Joined fields indicate the fields that exist in both Truck Mileage and Truck Metrics. Fields in yellow (Truck Mileage) are selected to reflect updated values in downstream tools. Fields in blue (Truck Metrics) are deselected.
Note that any unmatched fields have fallen out of the Right side of the Join:
To add them back into the data stream, simply Union the Center and Right Joins together, setting the tool to “Auto Configure by Name”. Then, to rearrange the data into its original format, use the Cross Tab tool (Figure 4). And, voila! Your data is updated!
Figure 4: The updated fields are highlighted in yellow. All other fields originally included in Truck Metrics are included the in the final results as well.
Let's talk about how to replicate the WHERE EXISTS functionality of SQL within Alteryx.
Example 1: UNION - Add rows from table#1 to table#2 if the key value of table#1 does not exist in table#2. If desired, the combined data set could be joined with a third data set, but only if the key value in table#3 does not exist with only one key column.
Example 2: SELECT/UPDATE records from table #1 based on the contents of table #2. The statement below generates the names of customers who had orders during 2016.
from customers c1
where 1 = 1
and exists (select *
from customer_orders c2
where 1 = 1
and c1.customer_number = c2.customer_number
and c2.order_year = 2016
The SQL EXISTS condition is used in combination with a subquery and is considered to be met if the subquery returns at least one row. It can be used in a SELECT, INSERT, UPDATE, or DELETE statement.
The syntax for the EXISTS condition in SQL is: WHERE EXISTS (subquery)
The subquery is a SELECT statement. If the subquery returns at least one record in its result set, the EXISTS clause will evaluate to true and the EXISTS condition will be met. If the subquery does not return any records, the EXISTS clause will evaluate to false and the EXISTS condition will not be met.
Example 1: Take the unmatched records from the source you are appending data from out of a join between the two data sets and union it back to the data set you are appending to.
Example 2: Create your "look up list" out of a filter for whatever you set as the condition. In this example, a year that is in the data. Join the filtered data back with the other data source. Deselect the data that comes from the filtered source.
Please see the attached workflow: Where Exists Question.yxmd.
In SQL, you can join on a range, using code similar to the below snippet.
SELECT Column, FruitName, StartDateTime, EndDateTime
INNER JOIN dbo.Greek_Alphabet ON dbo.DateTime BETWEEN dbo.FruitList_StartDateTime AND dbo.FruitList_EndDateTime
Alteryx does not have a "join on range" command, but you can get the same result by appending all records and using a filter tool.
For Advanced Joins: when a value from one file is between (< or >) a value from another file, visit and download the Advanced Join Macro.
It's not uncommon to have a situation where you need to conditionally join one dataset to another. Perhaps the most common is when you want to join one file to another where a date from the first file is between, greater than or less than a date(s) on a second file. The default tools found in the Join section of the tool palette don't provide a simple way of doing this (it can be done, but you need to string several tools together to make it work. There is a better way! Read on...).
There is great macro available in the public Alteryx gallery called Advanced Join (find it here, but spoiler alert... you can download the attached workflow which includes this macro so you don't have to go to the gallery to get it). The Advanced Join gives you greater latitude than the Join tool. Most notably, you can select records from file A that are unique to A AND intersect with file B. Now you may be thinking, “I can do that by unioning the records from an inner join with records from a left join,” and you would be correct. But it takes two tools to do what one Advance Join does. More importantly, the Advanced Join allows you to put a conditional statement on your join which is something you can't do with the Join tool. And it’s this feature - the ability to use conditional statements in a join - which we will focus on for our purpose here.
Let's get into some examples. I have a file, 'Fruit List’, which contains data about various fruits. This file contains a Column Id, a Fruit Name, a Start DateTime and an End DateTime:
I have a second file, 'Greek Alphabet’, which contains a Column Id, a Greek letter and a Datetime.
I want to join the two files on ColumnId where the Datetime from Greek Alphabet (file B) is BETWEEN Start Datetime and End Datetime from Fruit List (file A). Here's the workflow and a screenshot of how to configure the Advanced Join:
And here are what my results look like:
Only one record from Greek Alphabet matched one from Fruit List on ColumnId where Greek Alphabet's Datetime was between Fruit List's Start Datetime and End Datetime.
In the next example, I have the same Fruit List file and want to join it another file, Greek Alphabet that contains just one datetime filed:
The first thing to note is both files have a field called 'DateTime.' We'll want to give these unique names to avoid ambiguity when we write our conditional state in the Advance Join configuration.
I want to join both files on ColumnId but only when DateTime from Fruit List is LESS THAN DateTime from Greek Alphabet:
And the results...:
Let's look at one last example. This time, I'm going to use the Fruit List and Greek Alphabet files used in the first example (Fruit List has a Start DateTime and an End DateTime). I'm interested in matching records where DateTime from Greek Alphabet is BETWEEN Start Datetime and End DateTime from Fruit List. I'm not matching on ColumnId this time.
For the Advanced Join configuration, I'm going to cross join my files. (CAUTION: the resulting join could contain as many rows as the product of the number of rows of the incoming datasets - a.k.a. Cartesian join - depending on how restrictive your conditional is. This means if you're joining 2 datasets that contain a million records each, the resulting dataset could contain as many as one trillion records! ). If I had wanted to match on ColumnId, I would have had to do that separately using a Join tool. The cross join option only allows you to apply a conditional statement:
Results from our 3rd example:
Notice how 10 records from Greek Alphabet were joined to just one record from Fruit List.
The Advanced Join tool can save you time and a lot of headaches when you want to join files using a conditional statement. It has some limitations - you can only join two datasets and include one conditional statement per tool, cross join limitation mentioned above - but Advanced Join provides greater capability and flexibility than the standard Join tool.
In a recent article (Create an Indexed Map), I mentioned the indexed maps found in the Rand McNally Road Atlas. Well, also found in the Road Atlas are mileage charts, or, distance matrices. These matrices can be easily created in Alteryx. The example below will provide the distance between every store in a dataset to every store in that same dataset. Here's how we did it.
Distance Matrix/Mileage Chart Example
Create a Cartesian Join of your data
Using the Append Tool, create a Cartesian Join of all of the records in the dataset.
This will give you a combination of every record to every record in your dataset.
Don't forget to "Allow All Appends" (learn more about creating Cartesian Joins here).
Measure the distances between all record combinations.
Using the Distance Tool, measure the distances between the point combinations.
Flip the data into columns using the Cross Tab Tool.
Use a Select Tool to change the Store Number column to a string.
Doing this will prevent the Table Tool from adding commas to this field.
Use a Table Tool to create a formatted table.
Add a Column Rule to the Store Number field to format the column as bold.
Create a Row Rule in order to force a one decimal place to the distance data.
Here's the workflow which you can also find attached:
That's it. Feel free to leave any comments or ask any questions.
One common reason can be due to mixed cases within the data of the join fields between the two datasets. For example, and as shown below, "Orange" from the first dataset is not going to join to "orange" in the second dataset. And this does not just apply to the leading character. Another common reason can be due to the existence of trailing or leading spaces. As also shown in the example below, "Plum " is not going to join with "Plum". The little red triangle in the upper right-hand corner of "Plum" indicates that there are leading and/or trailing spaces.
While the Join tool is easily one of the most used tools in Alteryx, it can also be one of the most misunderstood. This is even more likely true if a new user hasn’t previously used joins in any other data-manipulating platform or they are joining big tables where they might not be keeping track of the records inside the fields they are joining on.
Did you ever play that party-game called 6-degrees of separation where you have to figure out how you are connected to someone famous in less than 6 relationships (or the movie version – 6 degrees of Kevin Bacon)? Well that game just got a whole lot easier with the Make Group Tool!
The Find Replace Tool is one of those tools that goes relatively unused and uncelebrated until you stumble into a data blending technique that would be extremely difficult without it – at which point, it becomes your favorite tool in the Designer. You can find it in the Join Category and it’ll make easy string substitutions in your data that would otherwise require herculean effort to work around. Today, we celebrate Find Replace as a hero.