The Find Replace Tool is one of those tools that goes relatively unused and uncelebrated until you stumble into a data blending technique that would be extremely difficult without it – at which point, it becomes your favorite tool in the Designer. You can find it in the Join Category and it’ll make easy string substitutions in your data that would otherwise require herculean effort to work around. Today, we celebrate Find Replace as a hero.
While the Join tool is easily one of the most used tools in Alteryx, it can also be one of the most misunderstood. This is even more likely if a new user hasn’t previously used joins in any other data-manipulation platform, or if they are joining big tables where they might not be keeping track of the values in the fields they are joining on.
The Fuzzy Match Tool provides some pretty amazing flexibility for string joins with inexact values – usually in the case of names, addresses, phone numbers, or zip codes because many of the pre-configured match styles are designed around the formats of those types of string structures. However, taking advantage of the custom match style and carefully configuring the tool specific to human entered keyword strings in your data can also allow you to use the loose string matching feature of the tool to match those values to cleaner dictionary keyword strings. If done properly, it can help you take otherwise unusable strings and, matching by each individual word, recombine your human entered data to a standardized format that can be used in more advanced analyses.
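If you want to prototype that word-by-word idea outside of Designer first, here is a minimal Python sketch of the same concept using the standard library's difflib. The keyword dictionary, cutoff value, and sample strings below are made-up assumptions for illustration, not anything the Fuzzy Match Tool itself exposes:

# Conceptual sketch only: loose word-by-word matching of human-entered text
# against a clean keyword dictionary, roughly analogous to a custom match style.
import difflib

KEYWORD_DICTIONARY = ["stainless", "steel", "bracket", "hexagon", "bolt"]  # hypothetical dictionary

def standardize(raw_text, cutoff=0.7):
    """Replace each human-entered word with its closest dictionary keyword, if any."""
    cleaned_words = []
    for word in raw_text.lower().split():
        matches = difflib.get_close_matches(word, KEYWORD_DICTIONARY, n=1, cutoff=cutoff)
        cleaned_words.append(matches[0] if matches else word)
    return " ".join(cleaned_words)

print(standardize("stanless stell brackt"))  # expected roughly: "stainless steel bracket"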
In life, there are few things black and white. There are gray areas everywhere and the lines that separate can be a little fuzzy. The same holds true for data – especially when it’s human entered. That’s why we have the Fuzzy Match Tool – if your data isn’t clear as day, you can still get value out of your records by matching them to something a little more standardized.
The Auto Field Tool: a tool so easy you don’t have to do anything – just put it on your canvas and voilà, automatically optimized data types. If you’re running into data type related issues and errors in your workflows, or just looking to add some speed or reduce the disk space your data is hoarding – look no further than the Preparation Tool Category’s Auto Field Tool, which reads through all the records of an input and sets each field type to the smallest possible size relative to the data contained within the column.
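For a rough sense of what “smallest possible size” means, here is a hedged pandas sketch of the same idea – scan each column’s values and downcast to the narrowest type that still fits. It is only an illustration of the concept, not how the Auto Field Tool is actually implemented, and the sample columns are invented:

# Illustrative sketch: downcast each column to the smallest workable type.
import pandas as pd

df = pd.DataFrame({"qty": ["1", "42", "7"], "price": ["19.99", "3.50", "7.25"], "region": ["N", "S", "E"]})

for col in df.columns:
    numeric = pd.to_numeric(df[col], errors="coerce")
    if numeric.notna().all():
        # Downcast integers/floats to the narrowest width that holds the data.
        df[col] = pd.to_numeric(numeric, downcast="integer" if (numeric % 1 == 0).all() else "float")

print(df.dtypes)  # qty becomes int8, price float32, region stays as text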
Question How can I create a Cartesian join of my data?
Answer A Cartesian join, also known as a Cartesian product, is a join of every row of one table to every row of another table. For example, if table A has 100 rows and is joined with table B, which has 1,000 rows, a Cartesian join will result in 100,000 rows. This type of join can be useful when you need to produce a dataset that contains every combination of two or more tables. You can continue to add to the join by adding additional Append Tools. Simply connect the output of the first Append Tool to the Target Input of a second Append Tool and connect the third table to the Source Input of the second Append Tool. Continue this configuration for as many tables as necessary.
In the example below, a car dealer needed a list of every combination of model, exterior color, interior color and options package. Here is what the four table Cartesian join looks like:
The tables contain the following data:
A sample of the results from the workflow pictured above looks like this:
By default, the Append Tool is set to error on appends of more than 16 records in order to protect the user from inadvertently producing an excessive number of records. If your data produces more appends than this, you can change this setting as shown below to allow all appends:
This workflow, created in 10.6, is attached.
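If you like to sanity-check the record math outside of Designer, here is a small pandas sketch of the same four-table Cartesian join. The table contents are invented for illustration, and merge(how="cross") assumes pandas 1.2 or later:

# Chain cross joins the same way you would chain Append Fields tools.
import pandas as pd

models = pd.DataFrame({"Model": ["Sedan", "Coupe"]})
exterior = pd.DataFrame({"Exterior": ["Red", "Black", "White"]})
interior = pd.DataFrame({"Interior": ["Tan", "Gray"]})
packages = pd.DataFrame({"Package": ["Base", "Sport"]})

combos = (models.merge(exterior, how="cross")
                .merge(interior, how="cross")
                .merge(packages, how="cross"))

print(len(combos))  # 2 * 3 * 2 * 2 = 24 combinations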
VLOOKUP is one of the most commonly used functions in Excel. VLOOKUP takes a lookup value and finds that value in the first column of a lookup range. Complete the function's syntax by specifying the column number to return from the range. In other words, VLOOKUP is a join: one column of data is joined to a specified range to return a set of values from that range.
The objective of this posting is twofold:
1. Demonstrate how a VLOOKUP can be done using Alteryx.
2. Demonstrate how a VLOOKUP can actually be easier to do in Alteryx.
For the sake of argument, let's say we have two worksheets in one Excel file. Sheet 1 contains a master list of sales data. The data looks like this:
Sheet 2 contains a select list of sales people and contains First Name and Last Name:
Using a VLOOKUP function, we want to take the list of sales people, match it to sales people found in Sheet 1 and return sales data for each sales person. Notice our list of sales people in Sheet 2 has two sales people named 'Angelina'. Sheet 1 has multiple people named 'Angelina' and 'Michael'. Using First Name alone will not return reliable results using VLOOKUP. Further, matching on Last Name - or even a concatenation of both First Name and Last Name - isn't always reliable either. There are a lot of 'Michael Thomas's in the world and several may exist in your data range. The best way to complete our VLOOKUP is to use a unique identifier for each sales person. But again, for the sake of argument, let's say each First Name and Last Name combination produces a unique name/identifier.
The first thing we need to do is concatenate First Name and Last Name in a new field.
We'll do the same thing with our data in Sheet 2. For VLOOKUPs to work properly, the data must be sorted. In our case, we'll sort both Sheet 1 and Sheet 2 in ascending order on 'Concatenated Name':
Now we're ready to use our =VLOOKUP function:
It worked! Let's see how to do the same thing in Alteryx. (Hint: there's more than one way to do it.)
We'll start by bringing in our data from Sheets 1 and 2:
As with the Excel example, we'll concatenate First Name and Last Name in a new field. For clarity, let's name the concatenated name in Sheet 1 'Full Name' and Sheet 2 'New Name'. We'll do this using a Formula Tool expression:
Finally, let's attach a Find Replace Tool where Sheet 2 attaches to the 'F' input of Find Replace and Sheet 1 attaches to the 'R'.
The configuration for the Find Replace looks like this:
Notice 'Sales' is selected under 'Append Field(s) to Record'. When we run the workflow, we get the sales data by sales person just like we did in Excel:
But there's an even simpler way to perform a VLOOKUP in Alteryx! Simply join the two Sheets with a Join Tool.
Take a look at how the Join is configured:
We joined our data on First Name and Last Name without having to concatenate the two fields first. Further, we selected the sales data we want returned in the same Join Tool.
Alteryx has an added advantage over Excel's VLOOKUP function. Let's say you want to return Sales and Opportunities. With Alteryx, this can be handled in a single Join. With Excel, we would need to do multiple VLOOKUPs.
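For readers who think in code, here is a hedged pandas sketch of that Join Tool approach – matching on First Name and Last Name and returning Sales and Opportunities in a single pass. The column names and sample values are assumptions for illustration only:

# Join-style "VLOOKUP": match on two key fields, return several columns at once.
import pandas as pd

sheet1 = pd.DataFrame({
    "First Name": ["Angelina", "Michael", "Angelina"],
    "Last Name":  ["Jones", "Thomas", "Smith"],
    "Sales":      [1200, 850, 430],
    "Opportunities": [7, 3, 5],
})
sheet2 = pd.DataFrame({
    "First Name": ["Angelina", "Michael"],
    "Last Name":  ["Jones", "Thomas"],
})

result = sheet2.merge(
    sheet1[["First Name", "Last Name", "Sales", "Opportunities"]],
    on=["First Name", "Last Name"],
    how="inner",   # the J output of the Join Tool
)
print(result)

No concatenation, no sorting, and no separate lookup per returned column – which is exactly the advantage described above.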
With the Input Data Tool (master it here), our Connector Tools, and the Download Tool (master it here), data in the Designer is plentiful. But what about manually entered user data? In analytics, we’re often trying to avoid human-entered data (unless we’re cleaning it) because it is more prone to error. In spite of the risks, there are a number of situations in which manually entered data can be useful in Alteryx. However, it’s dangerous to go alone; take the Text Input Tool and simplify those instances with the techniques below:
You monitor the mileage of multiple trucks as they deliver shipments over the course of a week and record additional information regarding each truck in a file (Truck Metrics). Each truck’s cumulative mileage per day is recorded in a separate file (Truck Mileage). Your goal is to update the mileage-related fields in Truck Metrics with the values recorded in Truck Mileage. Note: today is Tuesday, so only the fields for Monday and Tuesday will be updated in the Truck Metrics file.
Manually Select Fields to Update
Whether the data is uniquely identified by a Truck Number (or Record ID) or identically ordered in both files, data from Truck Metrics (the table to be updated) and Truck Mileage (the data used for updating) can be Joined together. Then, using the Select functionality within the Join tool, I can manually choose the fields from Truck Mileage that I want to use to replace the fields that need to be updated (Figure 1).
Figure 1: Fields from Truck Mileage (Yellow) replace fields from Truck Metrics (Blue). Note that fields in yellow are selected while fields in blue have been deselected. Fields that need to be included from Truck Metrics (Red) remain selected.
Fantastic! A simple, straightforward way to update fields! But, as any analyst knows, working with data is rarely simple or straightforward. What if you’re dealing with 20 fields that need to be updated, not just 2? In that case, manually selecting fields to update is not only tedious but also error-prone. For these types of situations, I recommend a process that allows for a more dynamic approach.
'Verticalize' the Data to Dynamically Update Fields
Transposing, or ‘verticalizing’ data, allows for a more dynamic workflow when you have unknowns in your processing. In a scenario such as this one, you may have an unknown or changing number of fields that will need to be updated in Truck Metrics. Using this approach, we’ll first Transpose both data sets to configure the Field Name and its associated value in a single row (Figure 2).
Figure 2: The data is transposed from Truck Mileage and Truck Metrics. The highlighted fields in Truck Metrics indicate which fields will be updated, as the same fields exist in Truck Mileage.
Then, we’ll Join our datasets based on two fields: Truck Number AND Name. This ensures that fields in Truck Mileage will match to the correct fields in Truck Metrics, assuming the field names in both tables are named the same way. The only selecting we’ll have to do is make sure all fields from the Right data source (in this case, Truck Metrics) are deselected (Figure 3). That way, wherever Truck Numbers and field names match, the updated values will be used.
Figure 3: The Joined fields indicate the fields that exist in both Truck Mileage and Truck Metrics. Fields in yellow (Truck Mileage) are selected to reflect updated values in downstream tools. Fields in blue (Truck Metrics) are deselected.
Note that any unmatched fields have fallen out of the Right side of the Join:
To add them back into the data stream, simply Union the Center and Right Joins together, setting the tool to “Auto Configure by Name”. Then, to rearrange the data into its original format, use the Cross Tab tool (Figure 4). And, voila! Your data is updated!
Figure 4: The updated fields are highlighted in yellow. All other fields originally included in Truck Metrics are included in the final results as well.
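If it helps to see the ‘verticalize’ pattern spelled out in code, here is a rough pandas sketch of the same melt, match, and pivot-back logic. The table and field names (Truck Number, Monday, Tuesday, Driver) and the sample values are assumptions based on the scenario above:

# Verticalize both tables, prefer the new value where a match exists, pivot back.
import pandas as pd

metrics = pd.DataFrame({"Truck Number": [1, 2], "Monday": [0, 0], "Tuesday": [0, 0], "Driver": ["Ann", "Bo"]})
mileage = pd.DataFrame({"Truck Number": [1, 2], "Monday": [120, 90], "Tuesday": [240, 185]})

long_metrics = metrics.melt(id_vars="Truck Number", var_name="Name", value_name="Old")
long_mileage = mileage.melt(id_vars="Truck Number", var_name="Name", value_name="New")

merged = long_metrics.merge(long_mileage, on=["Truck Number", "Name"], how="left")
merged["Value"] = merged["New"].where(merged["New"].notna(), merged["Old"])  # keep new value when matched

updated = merged.pivot(index="Truck Number", columns="Name", values="Value").reset_index()
print(updated)  # Monday and Tuesday carry the mileage values; Driver is untouched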
Question I have a table of sales data with each column being a week's worth of sales. I only want records that have data in each of those fields and want to filter out all records that have Null values. How can I do this?
Answer There are two basic elements necessary to make this happen. The first is that all records in the original table have a unique ID; the second is pivoting the data vertically with the Transpose Tool, shown below. If you do not have a unique ID in your data, go ahead and add a Record ID Tool.
In the sample data you can see we will want data from Rows 1 and 6 while filtering out each of the other records because they contain null values.
From here we will use the Transpose Tool to pivot your data into 3 separate columns. In the Transpose configuration, choose your unique ID as the KEY FIELD and make sure all other fields are selected as DATA FIELDS.
The result is that you will have your unique ID field; a field called [Name], which contains the names of each of the fields in your data, repeated for every unique ID in your original data; and a [Value] field, which contains the individual values for each record in each of the original columns.
Now we want to search for Nulls and get a comprehensive list of the unique ID values that do not contain any. This is the time to bring in a Summarize Tool, GroupBy your unique ID field, and then use the CountNull action on the [Value] field.
The result is a list of how many nulls exist in each of your unique ID groups.
Next we can simply filter for the records that have 0 null values, then use those unique IDs to join back to the original data and pull only those records.
It's important to note here that because I'm only interested in the original fields I intentionally chose to deselect the unique ID and the Null Count fields from the output of the join so that I am left with only those records that have data in all of the weeks.
See the attached v10.5 workflow for an example of the approach above.
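The same transpose, count, and join-back pattern can be sketched in pandas if you want to reason about it outside of Designer. The week columns and sample values below are assumptions for illustration:

# Melt the data, count nulls per record ID, keep only complete records.
import pandas as pd

df = pd.DataFrame({
    "Week1": [10, None, 7, 4],
    "Week2": [12, 5, None, 9],
    "Week3": [11, 6, 8, None],
})
df["RecordID"] = range(1, len(df) + 1)            # the Record ID Tool step

long = df.melt(id_vars="RecordID")                # the Transpose Tool step
null_counts = long.groupby("RecordID")["value"].apply(lambda s: s.isna().sum())  # Summarize: CountNull

complete_ids = null_counts[null_counts == 0].index    # Filter: keep zero-null IDs
result = df[df["RecordID"].isin(complete_ids)].drop(columns="RecordID")
print(result)  # only the rows that have data in every week survive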
Occasionally you may see one of these errors from the Join Multiple Tool. They are the result of Cartesian joins.
A Cartesian join is when you join every row of one table to every row of another table. You can also get one by joining every row of a table to every row of itself. A Cartesian join is very CPU intensive.
For example, if you have four files and each file has the same ID in it twice, the matched ID will come out 2*2*2*2 = 16 times (the field on which you're joining is the key referenced in the error; in this example, it's Field1, and the duplicated value is 12345). The same can be caused by multiple nulls in each file.
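You can reproduce that record math with a quick pandas sketch – four copies of a table that each contain the key value 12345 twice. The table contents are invented for illustration:

# Duplicate keys multiply: 2 copies per table, four tables -> 2*2*2*2 = 16 rows.
import pandas as pd

t = pd.DataFrame({"Field1": ["12345", "12345"]})   # duplicated key value
joined = t
for _ in range(3):                                 # join four copies together
    joined = joined.merge(t, on="Field1")

print(len(joined))  # 16 rows from a single key value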
After your data prep and investigation, and when you know your data are correct, your choices on how to handle Cartesian joins include:
Allow multidimensional joins: The multidimensional join will occur with no error or warning reported.
Warn on multidimensional joins of more than 16 records: A warning will be reported in the Results window that a multidimensional join has occurred.
Error on multidimensional joins of more than 16 records: An error will be reported in the Results window that a multidimensional join has occurred and downstream processing will stop.
This article is part of the Tool Mastery Series, a compilation of Knowledge Base contributions to introduce diverse working examples for Designer Tools. Here we’ll delve into uses of the Union Tool on our way to mastering the Alteryx Designer:
The Union Tool, the aptly named join category tool with DNA on it, accepts multiple input streams of data and combines them into one, unified, data stream. Whereas the Join Tool combines datasets horizontally (either by a record ID or record position), the Union Tool combines datasets vertically. Not unlike how two nucleic acid strands are unified to form the double helical DNA.
We know, great puns are in our DNA.
The Union Tool has a handful of great applications besides side-stitching punchlines, too. Check them out below:
Have common fields in multiple datasets? Stack them into a single stream with the Union Tool by field name, position, or with manual arrangement:
Don't worry - your datasets don't have to be identical. Any uncommon fields will be placed at the end of the table, with any fields that are not in a given dataset being populated with null values.
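Conceptually this is a concatenation by column name. Here is a hedged pandas sketch with made-up store data, where the uncommon Manager field comes through as nulls for the dataset that lacks it:

# Union by field name: stack the streams, uncommon fields fill with nulls.
import pandas as pd

east = pd.DataFrame({"Store": ["E1", "E2"], "Sales": [100, 150]})
west = pd.DataFrame({"Store": ["W1"], "Sales": [90], "Manager": ["Dana"]})

unioned = pd.concat([east, west], ignore_index=True, sort=False)
print(unioned)   # Manager is null for the east rows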
Creating different joins
The Alteryx Join Tool has 3 outputs: L (unmatched records from the left input), J (the records that joined), and R (unmatched records from the right input). These look like:
If you’re used to SQL joins, these are the left, inner, and right joins, respectively. The Union Tool allows you to effortlessly combine these Join outputs (shaded areas above) to create other, more complex, SQL join configurations like the ones below:
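To make that concrete, here is a small pandas sketch that splits an outer merge into stand-ins for the L, J, and R anchors and then unions them back together into a SQL-style left join and full outer join. The sample tables are assumptions for illustration:

# Recombine the Join Tool's three outputs into other SQL-style joins.
import pandas as pd

left = pd.DataFrame({"ID": [1, 2, 3], "A": ["a1", "a2", "a3"]})
right = pd.DataFrame({"ID": [2, 3, 4], "B": ["b2", "b3", "b4"]})

outer = left.merge(right, on="ID", how="outer", indicator=True)
L = outer[outer["_merge"] == "left_only"]     # Join Tool L output
J = outer[outer["_merge"] == "both"]          # Join Tool J output
R = outer[outer["_merge"] == "right_only"]    # Join Tool R output

left_join  = pd.concat([J, L])                # Union J + L  -> SQL LEFT JOIN
full_outer = pd.concat([J, L, R])             # Union J + L + R -> SQL FULL OUTER JOIN
print(len(left_join), len(full_outer))        # 3 and 4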
Combining reporting elements vertically
Simply take your reporting elements and specify an Output Order in the Union Tool to stack them vertically - without creating a single reporting element from the combination like the Layout Tool does:
Detouring in apps/macros with help from Tool Containers
See the attached v10.6 workflow, Union.yxmd, for the stack, join, and reporting examples shown above!
By now, you should have expert-level proficiency with the Union Tool! If you can think of a use case we left out, feel free to use the comments section below! Consider yourself a Tool Master already? Let us know at email@example.com if you’d like your creative tool uses to be featured in the Tool Mastery Series.
Stay tuned with our latest posts every Tool Tuesday by following Alteryx on Twitter! If you want to master all the Designer tools, consider subscribing for email notifications.
Unlike a snowflake, it is actually possible for duplicates to exist when it comes to data. To distinguish whether a record in your data is unique or a duplicate, we have an awesome tool called the Unique Tool that will actually turn your data into a unique snowflake.
The Join Tool is the quintessential tool for data blending within Alteryx. As such, it is also one of the most widely used tools. The Join Tool allows you to join data together from two different sources in two different ways: by record position and by specific fields. Joining by record position attaches the two datasets together, matching each record by the position it is in. Thus record 1 of the left dataset will be in the same row as record 1 on the right in the J output, and so on. If one dataset has more records than the other, the extra records will not be joined and will be placed in their corresponding Left or Right output (L or R). Joining by specific field will match records based on one or more fields. This article goes into how that option works in more depth and detail. I highly recommend it as a read, as it covers frequent behaviors of the tool that you might run into.
Question I feel like this has a simple answer, and I think I know the nuances of when to use each tool, but I'm finding it hard to articulate in which use cases one is better than the other. Can anyone think of when it would be best to use Append instead of Join?
Answer To use the Join tool, ideally you'll want to have a field in both datasets that match. Either an ID of some sort, or a combination of fields. The Join tool allows you to add data to your dataset only where it matches based on the common field.
There is an option within the Join tool to join on record position. This acts sort of like an append, where the first record from each input is matched regardless of the data within it; however, it will only match records up to the lower record count of the two inputs. For example, if one input anchor has 100 records and the other has 1,000, only the first 100 records will Join using this method.
The Append Fields tool adds all of the information from the S input (Source) to the records in the T input (Target). If you're not careful, you can blow out your data into a huge dataset. If you have 100 records on the Source side that you are appending to 1,000 records on the Target side you will end up with 100,000 records because each of the 100 Source records gets added to each of the 1,000 Target records.
To help with this issue, there is a setting at the bottom of the configuration for the Append Fields tool allowing you to decide what to do if the append creates more than 16 records: Error, Warn, Allow All Appends.
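If a code analogy helps, here is a hedged pandas sketch of the two behaviors side by side – a keyed join versus an append-style cross join. The sample tables are invented, and merge(how="cross") assumes pandas 1.2 or later:

# Join: matches on a shared key. Append: every Source record attached to every Target record.
import pandas as pd

target = pd.DataFrame({"CustomerID": [1, 2, 3], "Region": ["N", "S", "E"]})
lookup = pd.DataFrame({"CustomerID": [1, 3], "Segment": ["Retail", "Wholesale"]})
scenarios = pd.DataFrame({"Scenario": ["Low", "High"]})

joined = target.merge(lookup, on="CustomerID", how="inner")   # Join: only matching records
appended = target.merge(scenarios, how="cross")               # Append: every combination

print(len(joined), len(appended))   # 2 matched rows vs 3 * 2 = 6 appended rows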
This article is part of the CS Macro Development Series. The goal of this series is to communicate tips, tricks, and the thought process that goes into developing good, dynamic macros.
The Detour Tool and its counterpart, the Detour End Tool, come in handy when building out custom Analytical Apps and macro workflows where you want to “turn on” or “turn off” entire sections of the workflow based on a user input. While handy, there is an alternative approach: use Tool Containers to encapsulate the sections you’d like to turn on/off, together with a Radio Button (or other Interface Tool) and an action set to “Enable/Disable Container from Condition.” Other action types are also useful if you’d like to add more logic to the enable/disable approach. As long as you connect the outputs of each Tool Container to a Union Tool, none of your data streams require records to be output, successfully completing your bypass!
Attached is a short v10.5 example of the approach, using Radio Buttons, and the “Update Value with Formula” action to update the “Disabled” attribute of Tool Containers:
How Do I Replicate the WHERE EXISTS Functionality of SQL within Alteryx?
Example 1: UNION - Add rows from Table #1 to Table #2 if the key value of Table #1 does not exist in Table #2. If desired, the combined data set could be joined with a third data set, again only where the key value does not exist in Table #3, using a single key column.
Example 2: SELECT/UPDATE records from Table #1 based on the contents of Table #2. The statement below generates the names of customers who had orders during 2016.
select c1.customer_number,
       c1.customer_name
  from customers c1
 where 1 = 1
   and exists (select *
                 from customer_orders c2
                where 1 = 1
                  and c1.customer_number = c2.customer_number
                  and c2.order_year = 2016);
The SQL EXISTS condition is used in combination with a subquery and is considered to be met, if the subquery returns at least one row. It can be used in a SELECT, INSERT, UPDATE, or DELETE statement.
The syntax for the EXISTS condition in SQL is: WHERE EXISTS (subquery).
The subquery is a SELECT statement. If the subquery returns at least one record in its result set, the EXISTS clause will evaluate to true and the EXISTS condition will be met. If the subquery does not return any records, the EXISTS clause will evaluate to false and the EXISTS condition will not be met.
To do this in Alteryx (see attached workflow):
Scenario 1: Join the two data sets, take the unmatched records from the source you are appending data from, and union them back to the data set you are appending to.
Scenario 2: Create your "look up list" by filtering on whatever you set as the condition - in this example, a year that is in the data. Join the filtered data back with the other data source, and deselect the fields that come from the filtered source.
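For reference, here is a rough pandas sketch of Scenario 2 – the semi-join that EXISTS performs in the SQL example above. The column names follow the SQL statement; the sample data is made up:

# Keep customers only if at least one matching 2016 order exists (EXISTS pattern).
import pandas as pd

customers = pd.DataFrame({"customer_number": [1, 2, 3], "customer_name": ["Ada", "Ben", "Cy"]})
orders = pd.DataFrame({"customer_number": [1, 1, 3, 3], "order_year": [2016, 2015, 2014, 2016]})

# Filter the "look up list" on the condition, drop duplicates so the join can't multiply rows,
# then join back to the other data source.
keys_2016 = orders.loc[orders["order_year"] == 2016, ["customer_number"]].drop_duplicates()
existing = customers.merge(keys_2016, on="customer_number", how="inner")
print(existing)   # Ada and Cy, each once, like the SQL EXISTS subquery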
Example attached in the v11.3 workflow Where Exists Question.yxmd.