This site uses different types of cookies, including analytics and functional cookies (its own and from other sites). To change your cookie settings or find out more, click here. If you continue browsing our website, you accept these cookies.
Sometimes you look at the steaming pile of data before you and wonder how you’ll ever get it in the form you need. Every option seems to require a great deal of manual labor, and as a lazy– er that is, as a data blending professional, that is simply something you will not abide.
Did you know the average football game lasts 3 hours and 12 minutes and only amounts to roughly 11 minutes of play? Now, I love trying to eat Doritos through my TV screen as much as the next guy, but for me the highlights are definitely a better watch. The Summarize Tool would probably agree - the most effective communication of your data is the most concise summary of it. Whether it’s concatenating strings for storage, merging reports to have better readability, getting your spatial objects to interact, or even calculating averages and other formulas on groupings of data, the Summarize Tool can reframe your data to be more informative. This article provides a few examples on how.
You have multiple fields in your data that correspond to customer addresses. Some customers may have more than one address listed in their record. However, you want to whittle that list to one address per customer. That one address is the first, non-null address found when moving sequentially across a set of fields that contain address data.
For our example, we have three fields of data related to addresses: Address1, Address2 and Address3. The preferred field of data to use is Address1. However, if Address1 does not contain data, then we’ll use the data in Address2. If both fields of Address1 and Address2 do not have data, then we’ll use the data in Address3. The final output should contain a single list of the addresses highlighted in yellow in Figure 1.
Figure 1: The final output will contain a single list of the data highlighted in yellow.
Method 1: Write a Conditional Statement
The most common approach to this type of problem is to craft a conditional statement to select data based on a set of criteria (in this case, order). In the case of our data, it would look something like this:
IF IsNull([Address1]) AND IsNull([Address2]) THEN [Address3]
ELSEIF IsNull([Address1]) THEN [Address2]
ELSE [Address1] ENDIF
However, what if I had 20 fields of addresses instead of 3? Just getting that statement with three fields took me too long to write out! If you do have 20 fields, you might want to start typing that Formula now….
IF IsNull([Address1] AND IsNull([Address2]) AND IsNull([Address3]) AND IsNull([Address4]) AND IsNull([Address5]) AND IsNull([Address6]) AND IsNull([Address7]) AND IsNull([Address8])...
You get the idea. And now you’re thinking, “You’re going to tell me there’s a better way, right?!?” Well, yes...I am!
Method 2: Data Manipulation
An alternative method of solving this problem is to manipulate the data using the Transpose, Filter and Sample tools. I’ll share some advice from @RodL here: “If you want to make something…truly dynamic, then the ‘best practice’ when you are dealing with an unknown number of columns is to ‘verticalize’ the data”. In our case, we may know the total number of columns of address data we have; what we don’t know is which column the data we want is actually in.
Following @RodL’s suggestion, we’ll ‘verticalize’ the addresses using the Transpose tool. This stacks the addresses for each customer in order of the fields in the table (Figure 2). We’ll use the Client ID (or Record ID, if you’ve chosen to add one) as our Key Field and the fields that contain address data as our Data Fields.
Figure 2: All address fields per Record ID (or Client ID) are stacked vertically in order of the field sequence.
Since Null values are not usable records for our purposes, we’ll use a Filter to identify all the usable data (non-Null values). Now that our usable data is stacked vertically in order of field selection, we can Sample the first record from each Record ID (or Client ID) group. We’ll configure the Sample tool to identify the First N (where N = 1) Records from a group (Group by Record ID or Client ID).
Figure 3: Sample the first record from every Record ID or Client ID group.
After some data clean-up with a Select tool, we're left with a column of the selected address for each of our customers:
Is there a way to avoid the Cross Tab Tool from transferring all the input information alphabetically? Simply add a RecordID to your records and add the RecordID field as a grouping field in your Cross Tab Tool to keep the order!
You may have noticed that the Cross Tab tool will insert underscores in your headers. There may be times when this is ok, but for those times when it’s not, how do you get the original format of the data back - without the underscores? Attached is an example that provides a solution. Feel free to bypass the rest of this posting and go directly to the example.
If you'd like a narrative of what is going on in the example, read on. The Text Input tool has data that has the following data:
We want to cross tab our data, so fields in Dept Name become our new headers with Cost in Millions as our values.
So now our data looks like this:
where the headers have been changed to include underscores. Very uncool. So in a separate workflow stream, we're going to get the original format.
Attach a Select tool to the Text Input:
And select Program Type and Dept Name. We don't need the Cost in Millions anymore. We going to use the Dept Name - which is in the original format - and eventually use this to replace the data with underscores.
After the select, our data looks like this.
However, we only need one group for Dept Name. You can see it the Dept Name is repeated for each Program Type. So attach a Unique Tool and check Dept Name in the configuration window.
Now we're left with the following data:
This is where it gets interesting. We need to make a copy of the Dept Name, but we want to make it look exactly like the headers with the underscores in it. Why? We'll use this new field to align our data in sequent Union tool. In the meantime, back to creating the new field with underscores in the data.
This REGEX_Replace expression will take the current Dept Name field and replace the colons and spaces with underscores. In the real world, your actual data may contain other characters that were replaced with underscores by the Cross Tab tool. You'll need to modify this expression accordingly. In this example, the result of the REGEX_Replace statement looks like this:
where Dept Name2 looks like the headers from the Cross Tab tool.
So rather ironically we're going to cross tab this data. Connect a Cross Tab tool and in the configuration, we'll group by Program Type, make the new headers Dept Name2 and the values will be Dept Name:
The result looks like this:
We're ready to union our data at this point. Since the headers from both Cross Tab tools look the same, we can select 'Auto Configure by Name'.
Make sure the Cross Tab with the original format is attached to the Union Tool first.
By specifying the output order in the Union Tool configuration, we ensure the original field data is on top (rather than the bottom).
For the grand finale, we'll add a Dynamic Rename tool and configure the tool to 'Take Field Names from First Row of Data'. Notice that Program Type is not checked in the configuration. That's because this field is already in the format we want, so we unselect that field.
Now we have our data, cross-tabbed, with the original format. Our headers look the way we want them to!
We frequently get questions about how to flag rows in a data set that are missing values in any column.
In the example below, we will demonstrate how to add a new column to the data that gives a count of null or empty values per row. The end result will give a count of 1 to each of the highlighted rows in the image. The new column can be used in a filter tool to isolate rows of data that have missing values.
The Transpose tool pivots data for all selected fields. The column headers are listed in the name field and the corresponding data items are listed in the value field. You can also select key fields which will remain unchanged through the transformation. The transpose tool is often used in conjunction with the Cross Tab tool, which essentially works in the opposite direction.
Have you ever wanted to do a Cross Tab, but needed the results in a particular order? You've probably discovered that Cross Tab will dutifully change your data to column headings and keep the rest of the data correlated, but that the output columns will be arranged in alphabetical order. It's as if there were a Select tool with the "Sort on Field Name" Option thrown into your workflow without you putting it there.
The Dynamic Rename Tool is part of the developer category of tools. It allows the user to quickly rename any or all fields within an input stream by employing the use of different methods.
The user has the option to rename only certain fields, all fields, or even dynamic/unknown fields at runtime (e.g. after a Cross Tab Tool). The option for renaming fields are:
The Make Columns Tool takes rows of data and arranges them by wrapping records into multiple columns. The user can specify how many columns to create and whether they want records to layout horizontally or vertically.
Let's take a look at this data set that needs to be combined into one field for reporting purposes:
but it shouldn't end up looking like this:
That is no better for reporting purposes than the data we started with because we only want to see areas that need improvement and we don't want to just combine values separated by commas.
So how can we get it to look like this?
A formula such as this one could work:
If contains([ProdDessert], "Improve") then [ProdDessert] else "" endif + " & " + If contains([ProdSupplies], "Improve") then [ProdSupplies] else "" endif + " & " + If contains([ProdFood], "Improve") then [ProdFood]else "" endif + " & " + If contains([ProdFoodSvc], "Improve") then [ProdFoodSvc] else "" endif
But that's a lot of typing! And it will require cleaning up of "&" and spaces where they aren't needed.
So let's look at a dynamic way of doing this:
First, using a Transpose tool, the columns are turned into rows and the data is filtered for rows that contain the word "Improve":
Then, using a Summarize tool, the remaining rows are concatenated together using a space as the delimiter:
Lastly, some cleanup is done in a Formula tool:
And the data is ready for reporting!
Also, take a look at the attached workflow that walks through these steps.
This article was put together to resolve a common issue with cleansing your data as well as to show the use of tools and techniques that are not normally used for newer users. The goal of the article is to get newer users into these tools to open their creativity with the tool and hopefully take you to the next level!
In this use case, the data in the attached workflow is messy with capitalized strings all over the place. We want to format the data by removing some of the capitalization, but not all of it.
Note: If we wanted to make every first letter of the word capitalized we can use the Formula Tool and the TitleCase(String) function. This would make BEAR the WEIGHT - Bear The Weight. See the difference?
The tools that we will be using in this exercise is the Record ID, Text to Columns, RegEx, Formula, Tile, and Cross Tab Tools.
The exercise will show you the importance of using the Record ID Tool. The flexibility of the Text to Columns and RegEx Tools, the under-used Tile Tool, the creativity of the Formula Tool, and the not so scary Cross Tab tool when then data is configured properly.
We hope that these exercise and use cases open up your mind and the greatness of Alteryx!
See attached workflow and enjoy!
The Excel number for a modern date is most easily calculated as the number of days since 12/30/1899 on the Gregorian calendar.
Attached please find a v10.0 workflow that converts a double (Excel date) to a Datetime in Alteryx.
It uses this function in a Formula Tool:
where Datebase = a datetime value set to "1899-12-30 00:00:00"
and Field1 is double value = 42397
Question I have a table of sales data with each column being a week's worth of sales. I only want records that have data in each of those fields and want to filter out all records that have Null values. How can I do this?
Answer There are two basic elements necessary to make this happen. The first is that all records in the original table have a unique ID. If you do not have a unique ID in your data, go ahead and add a Record ID Tool.
In the sample data you can see we will want data from Rows 1 and 6 while filtering out each of the other records because they contain null values.
From here we will use the Transpose Tool to pivot your data into 3 separate columns. In the transpose field choose your unique ID as the KEY FIELD and make sure all other records are selected as DATA FIELDS.
The result is that you will have your unique ID field, a field called [Name] which contains the names of each of the fields in your data, repeated for every unique ID in your original data, and a [Value] field which contains the individual values for each of the records for each of the columns in the original data.
Now we want to search for Nulls, and get a comprehensive list of the UniqueID values that do not contain Null values. Now is the time to bring in a Summarize tool and GroupBy your unique ID field, and then use the CountNull action.
The result is a list of how many nulls exist in each of your unique ID groups.
Next we can simply filter out the fields that have 0 null values in them and then use the unique IDs to join back to the original data, and pull only those records.
It's important to note here that because I'm only interested in the original fields I intentionally chose to deselect the unique ID and the Null Count fields from the output of the join so that I am left with only those records that have data in all of the weeks.
See the attached v10.5 workflow for an example of the approach above.
Rather than develop complex processes to calculate cumulative totals, the Running Total tool can help you compute these calculations quickly and easily. What’s more is that this tool is not only useful for calculating accumulation in a dataset, but it has also proved effective in calculating running reductions, or withdrawals, in datasets as well.
To provide some context around the use of the Running Total tool, I’ve included a workflow with this article that I’ll refer to throughout this piece. In this particular example, items from a warehouse need to be distributed to different stores. The stock of each item is prioritized at each store. For added color, let’s pretend that the National Hockey League (NHL) is planning on where to stock Stanley Cup paraphernalia. The teams in the face-off? The Chicago Blackhawks and the Carolina Hurricanes (a fan can dream!). The NHL has ten items of swag to stock in 26 different stores across the country. Different items, however, are stocked in different stores. Item 1, Stanley Cup T-Shirts, will be stocked with top priority given to the stores in Chicago and Fayetteville, NC. The stores in Seattle and Houston are of the lowest priority; if there’s not enough stock, then these stores will not receive any T-Shirts. Follow along with the workflow in Alteryx as I run (no pun intended!) through the two uses of the Running Total tool for this example.
The Running Total tool is probably best known and most used to calculate cumulative sums in a field. Moving row by row, the tool adds the value of a cell to the sum of those above it. This tool can be configured to create running totals for groups or categories of data for one or multiple fields, which is useful for consolidating the number of calculations to include in the workflow. I use these capabilities in the “Running Forward” calculation in the blue box in the workflow. For this particular case of “Running Forward”, or accumulation, I calculate the running totals of each item that a warehouse supplies for two fields: the Total Required for the current shipment and the Total Amount Sold Last Quarter (see blue box in workflow). This type of calculation could be useful for the supplier to compare the differences in stock needs in the current shipment and the previous quarter. This type of information can be analyzed in a variety of ways, but I chose to insert a Multi-Row Formula and Summarize tool to investigate the percent differences per store per item in the number of items to be stocked and how many previously sold.
While the Running Total tool is an obvious choice for calculating accumulation, I’ve also used it as a component for calculating totals that “run backwards” to answer questions of when demand surpasses the amount of something in supply. Now, you’re probably scratching your head and thinking “Wait, this is the exact opposite of what the Running Total tool does!” Well, you’re right! But, using the Running Total tool in conjunction with a Multi-Row Formula tool provides the perfect reverse engineering scheme to accomplish “running reductions”.
The green box in the attached workflow demonstrates how the running total of the number of each item can be used as a withdrawal from the total amount available of each item. The cumulative totals generated by the Running Total tool are used as a sort of test against the total number of an available item. As an example of this process, look at the data coming out of the Running Total tool for Stanley Cup T-Shirts (Item 1). Does the cumulative total of the number of required T-Shirts exceed the number of T-Shirts in stock? To answer that question for T-Shirts and the rest of the items, I apply that logic with a conditional statement in a Multi-Row Formula. Stores whose required stock needs are not met are revealed after using a Filter tool. Looks like everyone gets the number of T-Shirts they need, but some stores will not receive their required number of Items 6, 7, 8 and 9. If those are the Stanley Cup parkas, my guess is that our friends in Texas and Florida won’t miss them too much, anyway.
So, where can you find this time-saver of a tool? The Running Total tool is located under the Transformation tool group and is best described using @danielkresina’s observation: it’s the tool whose icon has a guy running on it. Configure the tool on your canvas, run the workflow, and celebrate the ease of calculating running totals. Happy Data Blending!
Trying to convert all of your old, mundane Excel workbooks into Alteryx workflows? The Running Total Tool could be the key to your success! You know, it’s that tool in the Transform category with the little running man picture on it.