Between the RegEx, Text To Columns, and XML Parse tools, the Alteryx data artisan already has an exceptionally robust selection of tools to help parse uniquely delimited data. However, some data sets are so entangled in formatting that they are labor intensive to parse even with those tools. Enter the Find and Replace Tool, which captures the ability to find your nightmarish parsing workflows and replace them with sweet color-by-number pictures. Just kidding, it finds bad jokes and replaces them with good ones. Seriously, though, you could do both if you wanted to, because this tool can look up a table of any number of specified targets to find in your data and replace them with a table of specified sources. With a few quick configuration steps, this tool can simplify some parsing use cases significantly.
RegEx… it can be tough, but it is extremely useful when you need to extract information from a string. Regular Expressions are essentially a pattern-matching syntax: code you write to match a specific set of characters. It could be something as simple as finding the three-digit number in the string "Example123Example" or something more complicated like using hexadecimal codes to select a certain range of characters. In Alteryx, RegEx is available via the RegEx tool in the Parse tool set, and some of its main uses are replacing, matching, and parsing specific characters within a string. Replacing text is a very handy process when working with string fields, as it can be used to replace virtually any set of characters. In the attached example module you will find the following data in a Text Input Tool: What I would like to do for this example is replace the portion of the string "file" and only that portion. The easiest way to do this using the RegEx tool is to follow these steps:
Make sure that the correct field is selected in the Field to Parse drop down.
Enter "file" as the RegEx in the Regular Expression text box.
Make sure that your Output Method is set to Replace.
Finally, enter a replacement text in the Replacement Text text box.
In this example, I chose "FileNumber_". Once a Browse is connected and the module is run, you can see that the output has replaced all of the characters matching "file" with "FileNumber_". Using RegEx to match is another process that can be very helpful in your data preparation or transformation process. The Match output method will assign a True or False value to your data based on the RegEx that you have written. I'm going to use the same initial data from above, but this time I want to match only the records that are "file1…", "file2…", and "file3…". To do this, I have written a regular expression of: file[123].+ . The breakdown of the RegEx script is as follows:
First, I am searching for “file” much like I did in the replace example.
Next, I am using brackets "[123]" to only look for the numbers 1, 2, or 3. Square brackets match any one of the characters they contain, much like an OR statement in a Formula expression.
And lastly, ".+": the period will match any character and the plus requires one or more of them. This covers the file extension in our data.
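The replace and match steps above can be sketched outside Alteryx with Python's `re` module. This is an illustration of the same patterns, not the Alteryx engine, and the sample filenames are assumed since the article's Text Input data isn't reproduced here:

```python
import re

files = ["file1.txt", "file2.csv", "file3.doc", "file4.pdf"]

# Replace: swap the literal text "file" for "FileNumber_"
replaced = [re.sub("file", "FileNumber_", f) for f in files]

# Match: True only for file1, file2, or file3 followed by an extension
matched = [bool(re.match(r"file[123].+", f)) for f in files]
# -> [True, True, True, False]
```

Note that file4 fails the match because "4" is not inside the `[123]` character set.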
As you can see in the below output, only records that had a file name of file1, file2, or file3 have been returned as a True match, while file4 returns False: Parsing with the RegEx tool can be a little more complicated, but again it is an extremely useful process that will help in your data prep and transformation stages. The nice thing about parsing with the RegEx tool is that it also allows you to alter the fields upon output. Once you have a valid regular expression for parsing, you will be able to change the field name, the data type, and the size of the field. The trick to parsing using RegEx is that you have to create a group, and this is done by using parentheses "()" to identify each group. You can use multiple groups to parse into multiple fields, but for this example there will just be one group. Using the same initial file name field, we will use a regular expression to parse just the file name without an extension. For this I have used the following expression: ^(.+)\..+$ . Just like before, the breakdown is as follows:
Using a “^” in RegEx means to look at the start of the string. This portion “(.+)” is looking for my group. In this case, any number of characters. And since I added a “^” to the beginning, it will look at the beginning of the string.
The "\." portion will look for the period between the file name and the extension. Putting a backslash before the period tells RegEx that you are literally looking for a period (the backslash is called an escape character). Otherwise a period is a wildcard, meaning it will match any character.
The ".+$" is the last portion, where we look for the extension. ".+" will match any characters, which accounts for all of the extensions, while the dollar sign makes sure that this is looking at the end of the string.
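The parse pattern can be sketched in Python's `re` module (an illustration of the same expression, with an assumed sample filename):

```python
import re

# ^(.+)\..+$ : marked group = everything before the last period
m = re.match(r"^(.+)\..+$", "file1.txt")
name = m.group(1)
```

Because `.+` is greedy, the group keeps everything up to the *last* period, so a name like "archive.tar.gz" would parse to "archive.tar".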
Here is the outcome of using the parse method to find just the file name: I hope this helps you get started with regular expressions! As always, if you have any questions please don't hesitate to email firstname.lastname@example.org. Special thanks to Mike Akey for putting together this post, until next time!
The RegEx tool is kind of like the Swiss Army Knife of parsing in Alteryx; there are a whole lot of ways you can use it to do things faster or more effectively, but even if you just use the blade it's still immensely useful. Sometimes that's all you need, but if you do take the time to figure out how to use a few other tools in that knife, you'll start to see that there isn't much you can't do with it .
If you're like me, you may find RegEx a bit intimidating. While it can be confusing, once you get a handle on it you'll see just how powerful it actually is. Below is a post from the Discussion boards that should make an introduction into RegEx less daunting. The full post is below, and I've attached the sample workflow to this post as well.
A lot of data prep questions I get asked about seem to come back to the same answer - "Use Regex!"
However, RegEx isn't exactly easy to understand for someone who has never come across it before.
This is how I started out, but by dissecting other people's uses of RegEx, I started to understand how RegEx works and now I seem to use it in pretty much every workflow I create (slight exaggeration).
So I decided to upload the attached workflow that has 12 use cases for RegEx for you to use, adapt, learn from and improve on. Please download the attached Alteryx workflow and explore what's going on and you'll soon be more than comfortable with RegEx too!
I've also put some useful RegEx links on the original blog for this workflow here: http://www.theinformationlab.co.uk/2016/10/27/regex-practice-using-alteryx/
Feel free to upload your use cases for RegEx on this discussion post, along with examples, and perhaps we can start to create a small repository for everyone - from RegEx newbies to those highly skilled in this complex but extremely useful tool.
The attached workflow is in v10.5. Refer to our RegEx Tool Mastery if you'd like to learn more about the tool and its many uses!
Business Problem: Data preparation is time consuming and inefficient. Analysis requires multiple formats and types of data to be brought together within a streamlined process. Data warehouses are often incompatible silos of information that are both difficult and impractical to integrate with other data sources. Actionable Results:
Quickly and easily parse complex data streams
Integrate data from various disparate sources
Create scalable and repeatable data cleansing workflows
Overview: Effective strategic analysis often involves the integration of several types of disparate data formats. In the past, this has been an arduous task, and could pose a daunting challenge to even the most experienced analysts. Alteryx simplifies data hygiene techniques by providing streamlined functionality as both an ETL and processing tool. With modules that are simple to customize and run, tasks such as data parsing become feasible in a short amount of time. Any type of data cleansing can be easily accomplished using Alteryx, and every workflow created becomes scalable and repeatable. This module demonstrates how virtually any data stream can be manipulated and formatted with relative ease. Vertical: Applicable to any vertical Key Tools Used: Text to Columns, Transpose, Multi-Row Formula, Summarize (Group By)
How do you cut off a certain number of characters from the right side of a field?
Starting Values: Blue4509, Blue2034, Yellow2345, Orange2341
Ending Values: Blue, Yellow, Orange
There are many ways to accomplish this - below and attached (v10.5) you'll find just 4 of the options.
Option 1: You could use Left(Data, (Length(Data) - 4)) This method works well if your numbers are all 4 digits.
Option 2A: Regular expression either in the RegEx Tool or using Regex_Replace in the Formula Tool. This option allows for a varied number of digits.
In the Formula Tool, your expression could be REGEX_Replace([Data],"(\l+)\d*","$1")
The \l is looking for lowercase letters
The + specifies that the lowercase letter(s) must appear at least once
The \d is looking for digits
The * specifies that the digits may or may not appear (that is, "zero or more times").
The parentheses around the \l+ "mark" that part of the pattern as a group
So if the input is "Blue450"
REGEX_REPLACE([Input],"(\l+)\d*","$1") will return "Blue".
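Option 2A can be sketched in Python's `re` module. Python has no `\l` token, so `[a-z]` with the IGNORECASE flag stands in for Alteryx's lowercase-letter tag and its case-insensitive default; this is an approximation, not the Alteryx function:

```python
import re

# (\l+)\d*  ->  ([a-z]+)\d*  with IGNORECASE; $1 -> \1
result = re.sub(r"([a-z]+)\d*", r"\1", "Blue4509", flags=re.IGNORECASE)
```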
Option 2B: If you changed the expression to "mark" the digits portion, you could separate the two marked portions
REGEX_REPLACE([Input],"(\l+)(\d*)","$1 light has a wavelength of at least $2 nanometers") which will return the phrase:
"Blue light has a wavelength of at least 450 nanometers"
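The two-group version of Option 2B translates the same way (again using `[a-z]` plus IGNORECASE as a stand-in for Alteryx's `\l`):

```python
import re

phrase = re.sub(
    r"([a-z]+)(\d*)",
    r"\1 light has a wavelength of at least \2 nanometers",
    "Blue450",
    flags=re.IGNORECASE,
)
```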
It is important to note that this function is case-insensitive by default, so a \u (uppercase) would give the same result. If you want the expression to be case sensitive, add a ",0" as the final argument of the function.
Grouping = ()
Also, if you use the RegEx Tool (from the Parse toolbox) you can access a list of commonly used RegEx 'tags'...and don't forget to access the Alteryx help with F1. For fun: http://xkcd.com/208/
Option 3: Another RegEx option is to separate the letters from the digits. ([[:alpha:]]+)(\d.+)
[:alpha:] = any letter
\d = digit
+ = one or more times
() = marked group
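Option 3 can be sketched in Python as well. Python's `re` does not support the POSIX `[[:alpha:]]` class, so `[A-Za-z]` is the usual stand-in:

```python
import re

# ([[:alpha:]]+)(\d.+)  ->  ([A-Za-z]+)(\d.+)
m = re.match(r"([A-Za-z]+)(\d.+)", "Yellow2345")
letters, digits = m.group(1), m.group(2)
```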
This article was put together to resolve a common data cleansing issue and to showcase tools and techniques that newer users don't normally reach for. The goal of the article is to get newer users into these tools, open up their creativity, and hopefully take them to the next level!
In this use case, the data in the attached workflow is messy with capitalized strings all over the place. We want to format the data by removing some of the capitalization, but not all of it.
Note: If we wanted to make every first letter of the word capitalized we can use the Formula Tool and the TitleCase(String) function. This would make BEAR the WEIGHT - Bear The Weight. See the difference?
The tools that we will be using in this exercise are the Record ID, Text to Columns, RegEx, Formula, Tile, and Cross Tab tools.
The exercise will show you the importance of using the Record ID Tool, the flexibility of the Text to Columns and RegEx tools, the under-used Tile Tool, the creativity of the Formula Tool, and the not-so-scary Cross Tab Tool when the data is configured properly.
We hope that these exercises and use cases open your mind to the greatness of Alteryx!
Enjoy! The attached workflow is in version 10.5.
A great new feature in v10 is the ability to turn on profiling to see which tools are taking up most of the time in your module.
You can turn it on here in the Workflow Configuration in the Runtime section:
This setting will add a profile of the tool times (in descending order) to your output log:
Thanks to @ScottL for mentioning this at the Analytic Independence Roadshow in London, it's really useful for helping optimise workflows, and as he pointed out the screenshot shows off the speed of the regex engine too!
Let's say you have data like this and you'd like to remove the numbers at the end:
You could use Left(x, len), like this: left([Data], (length([Data]) - 4))
(This calculates the length of the field, subtracts 4 and then takes that many characters from the left. E.g.: length(Blue4509) = 8, 8-4 = 4, left(Blue4509,4) = Blue)
Or a combination of ReverseString(Str) and Substring(x, start, length) : ReverseString(substring(ReverseString([Data]),4,100))
(This reverses the string, uses a substring to start at the 5th position [positions are zero based], takes the next 100 characters, and then reverses it back. E.g.: reversestring(Blue4509) = 9054eulB, 5th position = e, next 100 characters = eulB, reversing it back results in Blue)
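The two formula approaches above map neatly onto Python string slicing, sketched here as an illustration:

```python
data = "Blue4509"

# Left(Data, Length(Data) - 4): keep everything but the last four characters
left_style = data[: len(data) - 4]

# ReverseString/Substring: reverse, skip four characters, reverse back
reverse_style = data[::-1][4:][::-1]
```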
BUT what if your data looks like this:
Blue4509 Yellow2345 Orange231 Blue6754 Green596828
The above won't work anymore, but ReplaceChar(x, y, z) will: replacechar([Data], '0123456789','')
(ReplaceChar(x,y,z) returns the string [x] after replacing each occurrence of any character in [y] with the character [z].)
Or Regex_Replace(string, pattern, replace,icase): regex_replace([Data],"\d","")
(Regex_Replace(string, pattern, replace, icase) returns the string after replacing the pattern with the replace. In this case, it replaces all digits.)
Or the Regex Tool:
\d in regular expression identifies digits (aka numbers)
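Stripping every digit with `\d` can be sketched in Python's `re` module, using the varied-length values from the example above:

```python
import re

values = ["Blue4509", "Yellow2345", "Orange231", "Green596828"]

# \d matches any single digit, wherever it appears in the string
stripped = [re.sub(r"\d", "", v) for v in values]
```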
What if your data looks like this:
1Blue4509 2Yellow2345 3Orange231 4Blue6754 5Green596828
and you want to preserve the number at the beginning of the string but remove the ones at the end?
The $ character in regex denotes the end of the string, * matches zero or more instances of the preceding character, so \d*$ matches the run of digits at the end of the string: regex_replace([Data], "\d*$","") or:
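The end-anchored pattern behaves the same way in Python's `re` module; note the leading digit survives because the `$` anchor only reaches the trailing run:

```python
import re

values = ["1Blue4509", "2Yellow2345", "5Green596828"]

# \d*$ only matches digits anchored to the end of the string
trimmed = [re.sub(r"\d*$", "", v) for v in values]
```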
Also check out all the other functions on the Alteryx help page, POSIX Extended Regular Expression Syntax, more info on the Regex Tool, and the Regex Cheat Sheet on the community.
Errors from the R tool, and macros using R, such as the Association Analysis tool, can be a challenge to troubleshoot. Since R is very particular about naming conventions, and the R macros are very particular about delimiters, one easy way to preemptively limit these errors is to make sure your field names are R-friendly and compliant.
Best practice is to use only letters, numbers, and the underscore. Any special characters will likely upset R. One hint is "Error: No valid fields were selected." This means that one of the Pipe-to-Rows tools embedded in the R macros could not appropriately parse the field names.
For example, the following errors came from the Association Analysis tool, where the field names contain the gamut from asterisks to ampersands:
Error: Association Analysis (4): The Action "select fields A" has an error: ParseError: Expected entity name for reference at line 7 and column 30 while processing "Action_UpdateRawXml"
Error: Association Analysis (4): The Action "select fields A" has an error: ParseError: Expected whitespace at line 13 and column 63 while processing "Action_UpdateRawXml"
Error: Association Analysis (3): The Action "select fields A" has an error: ParseError: Unterminated entity reference, 'test' at line 4 and column 30 while processing "Action_UpdateRawXml"
If you have a thousand fields with every special character, don't fret. Dynamic Rename and regex to the rescue!
Using the Rename Mode: Formula, select all your fields, and replace any non-word character with an underscore:
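The non-word-to-underscore substitution can be sketched in Python's `re` module; the field names here are hypothetical examples, not from the article's data:

```python
import re

field_names = ["sales*total", "profit&loss", "unit price"]

# \W matches any non-word character (anything outside [a-zA-Z0-9_])
safe = [re.sub(r"\W", "_", name) for name in field_names]
```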
Crisis averted! Happy Alteryx-ing.
Question How do you convert a string with a $ sign into a number?
Answer 1. First remove the dollar sign and commas out of the string
2. Convert () to a negative sign "-"
3. ToNumber() function
There are many ways to do this in Alteryx, here are a few:
Formula Tool - ToNumber(Replace(Replace(Replace(Replace([Data], '$', ''), ',' , ''), ')' , ''), '(' , '-'))
Formula Tool - ToNumber(REGEX_Replace(REGEX_Replace([Data], "\$|,|\)", ""), "\(", "-"))
RegEx Tool #1 - Replace \$|,|\) with Replacement Text "blank", then
RegEx Tool #2 - Replace \( with Replacement Text "-"
Select Tool - Change Type to Fixed Decimal
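The regex approach above can be sketched in Python, mirroring the two REGEX_Replace passes followed by the numeric conversion; the sample currency strings are assumed:

```python
import re

def to_number(s):
    # Pass 1: strip $, commas, and the closing parenthesis
    s = re.sub(r"\$|,|\)", "", s)
    # Pass 2: turn the opening parenthesis into a leading minus sign
    s = re.sub(r"\(", "-", s)
    return float(s)
```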
See v10.5 workflow attached.
This article features a workflow (created in version 10.1) that will allow you to search for key words within a column of text strings. Potential applications for this might be to scan a Twitter feed, customer product feedback or Facebook comments for key words such as the mention of particular people, descriptive words or subject matters. For this example, the purpose will be to search famous science quotes for a select group of key words.
In the first section of the workflow, we are performing the following functions:
Reading in the data - The first table contains the text strings to be searched (quotes); the second table contains the key words that we want to search for.
Adding a Record Id in order to join the original text strings back into the data.
Using the Text to Columns tool, we will split the text strings into individual rows for each word within each text string.
Append each of the key words to each of the records/words resulting from the previous step.
In the next section of the workflow, we will:
Use the Multi-Field Formula tool to change all words in both the text strings and key words to uppercase as the FindString function used downstream is case sensitive.
Use the RegEx tool to remove non-word characters from the text string words (such as periods, commas, etc.) so that we can do a double-check on the FindString by using a length test. This will prevent the FindString function from returning a match of the key word 'know' to the text string 'knowledge'.
Match words using the Formula tool and the FindString function in conjunction with comparing the length of the key words to the text string words.
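As an aside, the whole-word effect that the FindString-plus-length test achieves can be sketched with a regex word boundary in Python's `re` module. This is an alternative technique for illustration, not the workflow's actual method, and the quote is a made-up example:

```python
import re

quote = "Knowledge grows when you know where to look"

# \b boundaries prevent the key word "know" from matching inside "Knowledge"
hits = re.findall(r"\bknow\b", quote, flags=re.IGNORECASE)
```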
In the final section of the workflow, we will complete the steps of:
Summarizing the counts of each key word to each Record ID of the text strings.
Joining back in the original text strings.
Sorting the data back to its original state.
Transposing the data in order to perform a cross-tab count of the key words.
Performing a final summary count of the key words.
Things to consider:
Your text strings might contain non-word characters not accounted for in the RegEx in this example. Make adjustments to the RegEx as appropriate.
The workflow results can be used to make a Word Cloud
As with almost anything Alteryx, there is more than one way to do things. Try some variations and post anything you think is worthwhile!
Use the RegEx tool's drop-down list to make building the expression easier:
. Any Single Character
* Zero or More
+ One or More
\< Start of Word
\> End of Word
^ Beginning of Line
$ End of Line
\n Newline
\t Tab
\s Whitespace Character
\ Escape Special Character
\d Digit
\l Lower Case Letter
\u Upper Case Letter
[[:alpha:]] Any Letter
\w Word Character [a-zA-Z_0-9]
\W Non-Word Character [^a-zA-Z_0-9]
\<\w+\> Entire Word
[ ] Any One Character in the Set
[^ ] Any One Character Not in the Set
| Or
( ) Marked Group
(?: ) Unmarked Group
When you use a function inside another function in a Formula expression, the interior part is evaluated first. For instance, in the formula CharToInt(Right([F1], 3)) + CharToInt(ReplaceChar([F1], "Ext.", "1")), the interior Right() runs before its CharToInt(), and ReplaceChar() runs before its CharToInt().
The RegEx tool is faster than using a regular expression in a Formula tool. It has the disadvantage of only looking at one field at a time, but it will also be much faster if you're only looking in the fields you actually need.