This site uses different types of cookies, including analytics and functional cookies (its own and from other sites). To change your cookie settings or find out more, click here. If you continue browsing our website, you accept these cookies.
This article features a workflow (created in version 10.1) that will allow you to search for key words within a column of text strings. Potential applications for this might be to scan a Twitter feed, customer product feedback or Facebook comments for key words such as the mention of particular people, descriptive words or subject matters. For this example, the purpose will be to search famous science quotes for a select group of key words.
In the first section of the workflow, we are performing the following functions:
Reading in the data - The first table contains the text strings to be searched (quotes); the second table contains the key words that we want to search for.
Adding a Record Id in order to join the original text strings back into the data.
Using the Text to Columns tool, we will split the text strings into individual rows for each word within each text string.
Append each of the key words to each of the records/words resulting from the previous step.
In the next section of the workflow, we will:
Use the Multi-Field Formula tool to change all words in both the text strings and key words to uppercase as the FindString function used downstream is case sensitive.
Use the RegEx tool to remove non-word characters from the text string words (such as periods, commas, etc.) so that we can do a double-check on the FindString by using a length test. This will prevent the FindString function from returning a match of the key word 'know' to the text string 'knowledge'.
Match words using the Formula tool and the FindString function in conjuction with comparing the length of the key words to the text string words.
In the final section of the workflow, we will complete the steps of:
Summarizing the counts of each key word to each Record ID of the text strings.
Joining back in the original text strings.
Sorting the data back to its original state.
Transposing the data in order to perform a cross-tab count of the key words.
Performing a final summary count of the key words.
Things to consider:
Your text strings might contain non-word characters not accounted for in the RegEx in this example. Make adjustments to the RegEx as appropriate.
The workflow results can be used to make a Word Cloud
As with almost anything Alteryx, there is more than one way to do things. Try some variations and post anything you think is worth while!
Between the RegEx , Text To Columns , and XML Parse Tools , the Alteryx data artisan already has an exceptionally robust selection of tools to help parse uniquely delimited data. However, there are still some data sets so entangled in formatting that it’s labor intensive to parse even for them. Enter the Find and Replace Tool , which captures the ability to find your nightmarish parsing workflows and replace them with sweet color by number pictures. Just kidding, it finds bad jokes and replaces them with good ones. Seriously, though, you could do both if you wanted to because this tool has the capability to look up a table of any number of specified targets to find in your data and will replace them with a table of specified sources. With the help of a few quick configuration steps, this tool can simplify some parsing use cases significantly.
How do you cut off a certain amount of characters from the right hand of a field?
Starting Values: Blue4509, Blue2034, Yellow2345, Orange2341
Ending Values: Blue, Yellow, Orange
There are many ways to accomplish this - below and attached (v10.5) find just 4 of the options.
Option 1: You could use Left(Data, (Length(Data) - 4)) This method works well if your numbers are all 4 digits.
Option 2A: Regular expression either in the RegEx Tool or using Regex_Replace in the Formula Tool. This option allows for a varied number of digits.
In the Formula Tool, your expression could be REGEX_Replace([Data],"(\l+)\d*","$1")
The \l is looking for lowercase letters
The + specifies that the lowercase letter(s) must appear at least once
The \d is looking for digits
The * specifies that the digits may or may not appear (or is there "zero or more times").
The brackets around the \l+ "mark" that part of the pattern as special
So if the input is "Blue450"
REGEX_REPLACE([Input],"(\l+)\d*","$1") will return "Blue".
Option 2B: If you changed the expression to "mark" the digits portion, you could separate the two marked portions
REGEX_REPLACE([Input],"(\l+)(\d*)","$1 light has a wavelength of at least $2 nanometers") which will return the phrase:
"Blue light has a wavelength of at least 450 nanometers"
It is important to note that this function is case-insensitive by default so a \u (uppercase) would give the same result. If you want the expression to be case sensitive, add a ",0" to the end of the expression.
Grouping = ()
Also, if you use the RegEx Tool (from the Parse toolbox) you can access a list of commonly used RegEx 'tags'...and don't forget to access the Alteryx help with F1. For fun: http://xkcd.com/208/
Option 3: Another RegEx option is to separate the letters from the digits. ([[:alpha:]]+)(\d.+)
[:alpha:] = any letter
\d = digit
+ = one or more times
() = marked group
This article was put together to resolve a common issue with cleansing your data as well as to show the use of tools and techniques that are not normally used for newer users. The goal of the article is to get newer users into these tools to open their creativity with the tool and hopefully take you to the next level!
In this use case, the data in the attached workflow is messy with capitalized strings all over the place. We want to format the data by removing some of the capitalization, but not all of it.
Note: If we wanted to make every first letter of the word capitalized we can use the Formula Tool and the TitleCase(String) function. This would make BEAR the WEIGHT - Bear The Weight. See the difference?
The tools that we will be using in this exercise is the Record ID, Text to Columns, RegEx, Formula, Tile, and Cross Tab Tools.
The exercise will show you the importance of using the Record ID Tool. The flexibility of the Text to Columns and RegEx Tools, the under-used Tile Tool, the creativity of the Formula Tool, and the not so scary Cross Tab tool when then data is configured properly.
We hope that these exercise and use cases open up your mind and the greatness of Alteryx!
Enjoy! The attached workflow is in version 10.5.
If you're like me, you may find RegEx a bit intimidating. While it can be confusing, once you get a handle on it you'll see just how powerful it actually is. Below is a post from the Discussion boards that should make an introduction into RegEx less daunting. The full post is below, and I've attached the sample workflow to this post as well.
A lot of data prep questions I get asked about seem to come back to the same answer - "Use Regex!"
However, RegEx isn't exactly easy to understand for someone who has never come across it before.
This is how I started out, but by dissecting other peoples' uses of RegEx, I started to understand how RegEx works and now I seem to use it in pretty much every workflow I create (slight exaggeration).
So I decided to upload the attached workflow that has 12 use cases for RegEx for you to use, adapt, learn from and improve on. Please download the attached Alteryx workflow and explore what's going on and you'll soon be more than comfortable with RegEx too!
I've also put some useful RegEx links on the original blog for this workflow here: http://www.theinformationlab.co.uk/2016/10/27/regex-practice-using-alteryx/
Feel free to upload your use cases for RegEx on this discussion post, along with examples, and perhaps we can start to create a small repository for everyone - from RegEx newbies to those highly skilled in this complex but extremely useful tool.
The attached workflow is in v10.5. Refer to our RegEx Tool Mastery if you'd like to learn more about the tool and its many uses!
The RegEx tool is kind of like the Swiss Army Knife of parsing in Alteryx; there are a whole lot of ways you can use it to do things faster or more effectively, but even if you just use the blade it's still immensely useful. Sometimes that's all you need, but if you do take the time to figure out how to use a few other tools in that knife, you'll start to see that there isn't much you can't do with it .
Let's say you have data like this and you'd like to remove the numbers at the end:
You could use Left(x, len), like this: left([Data], (length([Data]) - 4))
(This calculates the length of the field, subtracts 4 and then takes that many characters from the left. E.g.: length(Blue4509) = 8, 8-4 = 4, left(Blue4509,4) = Blue)
Or a combination of ReverseString(Str) and Substring(x, start, length) : ReverseString(substring(ReverseString([Data]),4,100))
(This reverses the string, uses a substring to start at the 5th position [positions are zero based], takes the next 100 characters, and then reverses it back. E.g.: reversestring(Blue4509) = 9054eulB, 5th position = e, next 100 characters = eulB, reversing it back results in Blue)
BUT what if your data looks like this:
Blue4509 Yellow2345 Orange231 Blue6754 Green596828
The above won't work anymore, but ReplaceChar(x, y, z) will: replacechar([Data], '0123456789','')
(ReplaceChar(x,y,z) returns the string [x] after replacing each occurrence of the character [y] with the character [z].)
Or Regex_Replace(string, pattern, replace,icase): regex_replace([Data],"\d","")
(Regex_Replace(string, pattern, replace,icase) returns the string after replacing the pattern with the replace. In this case, it replace all digits.)
Or the Regex Tool:
\d in regular expression identifies digits (aka numbers)
What if your data looks like this:
1Blue4509 2Yellow2345 3Orange231 4Blue6754 5Green596828
and you want to preserve the number at the beginning of the string but remove the ones at the end?
The $ character in regex denotes the end of the string, * identifies one or more instances of the preceding character, so \d*$ identifies one or more digits at the end of the string: regex_replace([Data], "\d*$","") or:
Also check out all the other functions on the Alteryx help page, POSIX Extended Regular Expression Syntax, more info on the Regex Tool, and the Regex Cheat Sheet on the community.
Question How do you convert a string with a $ sign into a number?
Answer 1. First remove the dollar sign and commas out of the string
2. Convert () to a negative sign "-"
3. ToNumber() function
There are many ways to do this in Alteryx, here are a few:
Formula Tool - ToNumber(Replace(Replace(Replace(Replace([Data], '$', ''), ',' , ''), ')' , ''), '(' , '-'))
Formula Tool - ToNumber(REGEX_Replace(REGEX_Replace([Data], "\$|,|\)", ""), "\(", "-"))
RegEx Tool #1 - Replace \$|,|\) with Replacement Text "blank", then
RegEx Tool #2 - Replace \( with Replacement Text "-"
Select Tool - Change Type to Fixed Decimal
See v10.5 workflow attached.
The RegEx tool is faster than using a regular expression in a formula. It has the disadvantage of only looking in 1 field at a time, but alas, it will also be much faster if you’re only looking in the fields that you need to.
Use the RegEx tool’s drop down list to make the building of the expression easier: . Any Single Character * Zero or More + One or More < Start of Word > End of Word ^ Beginning of Line $ End of Line Newline Tab s White Space Character Escape Special Character d Digit l Lower Case Letter u Upper Case Letter [[:alpha:]] Any Letter w Word Character [a-zA-Z_0-9] W Non-Word Character [^a-zA-Z_0-9] <w+> Entire Word [ ] Any One Character in the Set [^ ] Any One Character Not in the Set | Or ( ) Marked Group (?: ) Unmarked Group
When you use a function inside of another function in RegEx, the interior part gets evaluated first. For instance in the formula CharToInt(Right([F1], 3)) + CharToInt(ReplaceChar([F1], "Ext.", "1")) the interior Right() formula will run before the Character to integer formula. The Replace Character formula will run before the Character to integer.
Business Problem: Data preparation is time consuming and inefficient. Analysis requires multiple formats and types of data to be brought together within a streamlined process. Data warehouses are often incompatible silos of information that are both difficult and impractical to integrate with other data sources. Actionable Results:
Quickly and easily parse complex data streams
Integrate data from various disparate sources
Create scalable and repeatable data cleansing workflows
Overview: Effective strategic analysis often involves the integration of several types of disparate data formats. In the past, this has been an arduous task, and could pose a daunting challenge to even the most experienced analysts. Alteryx simplifies data hygiene techniques by providing streamlined functionality as both an ETL and processing tool. With modules that are simple to customize and run, tasks such as data parsing become feasible in a short amount of time. Any type of data cleansing can be easily accomplished using Alteryx, and every workflow created becomes scalable and repeatable. This module demonstrates how virtually any data stream can manipulated and formatted with relative ease. Vertical: Applicable to any vertical Key Tools Used: Text to Columns,Transpose, Multi-Row Formula, Summarize (Group By )
RegEx….it can be tough, but extremely useful when looking for ways of extracting information from a string. Regular Expressions are basically a code you can write to match a specific set of characters (it's a pattern matching syntax). It could be something as simple as finding the three-digit number in the string “Eample123Eample” or something more complicated like using hexadecimal to select a certain range of characters. RegEx is something that can be used in Alteryx via the RegEx tool in the Parse toolset and some of the main uses for RegEx are replacing, matching, and parsing specific characters within a string.