This site uses different types of cookies, including analytics and functional cookies (its own and from other sites). To change your cookie settings or find out more, click here. If you continue browsing our website, you accept these cookies.
Text parsing is a common data cleaning task. You have a text field (like a column of addresses, for example) where you want to extract certain parts of the field (like the city) for further analysis. But how do you do that? This guide will walk you through several tools and many examples to help you with your text parsing questions.
Something to keep in mind is that often, no one will have the exact same data format and question as you. But you can learn strategies and tools from other Community members on this topic.
If you are new to text parsing or have little practice, then go ahead and watch the Interactive Lessons on Parsing Data. There are nine lessons in this category, and they cover several tools.
When working on a text parsing problem, a good strategy is to start with the simplest solution and work your way up in complexity if the most straightforward option does not work for your data. This guide is structured accordingly, starting with the simplest solutions and moving toward the more complex methods.
The most complex (and versatile) option for parsing strings is a Regular Expression (RegEx). There are advantages and disadvantages to jumping straight to RegEx because you know it can handle any problem. Check out this blog on whether to RegEx or Not to RegEx to understand how the same problem can have multiple solutions and the trade-offs that occur with each choice.
We will go in-depth into RegEx in the final section of the guide. But now, let’s begin our text parsing journey with the Text to Columns tool!
Parsing With the Text to Columns Tool
The text to Columns tool allows you to split a text field into multiple fields or rows by specifying a delimiter to split on. Below are some delimiters you can specify to split the field:
You can also specify the delimiter as any single character (like a period or a dash). The data below is from the tool example in Designer. The Address field is selected along with “,” as the delimiter. You can specify how many output columns you would like, and voila! The address field is parsed into its components.
To understand what you can do with the Text to Columns tool, check out the other tool configurations in the tool example and the Tool Mastery article.
Splitting to Columns Examples
Here are some examples where the Text to Column tool was used to parse data on different delimiters:
In this discussion forum, you can see two solutions: one using RegEx and another using the Text to Columns tool. Both solutions allow you to parse data separated by spaces into separate columns.
It’s important to note that the Text to Columns tool will only work for String fields. See this discussion where the recommendation was to use a Formula tool since the data type to parse was a Date field. For more on parsing DateTime fields, refer to the DateTime Guide.
Splitting to Rows Examples
The Text to Columns tool can also split data from one field into multiple rows. Here are some examples from the Designer discussion forum:
If the text data you want to parse does not have a consistent delimiter, then it’s time to move on to your next options!
Parsing With the Formula Tool
As you may be aware, the Formula tool can do a lot. This includes parsing strings! The documentation on String Functions is a comprehensive list of all functions for working with strings, and many of them can be used for parsing.
You might notice that there are three RegEx functions in there: REGEX_CountMatches, REGEX_Match, and REGEX_Replace. We will cover these in the RegEx section.
Take a look at these examples for extracting parts of text fields:
Additionally, this discussion thread shows how Substring() can be used to get the text when it is always in the same position in a field and how to make the solution dynamic with REGEX_Replace() if needed.
Here are some examples of replacing text using the formula tool:
This section of the guide is focused on the Formula tool, but please note that you can use the Find Replace tool for replacing text as well. You can view the Find Replace tool mastery article here and another helpful KB on parsing with this tool here.
Structured Text Parsing Tools
Sometimes, you will have a column of text that has a distinct format due to the origin of the data. It might be web-scraped data and contain XML tags, or it could be data from an API in JSON format. For these cases, you can leverage tools to parse the data into rows and columns format that is easier to work with in Alteryx.
If your data is in XML format, you can use the XML Parse tool to parse the data into individual fields. Below is what the data looks like before and after the XML parse tool in the tool example:
Check out this discussion solution, which shows how you can use the Cross Tab tool(s) to get the output of the JSON Parse tool into a more workable format. And for a more complex JSON parsing problem, review the solution to this discussion thread.
Parsing with RegEx
What is RegEx? It’s short for Regular Expressions; these are text patterns used to find, match, replace and extract characters within text fields. RegEx is our most powerful tool for text parsing, as it offers the most flexibility.
RegEx is a tough topic—it is covered in the Alteryx Advanced Certification. If you have trouble with the RegEx tool, you are not alone! The rest of this guide is dedicated to walking through the tool, examples, and strategies to use when you are solving your most challenging text parsing problems.
Overview of RegEx in Alteryx
For shorter lessons on learning how to use Regular Expressions in Alteryx, go through these interactive lessons:
The tokenize option splits the incoming data using a regular expression. This option works similarly to the Text To Columns tool, except instead of matching and removing what you do not want, you match for what you want to keep. You want to match to the whole token, and if you have a marked group, only that part is returned. To see this option in action, review this discussion thread: Tokenize Group of Words to Column.