on 11-28-2022 03:23 PM - edited on 07-28-2023 07:36 AM by tristank
Text parsing is a common data cleaning task. You have a text field (like a column of addresses, for example) where you want to extract certain parts of the field (like the city) for further analysis. But how do you do that? This guide will walk you through several tools and many examples to help you with your text parsing questions.
Something to keep in mind is that often, no one will have the exact same data format and question as you. But you can learn strategies and tools from other Community members on this topic.
If you are new to text parsing or have little practice, then go ahead and watch the Interactive Lessons on Parsing Data. There are nine lessons in this category, and they cover several tools.
When working on a text parsing problem, a good strategy is to start with the simplest solution and work your way up in complexity if the most straightforward option does not work for your data. This guide is structured accordingly, starting with the simplest solutions and moving toward the more complex methods.
The most complex (and versatile) option for parsing strings is a Regular Expression (RegEx). There are advantages and disadvantages to jumping straight to RegEx because you know it can handle any problem. Check out this blog on whether to RegEx or Not to RegEx to understand how the same problem can have multiple solutions and the trade-offs that occur with each choice.
We will go in-depth into RegEx in the final section of the guide. But now, let’s begin our text parsing journey with the Text to Columns tool!
The text to Columns tool allows you to split a text field into multiple fields or rows by specifying a delimiter to split on. Below are some delimiters you can specify to split the field:
You can also specify the delimiter as any single character (like a period or a dash). The data below is from the tool example in Designer. The Address field is selected along with “,” as the delimiter. You can specify how many output columns you would like, and voila! The address field is parsed into its components.
To understand what you can do with the Text to Columns tool, check out the other tool configurations in the tool example and the Tool Mastery article.
Here are some examples where the Text to Column tool was used to parse data on different delimiters:
In this discussion forum, you can see two solutions: one using RegEx and another using the Text to Columns tool. Both solutions allow you to parse data separated by spaces into separate columns.
It’s important to note that the Text to Columns tool will only work for String fields. See this discussion where the recommendation was to use a Formula tool since the data type to parse was a Date field. For more on parsing DateTime fields, refer to the DateTime Guide.
The Text to Columns tool can also split data from one field into multiple rows. Here are some examples from the Designer discussion forum:
If the text data you want to parse does not have a consistent delimiter, then it’s time to move on to your next options!
As you may be aware, the Formula tool can do a lot. This includes parsing strings! The documentation on String Functions is a comprehensive list of all functions for working with strings, and many of them can be used for parsing.
You might notice that there are three RegEx functions in there: REGEX_CountMatches, REGEX_Match, and REGEX_Replace. We will cover these in the RegEx section.
Take a look at these examples for extracting parts of text fields:
Additionally, this discussion thread shows how Substring() can be used to get the text when it is always in the same position in a field and how to make the solution dynamic with REGEX_Replace() if needed.
Here are some examples of replacing text using the formula tool:
This section of the guide is focused on the Formula tool, but please note that you can use the Find Replace tool for replacing text as well. You can view the Find Replace tool mastery article here and another helpful KB on parsing with this tool here.
Sometimes, you will have a column of text that has a distinct format due to the origin of the data. It might be web-scraped data and contain XML tags, or it could be data from an API in JSON format. For these cases, you can leverage tools to parse the data into rows and columns format that is easier to work with in Alteryx.
If your data is in XML format, you can use the XML Parse tool to parse the data into individual fields. Below is what the data looks like before and after the XML parse tool in the tool example:
This discussion thread solution shows how parsing XML can be an iterative process as you move from element to element. And as you can see from this solution, these problems can require quite a few XML Parse tools:
You can use the JSON Parse tool to separate JavaScript Object Notation (JSON) text into a table schema for downstream processing. Here are the input and output from the tool in the example workflow, so you can get a sense of what you can accomplish:
Check out this discussion solution, which shows how you can use the Cross Tab tool(s) to get the output of the JSON Parse tool into a more workable format. And for a more complex JSON parsing problem, review the solution to this discussion thread.
What is RegEx? It’s short for Regular Expressions; these are text patterns used to find, match, replace and extract characters within text fields. RegEx is our most powerful tool for text parsing, as it offers the most flexibility.
RegEx is a tough topic—it is covered in the Alteryx Advanced Certification. If you have trouble with the RegEx tool, you are not alone! The rest of this guide is dedicated to walking through the tool, examples, and strategies to use when you are solving your most challenging text parsing problems.
For shorter lessons on learning how to use Regular Expressions in Alteryx, go through these interactive lessons:
For a longer (1 hour) instructor-led training, watch the Introduction to Regular Expressions video.
Once you have watched these videos, review the help documentation for the RegEx tool, and don’t forget to check out the Tool Example file in Designer.
We also have the RegEx Tool Mastery article, which breaks down some examples for each configuration of the tool. Additionally, there is a video about 20 Token RegEx Expressions For The Newbie to help you learn common expressions.
Regex101.com is a great website for testing and troubleshooting your regular expressions. It also explains the expression and has a quick reference section for common tokens and more.
Another external resource linked in the help documentation for RegEx is the RegEx Coach. This is an application that you can download to experiment with regular expressions interactively.
Below are RegEx problem solutions from the Designer discussion forum. Review these before posting your RegEx question, as someone may have asked a similar one already!
The replace configuration for the RegEx tool will replace the expression you searched for with a second expression. Here are some examples:
Parsing separates the expression into new columns. You can then set the Name, Type, and Size of the new columns. A new column is created in the Output Columns table. Below are some parsing examples:
The match option appends a column containing a number: 1 if the expression matched what you specified, 0 if it did not. Here are some discussion threads with RegEx matching:
The tokenize option splits the incoming data using a regular expression. This option works similarly to the Text To Columns tool, except instead of matching and removing what you do not want, you match for what you want to keep. You want to match to the whole token, and if you have a marked group, only that part is returned. To see this option in action, review this discussion thread: Tokenize Group of Words to Column.
Additionally, we have some blogs that dive deeper into RegEx problems. For a financial use case, you can read Verifying Tax IDs with REGEX. And for a tutorial on cleaning XML data, there’s this article: RegEx for the Win: XML Tag Clean-Up.
Stuck at an error in your parsing problem? Learn from the solutions to these errors that other users saw:
Do you have a discussion forum thread, blog, or other content on Community that has helped you when parsing text? Drop it in the comments below.
Great article thanks @MeganBowers ! Knowledge about text parsing never enough 👨🎓
@MeganBowers Thanks a ton.
All at one place, I have bookmarked this page for my future reference.
Would recommend everyone to go through in detail.
This really is an incredible resource @MeganBowers - thank you for the care and time that went into this.
awesome resource Megan. Highly recommended!!