Data Science

Machine learning & data science for beginners and experts alike.

In the 2020.2 release of Designer, we added natural language processing (NLP) capabilities in the new Alteryx Intelligence Suite. While working on these tools, we decided we would be morally remiss if we didn’t include a tool that could help our users clean up their text data and prepare it for analysis. This is why we created the Text Pre-processing tool.

 

[Image: Text Pre-processing tool icon]

 

Don’t tell any of the other Text Mining tools, but I think the Text Pre-processing tool is my favorite. It has a bright and shiny future full of possibilities and can play a critical role in your current text-mining processes. I cannot overstate the importance of having well-formatted, accurate, and clean data, and the Text Pre-processing tool adds many common text-preparation capabilities to Designer.

 

Keeping that in mind, in this mini-blog-series I introduce common text pre-processing steps and how they can be implemented in Alteryx Designer (mostly with the Text Pre-processing tool, but also with some of the classic Designer tools we know and love).

 

Today, we'll be exploring the fundamental concept of tokenization, and how we can filter out tokens (like stopwords)!

 

Tokenization

 

Let’s start by taking a moment to think about what text data is. In Designer, the columns containing text data have a “string” data type. A string is a single sequence of characters, including spaces, digits, and letters. Designer thinks about a string as a single object. This isn’t consistent with how we think about and use language, where we deal with words or phrases as individual concepts, separated by spaces and punctuated by (wait for it…) punctuation. These words and phrases are referred to as tokens in the world of natural language processing (much like sneaking vegetables into a spaghetti Bolognese, I like to sneak vocabulary into my blog posts and then mask it with gifs).

 

A critical first step in any natural-language-processing (NLP) project is tokenization — splitting a string of characters into a list of tokens. This allows downstream NLP algorithms to treat each word or phrase individually, as opposed to dealing with every piece of text as a totally unique string of characters.
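To make the idea concrete, here is a minimal tokenizer sketched in Python with a regular expression. This is not how the Text Pre-processing tool does it (it relies on a full NLP pipeline under the hood); it just illustrates splitting a string into word and punctuation tokens:

```python
import re

def tokenize(text):
    # Match either runs of word characters ("\w+") or any single
    # character that is neither a word character nor whitespace
    # (i.e., punctuation), so punctuation becomes its own token
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("Reductive and tight, it will open up.")
# ['Reductive', 'and', 'tight', ',', 'it', 'will', 'open', 'up', '.']
```

Note how the comma and period come out as separate tokens rather than clinging to the words next to them.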

 

 

[GIF: Your computer when you don't tokenize your text data.]

 

 

Under the hood, tokenization is the very first thing the Text Pre-processing tool does to your text data. As a part of this process, the tool also expands contractions into separate tokens, so “hasn’t” becomes “has” and “n’t”, and “I’ve” becomes “I” and “‘ve”.
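As a rough sketch of contraction expansion, the rule can be approximated with a regex over common English contraction suffixes. The actual tool almost certainly uses its NLP library's exception tables rather than a pattern like this, so treat this as an illustration only:

```python
import re

# Common English contraction suffixes; a real tokenizer uses a
# curated exception table, not just a suffix list
SUFFIXES = r"(n't|'ve|'re|'ll|'d|'m|'s)"

def split_contractions(tokens):
    out = []
    for tok in tokens:
        m = re.match(r"(.+?)" + SUFFIXES + r"$", tok, re.IGNORECASE)
        if m:
            # "hasn't" -> "has" + "n't"; "I've" -> "I" + "'ve"
            out.extend([m.group(1), m.group(2)])
        else:
            out.append(tok)
    return out

split_contractions(["I've", "hasn't", "wine"])
# ['I', "'ve", 'has', "n't", 'wine']
```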

 

This means that with the default configuration, the output of the tool is a new field with the suffix “_processed,” with a space between all identified tokens (e.g., commas have spaces on both sides), including expanded contractions.  

 

 

description:
A mix of blueberry, blackberry and cherry fruit can't completely mask the earthy, stemmy streak that runs through this wine. Reductive and tight, it will require substantial airing to open up.

description_processed:
A mix of blueberry , blackberry and cherry fruit ca n't completely mask the earthy , stemmy streak that runs through this wine . Reductive and tight , it will require substantial airing to open up .

 

 

Token Filtering

 

With our text split into a stream of tokens, we can start filtering out any tokens that we might not find helpful for our application. In the Text Pre-processing tool, we currently have the option to filter out digit, punctuation, and stopword tokens (we address stopwords in the next section).

 

Digit tokens are individual tokens that only contain digit characters, so “42” is flagged as a digit token and “10:30” is not.

 

The same is true for punctuation tokens: individual tokens that only contain punctuation characters, like “.” or “,” or “---”, are filtered out, but tokens that contain punctuation, like “B3-18”, are not.
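The token-level logic described above can be sketched in a few lines of Python. This mirrors the behavior as documented here (whole tokens are dropped only when every character is a digit or every character is punctuation), not the tool's internal code:

```python
import string

def filter_tokens(tokens, drop_digits=True, drop_punct=True):
    kept = []
    for tok in tokens:
        # Drop tokens made up entirely of digits: "42" goes, "10:30" stays
        if drop_digits and tok.isdigit():
            continue
        # Drop tokens made up entirely of punctuation: "," and "---" go,
        # "B3-18" stays because it also contains letters and digits
        if drop_punct and all(c in string.punctuation for c in tok):
            continue
        kept.append(tok)
    return kept

filter_tokens(["42", "10:30", ",", "---", "B3-18"])
# ['10:30', 'B3-18']
```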

 

This is intentional behavior, but if you prefer to remove all digit or all punctuation characters, it is straightforward to do so with the Data Cleansing, Formula, or RegEx tools.

 

In the Data Cleansing tool, you can check whichever characters you’d like to drop under the Remove Unwanted Characters section.

 

Check Numbers to drop digit characters, and Punctuation to drop all punctuation characters. It is worth noting that although the Data Cleansing tool considers currency characters like "$" to be punctuation, the Text Pre-processing tool does not (it classifies them as "currency") and will not filter them out even if they exist as their own tokens.

 

If you feel like being fancy, you can write a REGEX_Replace() function in the Formula tool to replace all punctuation or digit characters with nothing, or you can use the RegEx tool in Replace mode.
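The character-level replacement (as opposed to the token-level filtering above) looks roughly like this in Python; the equivalent Formula-tool expression would be something like REGEX_Replace([description], "[[:punct:]]", ""), though check the Alteryx documentation for the exact character-class syntax it supports:

```python
import re

text = "A mix of blueberry, blackberry and cherry fruit can't mask it."

# Drop every punctuation character, i.e., anything that is neither a
# word character nor whitespace
no_punct = re.sub(r"[^\w\s]", "", text)
# -> "A mix of blueberry blackberry and cherry fruit cant mask it"

# Drop every digit character, even inside mixed tokens like "B3-18"
no_digits = re.sub(r"\d", "", "B3-18 costs 42 dollars")
# -> "B- costs  dollars"
```

Notice the difference from token filtering: here “can’t” becomes “cant” and “B3-18” becomes “B-”, because individual characters are stripped regardless of what token they belong to.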

 

 

Stopwords (More Token Filtering)

 

Stopwords are words that don’t necessarily add meaning to a body of text but are necessary for the text to be grammatically correct. Think about words like “the” or “a” — these words aren’t providing much information, but we use them ALL THE TIME.

 

For algorithms like LDA (Latent Dirichlet Allocation) where individual words are sorted into topics, stopwords aren’t going to be very helpful or meaningful for defining a topic or sorting a document into a topic. Similarly, if we are doing simple word counts, or trying to visualize our text with a word cloud, stopwords are some of the most frequently occurring words but don’t really tell us anything. We’re often better off tossing the stopwords out of the text. By checking the Filter Stopwords option in the Text Pre-processing tool, you can automatically filter these words out.
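Stopword removal is just membership testing against a word list. The sketch below uses a tiny hand-picked set for illustration; the default lists the tool ships with contain several hundred words per language:

```python
# A deliberately tiny stopword set for illustration only; real lists
# are much longer
STOPWORDS = {"the", "a", "an", "and", "it", "of", "to", "is", "this", "that"}

def remove_stopwords(tokens):
    # Compare case-insensitively so "The" is filtered like "the"
    return [t for t in tokens if t.lower() not in STOPWORDS]

remove_stopwords(["Reductive", "and", "tight", "it", "will", "open", "up"])
# ['Reductive', 'tight', 'will', 'open', 'up']
```

After this step, word counts and word clouds highlight the content-bearing words instead of the grammatical glue.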

 

The tool automatically filters out default stopwords based on the specified language. Here you can find the lists of default stopwords by language:

 

 

If your text contains stopwords that are not included in the default stopwords list, you can manually filter them out by entering them in the tool as a comma-separated list.
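Conceptually, the tool just has to turn that comma-separated entry into extra stopwords before filtering. A plausible sketch (not the tool's actual parsing code) looks like this:

```python
def parse_custom_stopwords(csv_string):
    # Split on commas, trim stray whitespace, normalize to lowercase,
    # and drop empty entries from trailing commas
    return {w.strip().lower() for w in csv_string.split(",") if w.strip()}

custom = parse_custom_stopwords("wine, fruit , Blueberry")
# {'wine', 'fruit', 'blueberry'}
```

The resulting set would simply be unioned with the default stopword list for the chosen language.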

 

In the next post, we will be discussing text-normalization techniques in Alteryx. Stay tuned!

Sydney Firmin

A geographer by training and a data geek at heart, Sydney joined the Alteryx team as a Customer Support Engineer in 2017. She strongly believes that data and knowledge are most valuable when they can be clearly communicated and understood. In her current role as a Sr. Data Science Content Engineer, she gets to spend her days doing what she loves best; transforming technical knowledge and research into engaging, creative, and fun content for the Alteryx Community.


Comments
8 - Asteroid

Hi Sydney,

 

Thanks for the article. I'm going to have to find some time to give this new capability a test drive. A few years ago, I did my own version of NLP using Alteryx. 

 

Here is my original article. It contains a couple of instructional videos: https://datablends.us/2017/02/01/how-alteryx-and-tableau-can-produce-word-clouds/

 

This is another article along the same lines: https://datablends.us/2016/08/04/text-analytics-improved-with-tableau-clustering/

 

I'm really excited to learn what you already know about these capabilities.

 

Ken