Welcome to Part Two of our introduction to Text Pre-processing in Alteryx Designer. To learn about tokenization and stopwords, check out Part One!
Language is a beautiful thing. Often there are many ways to express an idea, and language gives us the ability to subtly bend and shape the meanings of what we communicate. Unfortunately, for a computer, the flexibility and ambiguity of human language can be a nightmare.
Think about how we communicate with computers using programming languages; often each piece of information (e.g., a variable or function) can only mean one thing. Programming languages are intentionally more explicit with meaning than natural language so as to remove ambiguity and confusion.
Now, think about human language. We often slightly modify words to make them sound right based on the context: sometimes we change words to describe a quantity (e.g., cat vs. cats), gender (e.g., her vs. him), or ownership (e.g., they vs. theirs). This is very confusing for a computer. In fact, for many algorithms, such as LDA, each unique string of characters is treated as an entirely independent token, no matter how similar two strings are.
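To make that concrete, here is a minimal Python sketch of how a simple bag-of-words count sees text. The sentence is made up for illustration, and splitting on whitespace stands in for a real tokenizer.

```python
from collections import Counter

# Hypothetical toy sentence; whitespace splitting stands in for real tokenization.
text = "the cat sat while two cats watched the Cat"
counts = Counter(text.split())

# "cat", "cats", and "Cat" are three unrelated vocabulary entries to the computer.
print(counts["cat"], counts["cats"], counts["Cat"])  # 1 1 1
print(len(counts))  # 8 distinct tokens
```

Nothing in those counts tells an algorithm that three of the eight entries refer to the same animal.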
One of the best things we can do to help computers understand human language is to convert our tokens to consistent representations. This is a type of text normalization, and it is exactly what stemming and lemmatization attempt to do.
Your computer appreciates that the plural of moose is moose.
Lemmatization converts words to their dictionary form, so words like “running,” “runs,” “ran,” and “run” all become the lemma “run.” You can implement lemmatization in the Text Pre-processing tool by checking the Convert to Word Root (Lemmatize) option under Text Normalization.
Stemming is a related concept that simply chops off the suffixes (ends) of words to try to get them to a consistent representation. For example, “running” and “runs” would both become “run,” and “tradition” and “traditional” would both become “tradit.” Stemming is computationally much quicker than lemmatization, but its results are less human-readable.
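If you are curious what "chopping off suffixes" looks like, here is a deliberately crude sketch in Python. The suffix list and the `crude_stem` function are invented for this example. This is not the Porter algorithm or any stemmer that ships with a real library, just a toy that happens to reproduce the examples above.

```python
# Toy suffix list, tried in order (longer suffixes first). Purely illustrative.
SUFFIXES = ("ional", "ing", "ion", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # After removing "ing", undo consonant doubling ("runn" -> "run").
            if suffix == "ing" and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

for w in ["running", "runs", "tradition", "traditional"]:
    print(w, "->", crude_stem(w))
# running -> run
# runs -> run
# tradition -> tradit
# traditional -> tradit
```

Notice that the function never looks anything up in a dictionary, which is why stemming is fast and also why it happily produces non-words like "tradit."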
The Text Pre-processing tool only performs lemmatization. This is because the underlying package, SpaCy, is opinionated and attempts to prevent redundancy in functionality. In the eyes of the creators of SpaCy, lemmatization is better than stemming, so lemmatization is in the package, while stemming is not.
One thing that might catch you off-guard with lemmatization is that it treats different parts of speech differently. So while “directing,” “directs,” and “direct” all share the lemma “direct” (and are converted to “direct” when lemmatized), “directly” is not converted to the lemma “direct” because it is an adverb and does not share the same lemma as the verb “direct.”
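One way to picture this behavior is as a dictionary look-up keyed on both the word and its part of speech. The sketch below is a hand-written toy table, not how SpaCy or the Text Pre-processing tool is implemented. The `LEMMAS` table and `lemmatize` function are hypothetical, and the `VERB` and `ADV` labels are borrowed from the Universal POS tag set.

```python
# Hypothetical hand-written lemma table keyed by (word, part of speech).
# A real lemmatizer relies on a much larger vocabulary and rule set.
LEMMAS = {
    ("running", "VERB"): "run",
    ("runs", "VERB"): "run",
    ("ran", "VERB"): "run",
    ("directing", "VERB"): "direct",
    ("directs", "VERB"): "direct",
    ("directly", "ADV"): "directly",
}

def lemmatize(word, pos):
    """Return the lemma for (word, pos), or the lowercased word if unknown."""
    key = (word.lower(), pos)
    return LEMMAS.get(key, word.lower())

print(lemmatize("ran", "VERB"))       # run
print(lemmatize("directs", "VERB"))   # direct
print(lemmatize("directly", "ADV"))   # directly
```

Because the adverb has its own entry, "directly" stays "directly" even though it looks like it should collapse into "direct."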
A common use case in cleansing text data is harmonizing multiple representations of a word into a single form. You might be thinking, “yeah, lemmatization, we talked about that,” and you’re totally right. This is a same-same-but-different use case.
Let’s think about abbreviations. Let’s start with ROI — return on investment. You might have a body of text that refers to the concept of ROI both as “ROI” and the fully written out “return on investment”. This is the same idea being expressed in two different ways, and a computer won’t be able to identify them as a single concept without a little bit of help from you. You can use a Find Replace tool for this use case. This is particularly handy because you can specifically define your list of strings to replace.
So if we have text like this:
Text |
NLP is a subfield of linguistics, computer science, information engineering, and AI. |
One NLP algorithm that's pretty neat is LDA. |
And a look-up list like this:
Acronym | Replace |
NLP | Natural Language Processing |
AI | Artificial Intelligence |
LDA | Latent Dirichlet Allocation |
We can use the Find Replace tool:
And get our “normalized” output:
Text |
Natural Language Processing is a subfield of linguistics, computer science, information engineering, and Artificial Intelligence. |
One Natural Language Processing algorithm that's pretty neat is Latent Dirichlet Allocation. |
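For readers who think in code, the same look-up-and-replace idea can be sketched in a few lines of Python. This is an illustration of the concept, not a description of how the Find Replace tool works internally, and the `expand_acronyms` helper is a made-up name. It assumes you only want to replace whole words, so it wraps each acronym in word boundaries.

```python
import re

# The look-up list from the example above.
lookup = {
    "NLP": "Natural Language Processing",
    "AI": "Artificial Intelligence",
    "LDA": "Latent Dirichlet Allocation",
}

# Longest keys first, wrapped in \b so "AI" is not replaced inside a longer word.
keys = sorted(lookup, key=len, reverse=True)
pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")

def expand_acronyms(text):
    """Replace every whole-word acronym with its spelled-out form."""
    return pattern.sub(lambda m: lookup[m.group(1)], text)

rows = [
    "NLP is a subfield of linguistics, computer science, information engineering, and AI.",
    "One NLP algorithm that's pretty neat is LDA.",
]
for row in rows:
    print(expand_acronyms(row))
```

The word-boundary check is a design choice worth copying in any tool: without it, a short acronym like "AI" could be replaced in the middle of an unrelated word.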
Another application similar to this is replacing HTML tags from scraped text or character encodings that don’t translate quite the way you would have hoped. If you have a look-up list of the encoding translations, you’re made in the shade (I’ve attached one for HTML and one for UTF-8 to help you get started).
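If you ever do this clean-up in Python instead, the standard library already knows the HTML entity translations, so you may not need a look-up list at all. The scraped string below is invented for the example, and the regular expression is a rough tag stripper that assumes simple, well-formed tags.

```python
import html
import re

# Hypothetical snippet of scraped text with tags and HTML entities.
scraped = "<p>Caf&eacute; &amp; bar</p>"

# Rough tag removal: drop anything between "<" and ">".
no_tags = re.sub(r"<[^>]+>", "", scraped)

# html.unescape converts named and numeric entities to their characters.
clean = html.unescape(no_tags)
print(clean)  # Café & bar
```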
Another type of text normalization that is very specifically not included in the Text Pre-processing tool is the ability to force upper or lower casing for all characters in your text. It is a common step in text-data preparation to force all letters to lowercase, because a computer cannot recognize tokens like “PARTY,” “party,” and “Party” as the same word.
The ability to control character casing is not included in the Text Pre-processing tool because we have other fabulous tools that already do this, namely the Data Cleansing and Formula tools.
In the Data Cleansing tool, just check the Modify Case option at the bottom of the configuration panel and then specify the case type you’d like to see (UPPER CASE, lower case, or Title Case).
In the Formula tool, you can use the functions Uppercase(), Lowercase(), or Titlecase() for the same outcome 😊.
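To see why this step matters, here is the same idea as a tiny Python sketch, using Python's built-in string methods rather than the Designer functions.

```python
tokens = ["PARTY", "party", "Party"]

# Before normalization the computer sees three different tokens.
print(len(set(tokens)))  # 3

# After lowercasing, they collapse into a single token.
lowered = [t.lower() for t in tokens]
print(set(lowered))  # {'party'}

# The other two casings, for comparison.
print("party".upper(), "party".title())  # PARTY Party
```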
So concludes our very brief two-part introduction to Text Pre-processing in Designer. In my mind and heart (and please note that I am NOT a product manager, and I am NOT discussing our roadmap), this is only the beginning for the text-processing (and pre-processing) capabilities available in Designer. The SpaCy package that the Text Pre-processing tool is based on has a world of capabilities, including named entity recognition (NER) and part-of-speech (POS) tagging.
Believe it or not, this is the type of stuff that gets me out of bed in the morning (NLP and my hangry dog whining in my ear), and I’m excited to continue working on delivering these features in the Alteryx Platform. Hopefully, you’re as excited about the subtle beauty of text pre-processing (and the larger world of natural language processing) as I am.
A geographer by training and a data geek at heart, Sydney joined the Alteryx team as a Customer Support Engineer in 2017. She strongly believes that data and knowledge are most valuable when they can be clearly communicated and understood. She currently manages a team of data scientists that bring new innovations to the Alteryx Platform.