Data Science

Garabujo7 · 09-13-2021

There's a lot of unstructured data out there, but don't worry. The Alteryx Intelligence Suite Text Mining tools can help.

When you realize the breadth and variety of possibilities Alteryx offers you with Text Mining. (Image taken from Giphy and created with online-image-editor.)

The first process that we will review will be identifying common topics, for which we will use three tools:

Text Pre-processing
Topic Modeling
Word Cloud

Topic modeling is a type of statistical modeling that scans documents, identifies patterns of word use, and groups those words into topics. Topic models help analysts make sense of a large collection of unstructured text (known as a corpus in the world of natural language processing) by identifying topics and organizing the texts into groups.

To carry it out, the first step is:

Pre-data Cleaning

Before you start, it is good practice to clean up the text using the Data Cleansing tool to remove leading and trailing spaces, numbers, punctuation marks and any unwanted whitespace, and change all text to lowercase.
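For readers who want to see what this cleanup amounts to outside Designer, here is a minimal Python sketch. It is an illustration under stated assumptions, not the Data Cleansing tool's implementation: the function name `pre_clean` is hypothetical, and only ASCII punctuation is handled.

```python
import re
import string

# Map every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def pre_clean(text: str) -> str:
    """Lowercase the text, drop numbers and punctuation, and tidy whitespace."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)        # remove numbers
    text = text.translate(PUNCT_TO_SPACE)   # remove punctuation marks
    text = re.sub(r"\s+", " ", text)        # collapse repeated whitespace
    return text.strip()                     # remove leading/trailing spaces

print(pre_clean("  Alteryx makes Text-Mining EASY in 2021!!  "))
# alteryx makes text mining easy in
```

Replacing punctuation with a space, rather than deleting it, keeps hyphenated words such as "Text-Mining" from being fused into one token.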

Text Pre-processing

This tool offers several options for preparing the data before identifying topics.

The first step is to choose the language from the following:

English
German
Spanish
French
Portuguese
Italian

In this example, we will process the [Text] field.

Next, we select whether to apply lemmatization (reducing each word to its root form).

Lemmatization

Lemmatization standardizes the text by converting each word to its root form (its lemma), which makes the words easier to group and analyze. For example:

running, ran, runs, run

The lemma for all of these would be "run." For a deeper dive, see Text Normalization in Alteryx.
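The idea can be illustrated with a toy Python sketch. The lookup table and the `lemmatize` function are hypothetical stand-ins: a real lemmatizer relies on a full language-specific vocabulary and rules, which this example does not attempt to reproduce.

```python
# Hypothetical lookup table, for illustration only.
LEMMA_LOOKUP = {"running": "run", "ran": "run", "runs": "run"}

def lemmatize(word: str) -> str:
    """Return the lemma of a word, or the lowercased word if it is unknown."""
    key = word.lower()
    return LEMMA_LOOKUP.get(key, key)

forms = ["running", "ran", "runs", "run"]
print({lemmatize(w) for w in forms})  # every form collapses to the same lemma
```

Because all the inflected forms map to a single lemma, they are counted as one word in any later frequency or topic analysis.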

Tokenization

Another important step in pre-processing is tokenization: splitting a phrase, sentence, etc. into smaller units so that the text analysis process treats each word segment as an independent element. For a deeper dive, see Tokenization and Filtering Stopwords with the Text Pre-Processing Tool.
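As a rough illustration of tokenization, here is a minimal Python sketch based on a regular expression. The pattern and the `tokenize` name are assumptions made for this example; the Text Pre-processing tool's own tokenizer is more sophisticated.

```python
import re

# Assumption: a token is a run of word characters, optionally followed by
# an apostrophe and more word characters (so "don't" stays whole).
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?")

def tokenize(text: str) -> list:
    """Split text into word tokens, discarding punctuation and whitespace."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("There's a lot of unstructured data out there, but don't worry."))
```

Each element of the resulting list can then be filtered, counted, or grouped on its own.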

Next, we choose whether to filter out digits and punctuation marks.

Stop words

The last step is to remove the stop words. These are the most common words of the language (such as prepositions and pronouns): they are frequent and provide grammatical validity, but they contribute little to the meaning of the text, so they are filtered out before the text is analyzed. It is not possible to filter out every "meaningless" word, and the words that are removed are not necessarily useless. The spaCy library is used for this.

Because language is complex, some of these words can even be useful, depending on the objective.

Take movie titles, for example: if we filter out "it," we lose the title It, and if we remove numbers, we stop considering books like George Orwell's 1984, to name a few.
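The trade-off can be seen in a small Python sketch. The stop word list here is a hypothetical handful of words, not the library's list, and the `keep` parameter is an invented convenience for this example rather than an option of the tool.

```python
# Hypothetical mini stop word list, for illustration only.
STOP_WORDS = {"the", "a", "of", "it", "and", "in"}

def remove_stop_words(tokens, stop_words=STOP_WORDS, keep=frozenset()):
    """Drop tokens found in stop_words, except those listed in keep."""
    return [t for t in tokens
            if t.lower() not in stop_words or t.lower() in keep]

print(remove_stop_words(["The", "Lord", "of", "the", "Rings"]))  # ['Lord', 'Rings']
print(remove_stop_words(["It"]))               # []     : the title disappears
print(remove_stop_words(["It"], keep={"it"}))  # ['It'] : protected explicitly
```

Whether a word like "it" is noise or signal depends on the corpus, which is why the stop word list is worth reviewing for each project.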

Additional stop words

If an unwanted word sneaks into your analysis, it means that it is not included in the library's list of stop words; this is where you can add it manually to filter it out. If there are several, separate them with commas.

Another option, introduced in the 2021.1 release, is the ability to pass stop words directly from a data source or text file, which makes the process easier and eliminates manual input. This process yields a new field with the preprocessed text, ready to use for topic modeling.

In Part 2, I will discuss how to begin the topic modeling process.
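The two ways of supplying extra stop words, a comma-separated entry or a list read from a source, can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes one stop word per line in the source; it does not describe how the tool reads its input.

```python
from typing import Iterable, Set

def parse_stop_words(raw: str) -> Set[str]:
    """Parse a comma-separated entry such as 'alteryx, tool, workflow'."""
    return {w.strip().lower() for w in raw.split(",") if w.strip()}

def read_stop_words(lines: Iterable[str]) -> Set[str]:
    """Read stop words from any iterable of lines (e.g. an open text file)."""
    return {line.strip().lower() for line in lines if line.strip()}

manual = parse_stop_words("alteryx, tool ,  workflow,")
from_source = read_stop_words(["Data\n", "\n", "analysis\n"])
extra = manual | from_source  # combined set of additional stop words
print(sorted(extra))
```

With a real file, `read_stop_words` could be called on the object returned by `open(...)`, since a text file iterates line by line.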

Banner image by kalhh

mceleavey · 09-13-2021

Hi @Garabujo7 , top stuff.

I may be getting ahead of you (it's only part one!) but could you comment about the filtering of punctuation when using the sentence level modelling?

ie, is punctuation used to determine sentence structures?

Thanks,

M.

Garabujo7 · 09-13-2021

Hello @mceleavey ,

By the way, topic modeling works at the word level. Here is a reference on the algorithm used internally to do it, LDA (Latent Dirichlet Allocation):

It is one of the most popular topic modeling methods. Each document is made up of various words, and each topic also has various words belonging to it. The aim of LDA is to find topics a document belongs to, based on the words in it. Confused much? Here is an example to walk you through it.

What has changed in the new release is the ability to add a list of stop words from a file or catalog using the new input anchor of the Text Pre-processing tool.

Also, the generate phrases option applies only to the word cloud, not to topic identification.