There's a lot of unstructured data out there, but don't worry. The Alteryx Intelligence Suite Text Mining tools can help.
The first process we will review is identifying common topics, for which we will use three tools:
Topic modeling is a type of statistical model that scans documents, identifies word-use patterns, and groups those patterns into topics. Topic models help organize and summarize large collections of unstructured text, and they help analysts make sense of document collections (known as a corpus in the world of natural language processing) by identifying topics and grouping texts accordingly.
To carry it out, the first step is:
Before you start, it is good practice to clean up the text with the Data Cleansing tool: remove leading and trailing spaces, numbers, punctuation marks, and any unwanted whitespace, and convert all text to lowercase.
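Outside of Alteryx, the same cleanup steps can be sketched in a few lines of Python. The regular expressions below are an illustrative assumption about how to implement each step, not the Data Cleansing tool's internal logic:

```python
import re

def cleanse(text: str) -> str:
    """Roughly mirror the Data Cleansing steps: lowercase the text,
    drop digits and punctuation, and collapse unwanted whitespace."""
    text = text.lower()                       # change all text to lowercase
    text = re.sub(r"[0-9]", "", text)         # remove numbers
    text = re.sub(r"[^\w\s]", "", text)       # remove punctuation marks
    text = re.sub(r"\s+", " ", text).strip()  # leading/trailing/extra whitespace
    return text

print(cleanse("  The BIG-Budget Movie of 2021!!  "))
# -> "the bigbudget movie of"
```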
This tool offers several options for preparing the data before identifying topics.
The first step is to choose the language from:
In this example, we will process the [Text] field.
Next, we select whether to apply lemmatization (reducing each word to its root form). Lemmatization standardizes the text by converting words to their dictionary root, which makes grouping and analysis easier. For example, "running," "ran," and "runs" all reduce to "run."
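As a toy illustration of lemmatization, the sketch below maps inflected forms to their roots with a hand-made lookup table. A real pipeline (such as spaCy, which Alteryx uses) derives lemmas from a dictionary and part-of-speech context; this table is an assumption made only to show the idea:

```python
# Hypothetical, hand-made lemma table -- a real lemmatizer is far larger
# and uses linguistic context, not a fixed lookup.
LEMMAS = {
    "running": "run", "ran": "run", "runs": "run",
    "movies": "movie", "was": "be",
}

def lemmatize(tokens):
    """Replace each token with its root form when one is known."""
    return [LEMMAS.get(t, t) for t in tokens]

print(lemmatize(["the", "movies", "ran", "longer"]))
# -> ['the', 'movie', 'run', 'longer']
```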
Next, we apply filters to remove digits and punctuation marks.
The last step is to remove stop words. These are words filtered out before the text is processed for analysis. They are generally the most common words of the language, such as prepositions and pronouns, which occur frequently but contribute little to the meaning of the text and serve mainly grammatical purposes. Note that it is not possible to filter every "meaningless" word, and not every word that is removed is necessarily useless. The spaCy library's stop-word list is used.
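Stop-word filtering itself is simple to sketch. The short stop list below is an assumption for illustration; Alteryx relies on spaCy's much larger list:

```python
# Abbreviated stop-word list for demonstration only; spaCy's real list
# contains several hundred entries.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "it"}

def remove_stop_words(tokens):
    """Keep only tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["the", "plot", "of", "the", "movie"]))
# -> ['plot', 'movie']
```

Notice that a blanket list like this one would also discard "it," even when it is the title of a movie.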
Due to the complexity of language, some words can be useful or not depending on the objective. Take movie titles, for example: if we filter out "It," we lose a title; and if we remove numbers, we stop recognizing books like George Orwell's 1984.
If a word sneaks into your analysis, it means it is not included in the library's stop-word list; that is where you can add it manually to filter it out. If there are several, you can separate them with commas.
Another option, introduced in the 2021.1 release, is the ability to pass stop words directly from a data source or text file, which simplifies the process and eliminates manual input. The preprocessing step yields a new field with the text cleaned and ready for topic modeling.
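The two ways of extending the stop-word list described above, manual comma-separated entry and a text file or data source, can be sketched as follows. The default list and the one-word-per-line file format are assumptions for illustration:

```python
# Stand-in for the library's built-in stop-word list.
DEFAULT_STOP_WORDS = {"the", "a", "of"}

def extended_stop_words(manual="", source=None):
    """Combine the default list with manually entered words
    (comma-separated) and words read from a file-like source
    (assumed to be one word per line)."""
    words = set(DEFAULT_STOP_WORDS)
    words |= {w.strip().lower() for w in manual.split(",") if w.strip()}
    if source is not None:
        words |= {line.strip().lower() for line in source if line.strip()}
    return words

# Manual entry plus a simulated text-file source.
stops = extended_stop_words("Movie, film", ["cinema\n", "reel\n"])
print(sorted(stops))
# -> ['a', 'cinema', 'film', 'movie', 'of', 'reel', 'the']
```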
In Part 2, I will discuss how to begin the topic modeling process.