Data Science

SydneyF · ‎08-03-2020

In the 2020.2 release, we added the Topic Modeling tool to Designer as a part of the Alteryx Intelligence Suite (AIS). It is a powerful tool but requires some background knowledge to use it to its full potential. In this series, I provide a gentle introduction to topic modeling and the new topic modeling tool in Alteryx. Missed Part 1 - What is LDA? Read it here.

Now that we know the algorithm driving the tool, we can move on to the tool basics...

Configuration

There are two minimum requirements for configuring the Topic Modeling tool. You need to specify the Text Field you’d like to create a topic model for, and you need to select the Number of Topics that you’d like to have the topic model generate. Those are the first two options in the configuration panel.

Knowing the “right” number of topics to generate for a corpus is tricky. It depends on your dataset and your application. Because LDA is an unsupervised algorithm (more on this in the limitations section), there is no definitive way to evaluate the model’s performance and tune it accordingly. There are a few approaches analysts can use, such as HDP-LDA, a variant of LDA that infers the number of topics from the data. Importantly, none of these approaches guarantees an interpretable topic model. For that reason, we recommend that you start with a number of topics that feels intuitive for your use case and data, generate a topic model, and then adjust the number of topics until you train a topic model that is interpretable.

The Output Options section allows you to change the output generated from the R anchor. Selecting Interactive Chart produces an interactive visualization of the model that you can view with a Browse tool. Word-Relevance Summary returns the words included in the topic model as well as Relevance and Saliency metrics. You can read more about both outputs in the following blog sections.

The Dictionary Options are applied during the tokenization phase of the Topic Modeling tool, where the tool splits the text input into lists of individual words (i.e., tokens) and creates a master list of terms that the topic model sorts into topics. This is what we call a "dictionary" in the NLP world; it is a bank of words that make up the corpus being analyzed. The dictionary options allow you to limit which words are included in the dictionary, and therefore which words the topic model considers.

The Min Frequency option allows you to filter out rare words that only appear in a few documents in your corpus. The idea here is that words that do not occur in many documents won’t be helpful for identifying generalized topics. This option filters based on the percentage of documents a word occurs in, so if you set the value to 0.01, all words that occur in less than 1% of the documents are excluded from the dictionary.

Max Frequency is similar but is used to filter out words that appear too frequently in the corpus. Like rare words, common words are not helpful for creating distinct topics; these are words like “the,” “and” and “or.” Although many of those words are removed by filtering stop words, filtering this way is helpful because it also captures stop words that are specific to the corpus. If you set the value to 0.8, all words that occur in 80% or more of the documents are excluded.

The final dictionary option, Max Words, puts a hard cap on how many words are included in the dictionary. After the filtering options are implemented, the dictionary takes the most frequent words and throws out the rest. This option can reduce the model’s training time.
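To make the three dictionary options concrete, here is a minimal sketch in plain Python. It is not the tool's actual code: the function name `build_dictionary`, its parameter names, and the toy documents are all made up for illustration, and the thresholds simply follow the descriptions above (keep a word if its document share is at least the minimum and below the maximum, then cap the list by overall frequency).

```python
from collections import Counter

def build_dictionary(docs, min_frequency=0.0, max_frequency=None, max_words=None):
    """Return the terms kept after frequency filtering (hypothetical helper).

    docs: list of token lists, one list per document.
    """
    n_docs = len(docs)
    # Document frequency: how many documents each term appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Total occurrences, used to rank terms for the max_words cap.
    term_count = Counter(term for doc in docs for term in doc)

    kept = []
    for term, df in doc_freq.items():
        share = df / n_docs
        too_rare = share < min_frequency
        too_common = max_frequency is not None and share >= max_frequency
        if not too_rare and not too_common:
            kept.append(term)

    # Most frequent first; ties broken alphabetically so the result is stable.
    kept.sort(key=lambda t: (-term_count[t], t))
    return kept if max_words is None else kept[:max_words]

docs = [
    "the penguin ate the fish".split(),
    "the puppy ate the pie".split(),
    "the pie was a cherry pie".split(),
    "the penguin likes the pie".split(),
]
vocab = build_dictionary(docs, min_frequency=0.5, max_frequency=0.8, max_words=3)
print(vocab)  # ['pie', 'ate', 'penguin']
```

In this toy corpus, “the” is dropped because it appears in every document, and words such as “fish” and “cherry” are dropped because they appear in only one of the four.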

Finally, we have the LDA hyperparameters. Together with the dictionary options above, they make up the two major groups of “advanced options” in the Topic Modeling tool: options related to the dictionary generated for the model, and hyperparameters for the algorithm itself. As we discussed in the How LDA Works? section, the topic model needs a few things to go off in order to work backwards from a set of documents. These are called priors, and they are the Alpha and Eta options in the configuration panel of the Topic Modeling tool. Alpha controls the assumed mixture of topics in each document. Eta controls the distribution of words per topic (eta is also referred to as beta in the literature, but because we are using the scikit-learn implementation of the algorithm, we use the term eta).

First, let’s talk about alpha. Do you remember when we said each document is a mixture of topics from the topic recipe book? Here is a quick visualization of four documents in a three-topic model.

The first document is all about penguins, the second document all about pies, the third document is half about puppies and half about pies, and the fourth document is equally about all three topics.

With alpha, we are describing what kinds of documents we think make up our corpus. Do our documents mostly contain one or two topics (points on the edges of the triangle), or do our documents contain three topics discussed in equal proportions? With an alpha close to 1, we are saying that documents can come from anywhere in our topic triangle: they are as likely to be on the edges as they are in the middle. With an alpha less than 1, we are saying our documents are likely to be more focused and to contain only one or two topics. With an alpha greater than 1, we are saying all of our documents are likely to come from the middle of the triangle.

https://stats.stackexchange.com/questions/244917/what-exactly-is-the-alpha-in-the-dirichlet-distribution

Eta works the same way as alpha but describes the prior over words for each topic, rather than the prior over topics for each document. A value greater than 1 indicates that topics tend to be made up of a large number of words, whereas a value less than 1 indicates that the topics tend to be more focused, having fewer, more dominant words.
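If you would like to see this effect for yourself, the sketch below simulates topic mixtures for a three-topic model under different alpha values. It is an illustration only, not what the tool runs: it assumes a symmetric Dirichlet prior, draws from it with Python's standard `random.gammavariate`, and uses made-up helper names (`sample_dirichlet`, `mean_dominant_share`). The same reasoning applies to eta if you read “topics in a document” as “words in a topic.”

```python
import random

def sample_dirichlet(alpha, n_topics, rng):
    """Draw one topic mixture from a symmetric Dirichlet(alpha) prior."""
    # A Dirichlet draw is a set of independent Gamma(alpha, 1) draws, normalized.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_topics)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_dominant_share(alpha, n_topics=3, n_docs=2000, seed=0):
    """Average share of the single largest topic across simulated documents."""
    rng = random.Random(seed)
    shares = (max(sample_dirichlet(alpha, n_topics, rng)) for _ in range(n_docs))
    return sum(shares) / n_docs

for alpha in (0.1, 1.0, 10.0):
    share = mean_dominant_share(alpha)
    print(f"alpha={alpha}: mean dominant-topic share = {share:.2f}")
```

The smaller the alpha, the closer the dominant topic's average share gets to 1 (documents sit near a corner of the triangle); the larger the alpha, the closer it gets to one third (documents sit near the middle).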

If you’re feeling up to it, you can get deep into the weeds with priors here.

You can set alpha and eta to any positive value. You might think it is strange, then, that by default they are set to 0. Setting these values to 0 in the tool sets them to a default fraction: 1 divided by the number of topics you have configured the topic model to find. Typically, if the default values aren’t working for you, you can try setting alpha and eta to lower values (e.g., 0.1 for alpha and 0.001 for eta) and see if you get more focused, coherent results.
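The rule for the 0 setting is simple enough to write down in a couple of lines. This is a sketch of the behavior described above, not the tool's source; `resolve_prior` is a hypothetical name.

```python
def resolve_prior(value, n_topics):
    """Map the 0 setting to 1 / n_topics; pass positive values through."""
    if value < 0:
        raise ValueError("alpha and eta must not be negative")
    return 1.0 / n_topics if value == 0 else float(value)

print(resolve_prior(0, 5))      # 0.2
print(resolve_prior(0.001, 5))  # 0.001
```

So in a five-topic model, leaving alpha at 0 is the same as setting it to 0.2, which is already below 1 and therefore already leans toward focused documents.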

Speaking of Results

The fun part has yet to come! The final post in this series will guide you through interpreting all the cool graphs in the Interactive Visualizations, and what those metrics mean to your results.


Getting to the Point with Topic Modeling | Part 2 - How to Configure the Tool
