A topic model is a type of statistical model that sweeps through documents, identifies patterns of word usage, and then clusters those words into topics. Topic models help organize large collections of unstructured text and offer insights for understanding them, helping analysts make sense of collections of documents (known as corpora in the NLP world) by identifying themes and organizing the texts into groups.
In the 2020.2 release, we added the Topic Modeling tool to Designer as a part of the Alteryx Intelligence Suite (AIS). It is a powerful tool but requires some background knowledge to use it to its full potential. In this blog, I provide a gentle introduction to topic modeling and the new topic modeling tool in Alteryx.
Many different topic modeling algorithms exist. Latent Dirichlet Allocation (often abbreviated to LDA) is one of the most popular topic modeling algorithms currently in use. The specific algorithm that the Topic Modeling tool in Alteryx uses is the scikit-learn implementation of LDA.
Basically, LDA is a type of fuzzy clustering algorithm, meaning that it sorts words into overlapping topics. The algorithm "clusters" because it forms the groups on its own, without the user pre-defining them, and the clustering is "fuzzy" because the groups of words that make up topics can overlap. After the algorithm performs that fuzzy clustering, it determines which documents contain which topics and then groups documents that contain similar topics.
How Does LDA Work?
As a starting point, I’ve found it helpful to think about the way that the algorithm assumes documents are written. LDA assumes that when a document is written, the author starts by plugging in the length that the document should be (e.g., 100 words) and the breakdown of topics that should be contained in the document (e.g., 50% Alteryx and 50% Puppies). LDA also assumes that there are “recipes” for each topic that can then be used to fill out the document. According to LDA, a topic is a group of words, each with an associated probability that describes how likely that word is to be selected when the topic is asked to produce a word.
The topics might look something like this.
Puppy | Kitten | Pie | Baking
Puppies (0.3) | Kitty (0.4) | Crust (0.25) | Flour (0.4)
Woof (0.3) | Meow (0.3) | Pie (0.25) | Oven (0.3)
Dog (0.2) | Kitten (0.2) | Pecan (0.25) | Sugar (0.2)
Bark (0.2) | Cat (0.1) | Apple (0.25) | Pie (0.1)
For example, say that we are going to write a 10-word document where 70% of the words are about puppies and 30% are about pie.
After we determine the inputs, we can generate the document by selecting 7 words from the puppy topic recipe and 3 words from the pie topic recipe. Using the probabilities of the words to decide which words get generated, the final document might look like this:
Puppies Crust Woof Puppies Bark Woof Pie Apple Dog Woof
That, in a nutshell, is how LDA “thinks” documents are written. Words are generated by topic according to their associated probabilities (e.g., crust has a 0.25 probability of being chosen each time the pie topic is asked to generate a word), and documents are written as a mixture of topics, using the topic recipes to fill them out.
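To make that generative story concrete, here is a minimal sketch of how a document could be "written" under these assumptions. The topic recipes and the 70/30 mixture are the toy values from above; the function name and library choice are just for illustration.

```python
import numpy as np

# Toy topic "recipes": each topic is a probability distribution over words
topics = {
    "Puppy": {"Puppies": 0.3, "Woof": 0.3, "Dog": 0.2, "Bark": 0.2},
    "Pie":   {"Crust": 0.25, "Pie": 0.25, "Pecan": 0.25, "Apple": 0.25},
}

rng = np.random.default_rng(seed=42)

def generate_document(length, topic_mixture):
    """Generate a toy document the way LDA assumes documents are written."""
    words = []
    for _ in range(length):
        # 1. Pick a topic according to the document's topic mixture
        topic_name = rng.choice(list(topic_mixture), p=list(topic_mixture.values()))
        # 2. Pick a word from that topic's recipe
        recipe = topics[topic_name]
        words.append(rng.choice(list(recipe), p=list(recipe.values())))
    return " ".join(words)

# A 10-word document that is 70% Puppy and 30% Pie
print(generate_document(10, {"Puppy": 0.7, "Pie": 0.3}))
```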
With this framework of how LDA "thinks" in mind, we can now consider how LDA identifies topics in a collection of documents. Given a corpus, LDA attempts to work backward from this assumed writing process to infer what the original topic recipes were.
To work backward, LDA starts by randomly assigning all of the words in a dictionary to each of the topics. Think of this as the starting position. The topic model isn’t sure which words belong to which topic, so it makes up a bunch of groups of words that it can go back to and correct as it gleans more information.
LDA reads through each document and moves words around between the topics until the documents that make up the corpus match the recipes, while assuming that each document contains a mixture of a few topics (how many topics and their composition is based on the hyperparameter alpha, discussed in the next post in the series).
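For a rough sense of what this looks like outside of Designer, here is a minimal sketch using scikit-learn's LatentDirichletAllocation, the implementation the Topic Modeling tool is built on. The tiny corpus and parameter values are placeholders for illustration, not Alteryx defaults.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# A toy corpus; in practice this would be your collection of documents
corpus = [
    "the puppy barked and the dog ran to the puppies",
    "bake the pie crust with flour sugar and pecans",
    "the kitten and the cat meow at the puppy",
    "apple pie fresh from the oven",
]

# LDA works on raw word counts (a bag of words), so vectorize first
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(corpus)

# Ask the model to look for 2 topics; random_state makes the run repeatable
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one row of topic proportions per document
```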
LDA (and topic models in general) is an unsupervised algorithm. This means that we don’t need to give it training data that contains pre-defined topics; we just provide a collection of documents and have the topic model figure it out on its own. Although this is nice, because we don’t need to go through the effort of hand-labeling a set of data for the algorithm to learn from, it also means that we can’t guarantee the outputs of the topic model — they might line up with what we are looking for, but they might not.
Topic Modeling algorithms create collections of words that the algorithms think are related based on patterns in the corpus, but the algorithms do not explain what the collections mean — this is the human interpretation part of using topic modeling, and it can be messy. It is important to note that documents assigned to a topic do not necessarily contain all of the words included in that topic.
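Continuing the scikit-learn sketch above, the closest thing the fitted model gives you to those recipes is a set of word weights per topic. Printing the top-weighted words for each topic is usually the starting point for that human interpretation step (this assumes a recent scikit-learn version with get_feature_names_out).

```python
# Inspect the top words per topic (the model's learned "recipes")
words = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:5]  # indices of the 5 highest-weighted words
    print(f"Topic {topic_idx}:", ", ".join(words[i] for i in top))
```

Deciding that one of those printed word lists "means" puppies and another "means" baking is still up to you; the model only supplies the groupings.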
LDA is also what is known as a “bag of words” model, a term used in NLP for approaches that ignore word order, grammar, and context: documents are treated as unordered collections of words. This means that each word is assumed to have only one meaning (e.g., the word “bank” could describe the bank of a river or a financial institution, but the model treats all instances of the word “bank” as the same and has no way to differentiate them) and that different words are not expected to be related in any way (e.g., LDA treats “run” and “running” as entirely independent, unrelated terms). This is why text pre-processing steps like lemmatization are important for creating meaningful and effective topic models.
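As a quick illustration of why lemmatization helps a bag-of-words model, here is a small sketch using NLTK's WordNetLemmatizer. This is just one of several lemmatizers you could use, and it assumes the WordNet data has been downloaded.

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")  # one-time download of the WordNet data

lemmatizer = WordNetLemmatizer()

# Without lemmatization, LDA would treat these as three unrelated terms;
# after lemmatization they all collapse to the same token.
for word in ["run", "running", "ran"]:
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))
```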
Hopefully this gives you an intuition for how topic modeling with LDA works. In the next post, we will discuss what the configuration options in the new Topic Modeling tool do, and how you can leverage those configuration options to make better topic models.
A geographer by training and a data geek at heart, Sydney joined the Alteryx team as a Customer Support Engineer in 2017. She strongly believes that data and knowledge are most valuable when they can be clearly communicated and understood. She currently manages a team of data scientists that bring new innovations to the Alteryx Platform.