Data Science

Machine learning & data science for beginners and experts alike.
SydneyF
Alteryx Alumni (Retired)

In the 2020.2 release, we added the Topic Modeling tool to Designer as a part of the Alteryx Intelligence Suite (AIS). It is a powerful tool but requires some background knowledge to use it to its full potential. In this series, I provide a gentle introduction to topic modeling and the new topic modeling tool in Alteryx. Missed the first two in the series? Catch up with Part 1 | What is LDA? and Part 2 | How to Configure the Tool.

 

Now that we know the algorithm driving the tool and how to configure it, let's dive into the fun part...

 

 

Interpreting the Visualization

 

If you choose Interactive Chart in the Output Options section, the “R” (Report) anchor returns an interactive visualization of the topic model. The interactive visualization is a modified version of LDAvis, a visualization developed by Carson Sievert and Kenneth E. Shirley. The visualization has two major components, the bar chart on the left and the intertopic distance map on the right.

 

 

vizOutput1.png

 

 

The intertopic distance map is a visualization of the topics in a two-dimensional space. The area of each topic circle is proportional to the number of words that belong to that topic across the dictionary. The circles are plotted using a multidimensional scaling algorithm (which converts a large number of dimensions, more than we can conceive with our human brains, to a reasonable number of dimensions, like two) based on the words they comprise, so topics that are closer together have more words in common.
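
To make "closer together" a little more concrete, here is a minimal sketch of the kind of distance that can feed a scaling step like this. As I understand it, the original LDAvis defaults to the Jensen-Shannon divergence between topic-word distributions before projecting to two dimensions; whether the modified Alteryx version does exactly the same is an assumption on my part. The vocabulary and probabilities below are made up for illustration.

```python
import math

# Hypothetical topic-word distributions: each list is P(word | topic) and sums to 1.
vocab = ["data", "model", "topic", "goal", "team"]
topics = {
    "A": [0.40, 0.30, 0.20, 0.05, 0.05],
    "B": [0.35, 0.35, 0.20, 0.05, 0.05],
    "C": [0.05, 0.05, 0.05, 0.45, 0.40],
}

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q), skipping zero-probability terms."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: a symmetric comparison of two distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Print the pairwise divergence for every pair of topics.
names = list(topics)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(a, b, round(js_divergence(topics[a], topics[b]), 4))
```

Topics A and B put their weight on the same words, so their divergence is much smaller than the divergence between A and C, and a scaling algorithm would place A and B near each other on the map.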

 

The bar chart by default shows the 30 most salient terms. The bars indicate the total frequency of the term across the entire corpus. Saliency is a specific metric, defined at the bottom of the visualization, that identifies the most informative or useful words for identifying topics in the entire collection of texts. Higher saliency values indicate that a word is more useful for identifying a specific topic.

 

When you select a topic in the intertopic distance map, or specify a topic in the top panel, the bar chart changes to display the most relevant words for that specific topic. A second, darker bar is also drawn over each term’s total frequency to show the topic-specific frequency of that term in the selected topic. If the dark bar entirely eclipses the light bar, that term nearly exclusively belongs to the selected topic.

 

When you select a word in the bar chart, the topics and probabilities by topic of that word are displayed in the intertopic distance map, so you can see which other topics a term might be shared with.

 

 

2020-09-01_13-43-57.png

 

 

 

You can adjust the words displayed in the bar chart for a topic by adjusting the λ (lambda) slider. Adjusting lambda to values close to 0 highlights potentially rare but more exclusive terms for the selected topic. Larger lambda values (closer to 1) highlight terms that occur more frequently in the topic but might not be exclusive to it. The authors of this visualization found in a user study that a λ value close to 0.6 was optimal for interpreting the topics, although they expected this value to change based on the data and individual topics.
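
Here is a small sketch of what the slider is doing, using the relevance definition from Sievert and Shirley: relevance = λ · log P(word | topic) + (1 − λ) · log lift, where lift = P(word | topic) / P(word). The words and probabilities are hypothetical, and this is an illustration of the formula rather than the code the tool runs.

```python
import math

# Hypothetical numbers for one topic: P(word | topic) and the overall P(word).
p_word_given_topic = {"data": 0.20, "model": 0.10, "lda": 0.04, "dirichlet": 0.02}
p_word = {"data": 0.15, "model": 0.05, "lda": 0.005, "dirichlet": 0.002}

def relevance(word, lam):
    """lam * log P(w|t) + (1 - lam) * log lift, where lift = P(w|t) / P(w)."""
    p_wt = p_word_given_topic[word]
    lift = p_wt / p_word[word]
    return lam * math.log(p_wt) + (1 - lam) * math.log(lift)

def rank_terms(lam):
    """Return the topic's words ordered from most to least relevant."""
    return sorted(p_word_given_topic, key=lambda w: relevance(w, lam), reverse=True)

# Compare the ordering at both ends of the slider and at 0.6.
for lam in (0.0, 0.6, 1.0):
    print(lam, rank_terms(lam))
```

With λ at 0 the ranking is driven purely by lift, so the rare but exclusive "dirichlet" comes first. With λ at 1 the ranking is driven purely by within-topic probability, so the common word "data" comes first.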

 

You can learn more about the visualization here.

 

 

Interpreting the Word-Relevance Summary

 

The Word-Relevance Summary is effectively the data-stream version of the visualization. It returns two metrics: relevance and saliency.

 

The intent of saliency is to help identify which words are the most informative for identifying topics across all the documents. Higher saliency values indicate that a word is more useful for identifying a specific topic. Saliency is never negative, and it does not have a maximum. A value of 0 indicates that a given word effectively belongs equally to all topics. Saliency is designed to look at words on a corpus scale, as opposed to an individual topic level.
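
For readers who want to see the arithmetic, here is a minimal sketch of saliency as defined by Chuang et al. (2012): saliency(w) = P(w) × distinctiveness(w), where distinctiveness is the sum over topics of P(T | w) · log(P(T | w) / P(T)). The LDAvis footer states the same formula using the term's frequency in place of P(w). All numbers below are hypothetical, and with two equally sized topics a score of 0 corresponds to a word split evenly between them.

```python
import math

# Hypothetical inputs: overall topic shares P(T) and, for each word,
# its corpus frequency P(w) and its topic distribution P(T | w).
p_topic = [0.5, 0.5]
words = {
    "the":    {"p_w": 0.30, "p_topic_given_w": [0.5, 0.5]},
    "goal":   {"p_w": 0.05, "p_topic_given_w": [0.95, 0.05]},
    "ballot": {"p_w": 0.01, "p_topic_given_w": [0.0, 1.0]},
}

def distinctiveness(p_topic_given_w):
    """KL divergence between P(T | w) and P(T); zero when the word says nothing about the topic."""
    return sum(
        ptw * math.log(ptw / pt)
        for ptw, pt in zip(p_topic_given_w, p_topic)
        if ptw > 0
    )

def saliency(word):
    """Saliency = P(w) * distinctiveness(w)."""
    entry = words[word]
    return entry["p_w"] * distinctiveness(entry["p_topic_given_w"])

for word in words:
    print(word, round(saliency(word), 4))
```

Note that "ballot" is the more exclusive word, but "goal" ends up more salient because it occurs five times as often. Saliency rewards words that are both frequent and concentrated in few topics, and it never says which topic a word points to.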

 

You can read more about saliency here.

 

 

topicmodeldataoutput.PNG

 

 

 

Relevance is a metric used for ranking terms within topics. It helps identify the most relevant words within a given topic. It reflects the level at which a word belongs to a certain topic at the exclusion of other topics. Relevance uses a parameter called lambda (which you can adjust using the slider in the visualization) to weight the probability of a term within a topic relative to its lift. The authors of this visualization found that the optimal value for lambda for topic interpretation is 0.6, which is the value used to calculate relevance in the data output. You can read more about relevance here.

 

Neither metric is normalized to any specific scale. Saliency can be any non-negative number, and relevance can be any number. These metrics are relative values you can use to identify the most helpful terms for describing and understanding a given topic. The higher the saliency value, the more helpful the term is for distinguishing topics. The higher the relevance value is for a given topic, the more exclusive that term is to the given topic.

 

Get into the weeds with another walk-through on LDA with code examples that gives additional context about interpretation, and learn more about evaluation methods for topic models.

 

Now it's time to get modeling!

Sydney Firmin

A geographer by training and a data geek at heart, Sydney joined the Alteryx team as a Customer Support Engineer in 2017. She strongly believes that data and knowledge are most valuable when they can be clearly communicated and understood. She currently manages a team of data scientists that bring new innovations to the Alteryx Platform.

Comments
alexandramannerings
8 - Asteroid

Hi, Sydney! I really liked your series on topic modeling. I'm still struggling with the difference between salience and relevance, however. The links you provided unfortunately go to articles behind paywalls, and when I google around for other resources I can't really find anything. The main thing I'm struggling with is that if a saliency score of 0 means that a word equally belongs to all topics, then it seems that a high score indicates a large degree of exclusivity to that topic. But isn't that what relevance is measuring...?? The only thing I can think is that a word that is 100% associated with a particular topic would be very salient even if it showed up, say, 60% of the time in other topics, but then it wouldn't be very relevant. Is that at all close?

 

Also, your image in the salience vs relevance section is the picture of the data output rather than the word-relevance summary, which is what you are discussing. I was curious as to what is actually contained in the Data output chart. I'm guessing it's a measure of saliency for the whole text segment to each topic...? As in, how much does that text entry belong to each of the topics?

SydneyF
Alteryx Alumni (Retired)

Hi @alexandramannerings,

 

I've updated the saliency link to point to the pdf of the paper, so hopefully that resolves the paywall issue. I believe all the other links in the article point directly to PDFs, open-source academic pages, or Wikipedia, but please let me know if I'm missing anything.

 

I've also updated the image to reflect the data output of the Report anchor, but to answer your question: in the data output, the Topic columns indicate the proportion of words in the document (row) that were "contributed" by each of the identified topics. These topic fields can be useful for assigning a dominant topic to each document you fed into the Topic Modeling tool, or as new features for a downstream machine learning model.
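
As a rough illustration of the "dominant topic" idea, here is a short sketch that picks the topic column with the largest proportion for each row. The column names and values are hypothetical stand-ins for the data output, not the tool's exact field names.

```python
# Hypothetical rows mimicking the data output: one row per document,
# with one column per topic holding that topic's share of the document.
rows = [
    {"text": "first document", "Topic_1": 0.70, "Topic_2": 0.20, "Topic_3": 0.10},
    {"text": "second document", "Topic_1": 0.15, "Topic_2": 0.25, "Topic_3": 0.60},
]
topic_columns = ["Topic_1", "Topic_2", "Topic_3"]

def dominant_topic(row):
    """Return the name of the topic column with the largest share for this row."""
    return max(topic_columns, key=lambda col: row[col])

# Add the dominant topic to each row as a new field.
for row in rows:
    row["dominant_topic"] = dominant_topic(row)
    print(row["text"], "->", row["dominant_topic"])
```

The same proportions can also be passed along unchanged as numeric features if you would rather not collapse each document to a single label.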

 

As far as the difference between saliency and relevance goes, hopefully this helps:

 

Saliency is a metric used to describe how general or specific a term is in the context of the topics generated by the topic model. So while a low saliency score indicates that a word has been assigned to most topics, and a higher saliency score indicates a word that has been assigned to fewer topics, saliency does not say anything about which topic(s) the term is assigned to. The intent of salience is to help identify which words in the collection of documents are the most informative for distinguishing topics. The most frequently occurring words in the collection of documents that are assigned to the fewest topics will have the highest saliency scores. 

 

Relevance is a topic-specific metric. So for each term, while saliency is calculated just once, relevance is calculated individually for each topic.

 

Relevance is used for ranking terms within the generated topics. It helps identify the most important/relevant words for a given topic. In the interactive visualization, you can adjust the relevance metric with the slider at the top of the visualization. To display words more exclusively assigned to the selected topic, you can set the slider at the top of the visualization closer to zero. To display more common (frequently occurring) terms you can set the slider closer to 1. In the value output, all relevance metrics are calculated with the "slider value" (lambda) set to 0.6. 

marksusol
5 - Atom

Looking for the code sample that produces the "Interpreting the Word-Relevance Summary" table you provided above.

 

I assume this is derived from LDAvis data?

NeilR
Alteryx Alumni (Retired)

@marksusol choosing "Word-Relevance Summary" under Output Options in the Topic Modeling tool configuration gives you this table from the R output, see below...

 

NeilR_1-1634578588468.png