Data Science

Machine learning & data science for beginners and experts alike.
vravichandran
Alteryx

Gathering feedback from customers through surveys, closed group interviews, online reviews, or third-party services is only half the job. What matters most is analyzing the data you have collected. So why do companies struggle to analyze their customer feedback data? Often it is because they don’t have the tools to synthesize non-numerical data, or they don’t know which techniques to use. If this sounds like you, you have come to the right place. We are going to use the Topic Modeling tool from the Alteryx Intelligence Suite on a customer reviews data set to learn more.

 


 

Approach

 

Topic Modeling is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. It is a frequently used text-mining tool for discovering hidden semantic (study of meaning) structures in a text body.

 

A Document is a collection of words, and a Corpus is a collection of documents. In this example, the corpus is all the reviews by users. Each row (review) is a document. Another example could be tweets, where each tweet is a document, and a collection of tweets about one subject is a corpus.

 

The Topic Modeling tool uses the well-known Latent Dirichlet Allocation (LDA) method. It boils down to two probabilities: the probability of a word given a topic and the probability of a topic given a document. The algorithm iteratively adjusts both until together they best explain the words observed in each document. Parameter tuning is minimized in the tool, and with the built-in visualization (using pyLDAvis), you can unlock even more insights!
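To make these probabilities concrete, here is a minimal Python sketch with made-up numbers. The topic names, words, and probabilities are illustrative assumptions, not output from the tool. Under LDA, the chance of seeing a word in a document is the sum, over topics, of the word's probability in each topic weighted by how much of the document belongs to that topic.

```python
# Hypothetical P(word | topic) for two topics; each row sums to 1.
word_given_topic = {
    "billing": {"pay": 0.5, "premium": 0.4, "calendar": 0.1},
    "planning": {"pay": 0.1, "premium": 0.1, "calendar": 0.8},
}

# Hypothetical P(topic | document) for a single review.
topic_given_doc = {"billing": 0.7, "planning": 0.3}


def word_probability(word, topic_given_doc, word_given_topic):
    """P(word | document) = sum over topics of P(word | topic) * P(topic | document)."""
    return sum(
        weight * word_given_topic[topic].get(word, 0.0)
        for topic, weight in topic_given_doc.items()
    )


for word in ("pay", "premium", "calendar"):
    print(word, round(word_probability(word, topic_given_doc, word_given_topic), 3))
```

A review that is mostly about the hypothetical "billing" topic ends up favoring "pay" even though "calendar" is the strongest word in the other topic.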

Let's use a Kaggle data set of customer reviews (linked under Data Source at the end of this article) and build a topic model.

 

Ready to do this using Intelligence Suite? Download the Intelligence Suite Trial and Starter Kit today!

 

About the data

 

The data set contains more than 12,000 reviews for 14 organization-focused applications. The goal is to understand which topics users commonly talk about, identify the sentiment and topics of interest, and take action accordingly.

 

More details are available on the Kaggle page.

 

Let’s Build an Alteryx Workflow

 

vravichandran_0-1675103845199.png

 

Step 1: Bring the “reviews.yxdb” data set onto the canvas

 

Step 2: Drop in a Select tool to verify field types and sizes

 

Step 3: Use a Data Cleansing tool to cleanse the incoming data. For example, remove null rows, remove leading or trailing whitespace, extra tabs, line breaks, and convert to lowercase.
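For readers who want to see what this kind of cleansing does to the text, here is a rough Python equivalent. It is a sketch of the operations listed in Step 3, not the Data Cleansing tool's actual implementation, and the sample reviews are invented.

```python
import re


def cleanse(reviews):
    """Drop null or empty rows, collapse whitespace, and lowercase each review."""
    cleaned = []
    for text in reviews:
        if text is None:
            continue  # remove null rows
        # Tabs, line breaks, and repeated spaces become a single space.
        text = re.sub(r"\s+", " ", text).strip().lower()
        if text:
            cleaned.append(text)
    return cleaned


print(cleanse(["  Great APP!\n", None, "\tToo   many ads ", ""]))
```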

 

image-20230124-160529.png

 

Optional Step: Applying Sentiment Analysis

 

Sentiment Analysis is an approach to natural language processing (NLP) that identifies the emotional tone behind a document.

 

vravichandran_2-1675103845212.png

 

vravichandran_3-1675103845230.png

 

Suppose you decide to run topic modeling for a specific sentiment, for example “Negative,” to see what topics those users are talking about. In that case, use this section before Text Pre-processing and Topic Modeling.

 

It is recommended to use the document as is for the sentiment analysis. If text pre-processing is applied before these steps, it may not yield good results.

 

image-20230124-163118.png

 

  • Select language “English”
  • Select Algorithm “VADER”
  • Select Text Field - “Content”
  • Select “Find Sentiment at Sentence Level” - This will calculate the sentiment for each review.
  • Select “Output Categorical Sentiment” - This will output a column with “Positive,” “Negative,” or “Neutral.”
  • Filter for sentiment and use it for applying Topic Modeling (Continue to Step 4)
  • As an optional step, use the Word Cloud tool to see a cluster of words. This helps to visually see the top words.
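The categorical output and the filter can be pictured with a short Python sketch. VADER produces a compound score between -1 and 1, and its authors suggest cut-offs of ±0.05 for labeling text; whether the Alteryx tool uses exactly these cut-offs is an assumption here, and the reviews and scores below are invented for illustration.

```python
def categorize(compound, threshold=0.05):
    """Map a VADER-style compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


# Hypothetical (review, compound score) pairs; the scores are made up.
scored_reviews = [
    ("Love this app", 0.64),
    ("It keeps crashing", -0.42),
    ("It is an app", 0.0),
]

# Keep only the negative reviews to feed into topic modeling.
negative = [text for text, score in scored_reviews if categorize(score) == "Negative"]
print(negative)
```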

 

vravichandran_5-1675103845297.png

 

Step 4: Let’s use a Text Pre-processing tool to clean up the input data set. In this case, the data has repetitive words, digits, and punctuation. Cleaning these up will improve the results and help downstream analysis.

 

Note: One could apply sentiment analysis first and then apply Topic modeling to understand the negative or positive sentiment topics. Or apply Topic modeling to the entire data set. The attached workflow has both methods.

 

vravichandran_6-1675103845300.png

 

image-20230105-160739 (1).png

 

Step 4.1: Select the language used in the majority of the reviews

 

Note: The tool supports English, French, German, Italian, Portuguese, and Spanish.

 

Step 4.2: Select the Text field as “Content”

 

Step 4.3: You could use the lemmatization option to convert words to their common root in order to improve the alignment of words to a topic. For example, “caring” would be replaced with “care,” and “feet” would be replaced with “foot.”

 

Step 4.4: Applying Filters - The tool allows filtering “Digits,” “Punctuations,” and “Stop Words.” Use these options to filter any unwanted words or digits from the data set. The tool uses default stop words. If you wish to add your own stop words, such as company or product names, you can enter them in the space provided.
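The effect of these filters can be sketched in a few lines of Python. This is only an illustration: the stop-word list is a tiny stand-in for the tool's default list, and “AcmeTasks” is a hypothetical product name used as a custom stop word.

```python
import string

# Tiny illustrative stop-word list; the tool's default list is much longer.
DEFAULT_STOP_WORDS = {"the", "a", "is", "and", "to", "it", "this"}


def preprocess(text, extra_stop_words=()):
    """Remove digits, ASCII punctuation, and stop words from a review."""
    stop_words = DEFAULT_STOP_WORDS | {w.lower() for w in extra_stop_words}
    # Delete every digit and ASCII punctuation character.
    table = str.maketrans("", "", string.digits + string.punctuation)
    tokens = text.lower().translate(table).split()
    return " ".join(t for t in tokens if t not in stop_words)


print(preprocess("The AcmeTasks app is great, 10/10!", extra_stop_words=["AcmeTasks"]))
```

Adding the product name as a stop word keeps it from dominating every topic, which is the main reason to supply custom stop words.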

 

Step 4.5: After pre-processing, the tool outputs a new column with the suffix “_processed.” In this case, we got “Content_processed.” Rename it to “Content” and drop the original column.

 

Note: Options like Lemmatize and Filters will help the topic modeling algorithm in assigning words to topics and improve the overall results. If the data is not cleaned properly, then it will show up in the results.

 

Step 5: After text pre-processing, the data is ready to be used with the Topic Modeling tool. Let’s drag and drop the Topic Modeling tool into the canvas and add browse tools.

 

image-20221230-165113.png

 

image-20221230-165128.png

 

Step 5.1: Set Text Field to “Content”

 

Step 5.2: Set Number of Topics: Setting the number of topics is the most time-consuming and repetitive part of the process. Many studies offer guidance on choosing the number of topics, but in practice you often find it by iterating through the steps and validating the results. We are looking for the number of topics at which the algorithm isn’t memorizing the data (overfitting) but hasn’t stopped too early to get the best answer (underfitting). Most research papers and articles recommend starting with a higher number and reducing it based on the output and its interpretation.

 

Note: Overfitting and underfitting are machine learning concepts. An overfitted model corresponds too closely or exactly to a particular set of data and may therefore fail to fit additional data or predict future observations reliably. An underfitted model is too simple to capture the underlying structure of the data.

 

For this data set, I started with 20 topics and iterated down to 5. The steps below show more detail.

 

Step 6: Selecting the number of Topics and Understanding the Output: (if you choose Interactive Chart)

 

What do the bubbles and bars represent? Each bubble represents a topic. The larger the bubble, the higher the percentage of reviews that are about that topic.

 

The green bars represent each word’s frequency in the overall data set. The blue bars represent its frequency within the selected topic.

 

image-20221230-171824.png

 

image-20221230-171935.png

 

LDA Assumptions:

  1. Every document is made up of a mixture of topics
  2. Every topic is made up of a distribution of Keywords
  3. Every Keyword has a probability of belonging to a topic
  4. Given a document, we can allocate the most likely topic by looking at the document keywords.

 

For the given data set, I started with 20 topics, and the model produced the visualization below. You will notice a few overlapping bubbles.

 

A good topic model will have big and non-overlapping bubbles scattered throughout the chart.
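Why do some bubbles overlap? As far as I understand, pyLDAvis places the bubbles according to how different the topics' word distributions are, using Jensen-Shannon divergence by default. The Python sketch below computes that divergence for three invented topics over a four-word vocabulary; the numbers are assumptions chosen only to show the contrast.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, 0 for identical distributions."""
    midpoint = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, midpoint) + 0.5 * kl_divergence(q, midpoint)


# Hypothetical P(word | topic) over the same four-word vocabulary.
topic_a = [0.50, 0.30, 0.10, 0.10]
topic_b = [0.45, 0.35, 0.10, 0.10]  # nearly the same words as topic_a
topic_c = [0.05, 0.05, 0.40, 0.50]  # very different words

print("a vs b:", js_divergence(topic_a, topic_b))  # small: bubbles would sit close together
print("a vs c:", js_divergence(topic_a, topic_c))  # large: bubbles would sit far apart
```

Two topics with a small divergence use almost the same vocabulary, which is a hint that the number of topics is too high.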

 

image-20221230-170602.png

 

After seeing multiple overlapping topics, I reduced the number of topics to 12 and got the visualization below. There are still a few overlaps.

 

image-20221230-170908.png

 

Now, let’s see if we get non-overlapping bubbles. I set the number of topics to 6, but it may require one last tweak.

 

image-20221230-171249.png

 

Finally, I set the number of topics to 5, and clean, well-separated bubbles showed up.

 

Note: The optimal number of topics is not always reached by reducing the number of topics. Sometimes you have to experiment with the interpretation by increasing or decreasing the number. In this case, I used the “Explore the number of topics to choose” section of the workflow below to arrive at the optimal number of topics.

 

image-20221230-171504.png

 

Relevance metric (λ):

 

image-20230124-215300.png

 

A “relevance metric” slider scale at the top of the panel controls how the words for a topic are sorted. As defined in the article by Sievert and Shirley, “relevance” combines two different ways of thinking about the degree to which a word is associated with a topic.

 

On the one hand, we can think of a word as highly associated with a topic if its frequency in that topic is high. By default, the relevance value in the slider is set to “1,” which sorts words by their frequency in the topic (i.e., by the length of their blue bars).

 

On the other hand, we can think of a word as highly associated with a topic if its “lift” is high. “Lift” means basically how much a word’s frequency sticks out in a topic above the baseline of its overall frequency in the model (i.e., “the ratio of a term’s probability within a topic to its marginal probability across the corpus,” or the ratio between its blue bar and green bar).

 

The article Experiments on Topic Modeling using PyLDAvis by Lucia Dossin recommends setting the relevance to “0.6” for optimal results. The example below shows how the results differ when we change the relevance metric from “1” to “0.6”.
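The relevance formula from Sievert and Shirley is short enough to try by hand: relevance = λ · log p(word | topic) + (1 − λ) · log(lift), where lift = p(word | topic) / p(word). The Python sketch below applies it to two hypothetical words; the probabilities are invented to show how the ranking can flip between λ = 1 and λ = 0.6.

```python
import math


def relevance(p_word_given_topic, p_word, lam=0.6):
    """Sievert and Shirley's relevance of a word to a topic for a given lambda."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)


# Hypothetical words: (P(word | topic), P(word) across the corpus).
words = {
    "app": (0.08, 0.07),        # frequent everywhere, so low lift
    "premium": (0.04, 0.008),   # rarer overall but concentrated in this topic
}

for lam in (1.0, 0.6):
    ranked = sorted(words, key=lambda w: relevance(*words[w], lam=lam), reverse=True)
    print(f"lambda={lam}: {ranked}")
```

With λ = 1 the generic word wins on raw frequency; lowering λ rewards the word that is distinctive for the topic.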

 

image-20230124-214757.png

 

image-20230124-214819.png

 

Step 6.1: Understanding the Word Relevance Output: (if you choose Word-Relevance summary)

 

image-20230113-200754.png

 

If you are interested in looking at the data instead of visuals, this option will be helpful. The attached workflow section uses a word-relevance summary to dive deep into the number of topics to choose.

 

Here are some useful definitions to understand the output from word relevance.

 

Saliency helps us identify the words that are most informative for identifying topics within documents. A higher saliency value indicates that a word is more useful in identifying a specific topic. It is never negative and does not have a fixed maximum. It is designed to weigh specific words in relation to the totality of documents that we are analyzing; a value of 0 indicates that a word is spread across all topics and so does not help tell them apart.
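For the curious, pyLDAvis follows the saliency definition from Chuang et al.: saliency(word) = P(word) × Σ over topics of P(topic | word) × log(P(topic | word) / P(topic)). Here is a minimal Python sketch with invented probabilities for a two-topic model; it assumes the tool reports this same quantity.

```python
import math


def saliency(p_word, p_topic_given_word, p_topic):
    """Saliency = word frequency times how unevenly the word is spread over topics."""
    distinctiveness = sum(
        ptw * math.log(ptw / pt)
        for ptw, pt in zip(p_topic_given_word, p_topic)
        if ptw > 0
    )
    return p_word * distinctiveness


p_topic = [0.5, 0.5]  # hypothetical overall topic shares

# A word spread across topics exactly like the corpus itself carries no signal.
print(saliency(0.05, [0.5, 0.5], p_topic))

# A word that appears almost only in topic 1 is informative.
print(saliency(0.02, [0.95, 0.05], p_topic))
```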

 

Topic Relevance is a metric used to order words within topics. It helps us to identify the most appropriate words for each topic and reflects the level at which a word belongs to a topic. The higher the value for a given topic, the more important that word will be for that topic.

 

Note: Dictionary and LDA options are Advanced options. Use the default when starting and tune only if you understand how the parameters will impact the output.

 

Step 7: Exploring Topics and Content:

 

The attached workflow has a section that will help you drill down on the number of topics to select. Here, we assign each word to its document for reference. The idea is to verify the content visually and check that the assigned topics are relevant. As mentioned above, this step is iterative; repeat it until you settle on the optimal number of topics.
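One simple way to do this kind of check outside the workflow is to label each review with its highest-probability topic and then read a few reviews per topic. The Python sketch below assumes you have per-review topic probabilities; the reviews and numbers are invented, and this is not necessarily how the attached workflow does it.

```python
from collections import defaultdict


def dominant_topic(topic_probs):
    """Return (1-based topic number, probability) of the most likely topic."""
    best = max(range(len(topic_probs)), key=topic_probs.__getitem__)
    return best + 1, topic_probs[best]


# Hypothetical reviews with made-up topic probabilities for a 3-topic model.
scored = [
    ("reminder notification never fire", [0.80, 0.15, 0.05]),
    ("cannot log in account after update", [0.10, 0.75, 0.15]),
    ("premium price too high", [0.05, 0.20, 0.75]),
]

by_topic = defaultdict(list)
for text, probs in scored:
    topic, _ = dominant_topic(probs)
    by_topic[topic].append(text)

for topic in sorted(by_topic):
    print(topic, by_topic[topic])
```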

 

image-20230113-194703.png

 

For example, topic 1 had top words such as Time, Work, and Pay. Here is a quick look at a few reviews that were assigned to topic 1. Based on this, we could say the users are talking about App functionality.

 

image-20230120-195029.png

 

As another example, topic 2 had top words such as Account, Use, and Version. Here is a quick look at a few reviews that were assigned to topic 2. Based on this, we could say the users are talking about User Experience.

 

image-20230120-201057.png

 

Topic Names:

 

Using the method above, we can determine a name for each of the 5 topics. You could base it on the top words in that topic.

  • Topic 1 is about app functionality, such as reminders, tasks, and notifications.
  • Topic 2 is about the user’s experience with the app.
  • Topic 3 is about pay and premium-related options.
  • Topic 4 is about the calendar and daily tasks.
  • Topic 5 is about the usability of the app.

 

image-20230120-201611.png

 

image-20230120-201731.png

 

Advanced Options:

 

In this article, we used default values for Dictionary and LDA options.

 

Below are some advanced options that can be used to further improve the results. Increasing or decreasing these values will modify the way the model treats any word in a document. For example, if we increase the min frequency, the model starts to ignore terms that appear infrequently and are unlikely to reflect the topics in a document.

 

Dictionary Options:

  • Min Frequency and Max Frequency are the frequencies at which a word can appear in a body of text before the model ignores the word, where the frequency is measured by the number of documents containing a word divided by the total number of documents in the body of text.
  • Max Words specifies how many words you want the model to consider based on how frequently the words appear across all the documents.
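The dictionary options can be pictured as a filter on document frequency. The Python sketch below is an assumption about the general idea, not the tool's implementation; in particular, whether the thresholds are inclusive and how ties are broken for Max Words are guesses, and the documents are invented.

```python
from collections import Counter


def build_dictionary(documents, min_frequency=0.0, max_frequency=1.0, max_words=None):
    """Keep words whose document frequency lies within [min_frequency, max_frequency]."""
    n_docs = len(documents)
    doc_counts = Counter()
    for doc in documents:
        doc_counts.update(set(doc.split()))  # count each word once per document
    kept = {
        word: count / n_docs
        for word, count in doc_counts.items()
        if min_frequency <= count / n_docs <= max_frequency
    }
    # Most widespread words first; alphabetical order breaks ties.
    ranked = sorted(kept, key=lambda w: (-kept[w], w))
    return ranked[:max_words]


docs = ["app crash login", "app great reminder", "app crash again", "app calendar sync"]
print(build_dictionary(docs, min_frequency=0.5, max_frequency=0.9))
```

Here “app” is dropped because it appears in every document, and the one-off words are dropped because they appear in too few.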

 

LDA Options:

  • Alpha represents the density of topics the model should expect in each document. Increasing Alpha allows the model to recognize a greater number of distinct topics in a document. Decreasing Alpha limits the number of topics the model recognizes in each document.
  • Eta represents the density of words needed to make up a topic. Increasing Eta increases the number of words needed to identify a topic. Decreasing Eta reduces the number of words needed to identify a topic.
  • More information is available here 
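In standard LDA, Alpha and Eta are concentration parameters of Dirichlet priors, and I am assuming the tool's options map onto them. The Python sketch below samples topic mixtures from a symmetric Dirichlet with a low and a high concentration to show the difference; the topic count, sample size, and seed are arbitrary choices.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def average_largest_share(concentration, n_topics=5, n_docs=500, seed=0):
    """Average share taken by the biggest topic across simulated documents."""
    rng = random.Random(seed)
    shares = (max(sample_dirichlet(concentration, n_topics, rng)) for _ in range(n_docs))
    return sum(shares) / n_docs


# Low alpha: each document is dominated by one topic.
print("alpha=0.1 :", average_largest_share(0.1))
# High alpha: each document is an even blend of many topics.
print("alpha=10.0:", average_largest_share(10.0))
```

The same reasoning applies to Eta, with words within a topic taking the place of topics within a document.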

 

Final Thoughts

 

With minimal steps, we were able to build a workflow to understand what customers are talking about in app reviews. Although iterating on the number of topics took more time and effort, we were able to get meaningful output in the end. It would have taken a long time to read more than 12,000 comments manually and categorize them appropriately. Alteryx Intelligence Suite’s Topic Modeling tool will help you get to the solution quickly. Try out the attached workflow zip file and explore the results.

 

Please do reach out if you have any questions.

 


 

How to run the workflow

  1. Download the zipped workflow from our Community Gallery
  2. Unzip the file to a local folder.
  3. Run the workflow!

 

Data Source

https://www.kaggle.com/datasets/prakharrathi25/google-play-store-reviews

 
