Garabujo7
Alteryx

In Part 3, we review the Word-Relevance Summary and the visualization data. The summary returns the two metrics mentioned previously: relevance and saliency.

 

Saliency

Saliency helps us identify the words that are most informative for distinguishing topics within documents. A higher saliency value indicates that a word is more useful in identifying a specific topic.

 

Saliency is never negative and has no maximum. It is designed to evaluate specific words in relation to the full set of documents that we are analyzing; a value of 0 indicates that a word is spread across all topics and therefore does not help tell them apart.
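To make the "value of 0" case concrete, here is a minimal sketch. It assumes the saliency definition from Chuang et al. that LDAvis-style visualizations use; whether the Topic Modeling tool computes saliency exactly this way is an assumption, and all probabilities below are made up for illustration.

```python
import math

def saliency(p_word, p_topic_given_word, p_topic):
    """Assumed definition: P(w) * sum over topics of P(t|w) * log(P(t|w) / P(t))."""
    distinctiveness = sum(
        ptw * math.log(ptw / pt)
        for ptw, pt in zip(p_topic_given_word, p_topic)
        if ptw > 0  # a term with probability 0 contributes nothing
    )
    return p_word * distinctiveness

# Hypothetical 3-topic model in which every topic is equally likely.
p_topic = [1 / 3, 1 / 3, 1 / 3]

# A word spread evenly over all topics carries no information about the topic.
generic_word = saliency(0.05, [1 / 3, 1 / 3, 1 / 3], p_topic)

# A word concentrated in one topic is informative, so its saliency is positive.
specific_word = saliency(0.05, [0.90, 0.05, 0.05], p_topic)
```

Under this definition the evenly spread word scores exactly 0, while the concentrated word scores above 0.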

 

Relevance

Relevance is a metric used to rank words within topics. It helps us identify the most appropriate words for each topic and reflects how strongly a word belongs to a topic. The higher a word's value for a given topic, the more important that word is for that topic.
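The ranking idea can be sketched as follows, assuming the relevance definition from Sievert and Shirley's LDAvis work. That the Topic Modeling tool uses this exact formula is an assumption, the weight `lam = 0.6` is only an illustrative choice, and the probabilities are invented.

```python
import math

def relevance(p_word_given_topic, p_word, lam=0.6):
    """Assumed definition: lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)

# Two hypothetical words that are equally frequent inside the topic.
# One is just as frequent in the whole corpus; the other is rare elsewhere.
common_everywhere = relevance(p_word_given_topic=0.02, p_word=0.02)
specific_to_topic = relevance(p_word_given_topic=0.02, p_word=0.002)
```

The values are logarithms, so they are typically negative; what matters is the ordering. Here the word that is rare outside the topic ranks higher than the one that is common everywhere.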

 

Both metrics show relative values that we can use to describe and understand a specific topic. For a deeper dive, see Getting to the Point with Topic Modeling - Interpreting the Results.

 

Garabujo7_0-1628694803119.png

 

Assigning Tags to Topics

Assigning tags to topics allows us to label documents for categorization. Select the R output of the Topic Modeling tool and insert a Formula tool after it to extract the topic to which each word belongs.

 

Garabujo7_1-1628694819711.png

 

Garabujo7_2-1628694826068.png

 

Garabujo7_3-1628694832164.png

 

The MaxIDX function returns the position of the largest value among the three relevance fields, not the value itself. The result is a zero-based integer, so we add 1 at the end to get the topic number. In this way we will have assigned a topic to each word, along with its relevance.
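For readers who think in code, the same step looks roughly like this in Python. The `max_idx` helper is a stand-in for MaxIDX written for this sketch, the rows are hypothetical, and the tie-breaking rule (first field wins) is a choice made here rather than documented tool behavior.

```python
def max_idx(*values):
    """Zero-based position of the largest value; the first one wins on ties."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical rows: word, then its relevance for topics 1, 2 and 3.
rows = [
    ("price", -3.1, -5.2, -6.0),
    ("delivery", -6.4, -2.9, -5.5),
    ("refund", -5.9, -6.1, -3.3),
]

# Add 1 to turn the zero-based position into a topic number, and keep the
# winning relevance value alongside it.
word_topics = [(word, max_idx(*scores) + 1, max(scores)) for word, *scores in rows]
```

Each word ends up as a (word, topic, relevance) triple, for example `("delivery", 2, -2.9)`.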

 

Garabujo7_4-1628694848562.png

 

The next step is to add a Sample tool to keep only the first N words of each topic.
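A minimal sketch of "first N per group" follows. It assumes the rows are ordered by relevance, highest first, before sampling (the sketch sorts them explicitly); the function name and data are hypothetical.

```python
from collections import defaultdict

def first_n_per_topic(word_topics, n=3):
    """Keep the n highest-relevance words of each topic."""
    by_topic = defaultdict(list)
    # Sort by relevance, highest first, so "first n" means "top n".
    for word, topic, _score in sorted(word_topics, key=lambda r: r[2], reverse=True):
        if len(by_topic[topic]) < n:
            by_topic[topic].append(word)
    return dict(by_topic)

# Hypothetical (word, topic, relevance) triples.
word_topics = [
    ("price", 1, -3.1), ("cost", 1, -3.4), ("cheap", 1, -3.9), ("value", 1, -4.5),
    ("delivery", 2, -2.9), ("late", 2, -3.0),
]
top_words = first_n_per_topic(word_topics, n=3)
```

Topic 1 keeps only its three best words, and topic 2 keeps both of its words because it has fewer than three.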

 

Garabujo7_5-1628694866542.png

 

Garabujo7_6-1628694877135.png

 

This gives us the three most prominent and relevant words for each topic:

 

Garabujo7_7-1628694887757.png

 

The next step is to create the tags based on the topic terms. To make it dynamic, use a Summarize tool to create a concatenated field with the three words to serve as a label for the topic.
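The concatenation amounts to joining each topic's words with a separator. A small sketch, with hypothetical words and a comma separator chosen for illustration:

```python
# Hypothetical top words per topic, as produced by the sampling step.
top_words = {
    1: ["price", "cost", "cheap"],
    2: ["delivery", "late", "courier"],
    3: ["refund", "return", "support"],
}

# One label per topic, built from its words so it updates when the model changes.
topic_labels = {topic: ", ".join(words) for topic, words in top_words.items()}
```

Because the labels are derived from the data, retraining the model with different documents produces new labels without any manual editing.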

 

Garabujo7_8-1628694892366.png

 

Using a Find Replace tool, we can change the topic numbers to text labels that make more sense for the business users consuming this analysis.
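Conceptually this is a lookup from topic number to label. The sketch below uses hypothetical document IDs and labels, and falls back to a generic "Topic N" name when no label exists, which is a choice made for this example.

```python
# Hypothetical lookup from topic number to its concatenated label.
topic_labels = {
    1: "price, cost, cheap",
    2: "delivery, late, courier",
    3: "refund, return, support",
}

# Hypothetical documents already tagged with a topic number.
doc_topics = [("doc-001", 2), ("doc-002", 1), ("doc-003", 3), ("doc-004", 4)]

# Swap each number for its label; unknown topics keep a generic name.
labeled_docs = [
    (doc_id, topic_labels.get(topic, f"Topic {topic}"))
    for doc_id, topic in doc_topics
]
```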

 

Garabujo7_9-1628694908738.png

 

Garabujo7_10-1628694911931.png

 

Garabujo7_11-1628694920888.png

 

Now each document is tagged with the topic it belongs to. With that, we can summarize by topic to count how many documents fall into each category.
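The count is a simple group-and-count. A sketch with hypothetical labeled documents:

```python
from collections import Counter

# Hypothetical (document, topic label) pairs.
labeled_docs = [
    ("doc-001", "delivery, late, courier"),
    ("doc-002", "price, cost, cheap"),
    ("doc-003", "refund, return, support"),
    ("doc-004", "price, cost, cheap"),
]

# Number of documents per topic label.
docs_per_topic = Counter(label for _doc_id, label in labeled_docs)
```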

 

Garabujo7_12-1628694937885.png

 

Visualize each topic in a custom word cloud

To categorize each document within its topic, we use a similar process. Take the D output from the Topic Modeling tool and add a Formula tool to it. Use the MaxIDX() function to obtain the topic that is most relevant to each document.

 

Garabujo7_13-1628694946458.png

 

Filter each topic to view it independently.
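Put together, tagging each document with its dominant topic and then filtering one topic can be sketched as below. The texts and scores are hypothetical, and `dominant_topic` is a helper written for this example.

```python
def dominant_topic(scores):
    """One-based number of the topic with the highest score."""
    return max(range(len(scores)), key=scores.__getitem__) + 1

# Hypothetical documents with their scores for topics 1, 2 and 3.
documents = [
    ("Great price for the quality", [0.70, 0.20, 0.10]),
    ("Package arrived two days late", [0.15, 0.75, 0.10]),
    ("Cheaper than other stores", [0.60, 0.25, 0.15]),
]

tagged = [(text, dominant_topic(scores)) for text, scores in documents]

# Equivalent of one filter: keep only the documents assigned to topic 1.
topic_1_docs = [text for text, topic in tagged if topic == 1]
```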

 

Using the Word Cloud tool, we will set up the visualization.

 

Garabujo7_14-1628694958557.png

 

First, select the field that we want to visualize. To customize the word cloud, select the corresponding option.

 

Garabujo7_15-1628694971783.png

 

There are several options for customization:

 

  • Choose a color for the background.
  • Select the maximum number of words to evaluate.
  • Resize the word cloud.
  • Apply a mask to define the shape of the word cloud.
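To give a feel for what the maximum-words setting does, here is a rough sketch that counts word frequencies and keeps only the most frequent ones. It assumes the word cloud sizes words by how often they occur, which is typical for word clouds but not confirmed here for the Word Cloud tool; the stopword list and texts are hypothetical.

```python
import re
from collections import Counter

STOPWORDS = frozenset({"the", "a", "and", "was", "for", "to"})  # illustrative only

def word_frequencies(texts, max_words=5):
    """Most frequent non-stopword terms, capped at max_words."""
    words = (w for text in texts for w in re.findall(r"[a-z']+", text.lower()))
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(max_words)

# Hypothetical documents from a single filtered topic.
topic_docs = [
    "The delivery was late",
    "Late delivery and a damaged box",
    "Courier lost the delivery",
]
top_terms = word_frequencies(topic_docs, max_words=2)
```

With a cap of two words, only "delivery" and "late" survive, so a lower cap yields a sparser cloud dominated by the most common terms.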

Garabujo7_16-1628694991493.png

 

To use an image as a template, add a Blob Input tool (blob stands for binary large object) from the Developer tool category and select the path where the file is located.

 

Garabujo7_17-1628695006439.png

 

Garabujo7_18-1628695023373.png

 

Once this is done, the Blob option appears under the mask setting in the Word Cloud configuration. Run the workflow to generate the word cloud. In this case, I used the Twitter logo to shape the word cloud.

 

Garabujo7_19-1628695054830.png

 

Garabujo7_20-1628695069489.png

 

 

The last part in this series will demonstrate how to export a trained topic model to score new items and speed up the process.
