
Data Science

Machine learning & data science for beginners and experts alike.
SydneyF
Alteryx Alumni (Retired)

Understanding the topic of a piece of writing is typically an easy task for people. Based on the title of this article, or on what I have been writing about lately, you may have already deduced that this blog post is going to be about Natural Language Processing on the many posts and articles of the Alteryx Community. Part of what makes topic deduction easy for people is context. By living our rich and thrilling lives, we gain context on the ways of the world and the types of things that get written about. Understanding the topic of something is a little trickier for computers, which tend to live in boxes and lack the rich context and connections that people have.

 

However, there are times when we need to train our computers to find topics in a collection of documents. There might be too many documents for you, a single human, to read through, or you may be interested in discovering underlying themes in a large set of texts.

 


 

LDA Overview

 

Enter Latent Dirichlet (pronounced something like “Deer-ish Sleigh”) Allocation, a popular model for Topic Modeling. Latent Dirichlet Allocation (LDA) is a Bayesian network that models how the documents in a corpus are topically related. It is an unsupervised machine learning algorithm that clusters discrete data, where each observation can belong to more than one cluster.

 

Before we get into how the model works, let's frame this article with the following definitions. Documents (articles, posts, etc.) are made up of topics. These topics are made up of words. Documents can be made up of any combination of topics, where each topic is represented as a probability distribution over a set of words.

 

In the mind of an LDA model, documents are written by first determining what topics the article is going to be written about as a percentage break-down (e.g., 20% Python, 40% NLP, 10% Puppies, and 30% Alteryx Community), and then filling up the document with words (until the specified length of the document is reached) that belong to each topic. For an LDA model, context doesn’t matter, only the distribution of words. Each document in a corpus is effectively a bag of words.
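
To make this generative story concrete, here is a minimal toy sketch of it in Python with numpy. The vocabulary, the two topics, and the document length below are all made up for illustration; they are not taken from the Community model.

import numpy as np

rng = np.random.default_rng(42)

# toy vocabulary and two hypothetical topics, each a probability distribution over words
vocab = ["python", "pandas", "macro", "workflow", "puppy", "fetch"]
topics = np.array([
    [0.45, 0.35, 0.10, 0.05, 0.03, 0.02],  # a "Python"-flavored topic
    [0.02, 0.03, 0.05, 0.10, 0.40, 0.40],  # a "Puppies"-flavored topic
])

# step 1: draw the document's topic mixture from a Dirichlet prior (e.g., 70% / 30%)
doc_topic_mix = rng.dirichlet(alpha=[0.5, 0.5])

# step 2: for each word slot, pick a topic from the mixture, then a word from that topic
doc_length = 10
document = []
for _ in range(doc_length):
    z = rng.choice(len(topics), p=doc_topic_mix)   # topic assignment for this slot
    document.append(rng.choice(vocab, p=topics[z]))

print(doc_topic_mix.round(2), document)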

 

Given how an LDA model thinks a document is written, we can think about how it creates topic models. LDA attempts to work backwards based on this generative model to identify the topics that were used to generate the corpus. 

 

To enable the LDA model to “solve backwards,” we need to give it a few parameters to go on. We need to supply it with the number of topics to create, as well as a beta value and an alpha value, where the beta value is the parameter of the uniform Dirichlet prior on the per-topic word distribution, and the alpha is the parameter of the uniform Dirichlet prior on the per-document topic distribution. If that all seemed like gibberish, don’t worry too much. What you need to know is that a high alpha makes documents appear more similar to one another (meaning that each document will be a mixture of many topics), and a low alpha makes each document more homogeneous (containing high proportions of just a few topics). Similarly, a high beta makes topics appear more similar to each other by making each topic a mixture of most of the words in the corpus, while a low beta will make each topic a mixture of just a few of the words. If you would like to understand the math a little better, there are nice explanations of Dirichlet distributions here and here.
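
If you would rather see the effect of alpha than read about it, sampling a few topic mixtures from numpy's Dirichlet implementation makes the difference obvious. The number of topics and the alpha values below are arbitrary, chosen only for illustration.

import numpy as np

rng = np.random.default_rng(0)
k = 5  # number of topics (arbitrary for this illustration)

# high alpha: each document is a fairly even blend of all k topics
print(rng.dirichlet(alpha=[10.0] * k, size=3).round(2))

# low alpha: each document concentrates its weight on just one or two topics
print(rng.dirichlet(alpha=[0.1] * k, size=3).round(2))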

 

The structure of LDA is often expressed in Plate Notation, which is a way to represent a graphical model as an illustration, where repeated groups of variables are drawn once and enclosed in a rectangle (a “plate”).

 

In this visualization, the arrows indicate dependencies (e.g., the word used is dependent on beta and on the topic of the word). The shaded circle indicates an observable variable, and the empty circles indicate latent variables.

 

[Plate notation diagram of the LDA model]

 

Once we have provided the necessary parameters, the LDA model kicks off by randomly assigning every word in each document to one of k topics (however many you specified). It then refines these assignments one word at a time, assuming that every topic assignment in the model is correct except for the one it is currently working on. For that word, it calculates the proportion of words in the document that are currently assigned to each topic, and the proportion of assignments to each topic, across all documents, that come from that word (the probability that topic t generated word w), and reassigns the word based on those proportions. It keeps shifting words between topics this way until it reaches a stable state where the assignments make sense.
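
The description above is closest to the Gibbs-sampling view of LDA (Gensim's LdaModel actually trains with online variational Bayes, but the intuition carries over). Here is a minimal sketch of that reassignment rule for a single word occurrence; the count arrays and function name are hypothetical, not part of Gensim's API.

import numpy as np

def resample_topic(word_id, doc_topic_counts, topic_word_counts, alpha, beta, rng):
    """Pick a new topic for one word occurrence, pretending every other
    assignment in the corpus is correct (collapsed Gibbs intuition)."""
    vocab_size = topic_word_counts.shape[1]
    # how prevalent each topic currently is in this document
    p_topic_given_doc = doc_topic_counts + alpha
    # how often each topic generates this particular word across all documents
    p_word_given_topic = (topic_word_counts[:, word_id] + beta) / \
                         (topic_word_counts.sum(axis=1) + vocab_size * beta)
    weights = p_topic_given_doc * p_word_given_topic
    return rng.choice(len(weights), p=weights / weights.sum())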

 

If this overview isn’t doing it for you, there is a collection of helpful resources scattered across the internet. There is this overview of the LDA Algorithm, a slightly longer video with an applied example, and an even longer lecture from a professor at CU Boulder (Sko Buffs!). If you’re in the mood for a written document, there is a fun article that describes LDA with emojis from Medium.

 

Topic Modeling on the Community

 

For Topic Modeling on the Community, we will be using the Python Gensim module. In addition to Gensim, we will be using spaCy for its lemmatization feature, numpy, and a really neat package called pyLDAvis to create a visualization at the end.

 

The first step is to import all the required packages.

 

import os
import re
import numpy as np

# gensim
import gensim
import gensim.corpora as corpora

# spaCy is used for lemmatization
import spacy

# plotting tools
import pyLDAvis
import pyLDAvis.gensim

from pprint import pprint

 

We can read in the corpus as a Sentence stream, reading in all of the .txt files from a directory, where each new line is a new document. With this code, each document is tokenized (i.e., split into a list of words), stripping punctuation and making all letters lowercase. 

 

class Sentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname), encoding="latin-1"):
                # strip punctuation, turn hyphens into underscores, collapse whitespace
                # and periods to single spaces, then tokenize on spaces
                cleaned = re.sub(r'[\"\'\|,:@#?!$"%&()*+=><^~`{};/@]', "", line.lower())
                cleaned = re.sub(r'[-]', "_", cleaned)
                cleaned = re.sub(r'[\s+]', " ", cleaned)
                cleaned = re.sub(r'[\.]', " ", cleaned)
                cleaned = re.sub(r'\s_\s', " ", cleaned)
                yield [x for x in cleaned.split() if x != " "]

sentences = Sentences("C:\\Users\\CommunityPosts\\")

 

Pre-processing for LDA is particularly important because each document is considered to be a collection of words and each word an individual data point. LDA does not consider the order or grammar of words. Without preprocessing, LDA will recognize Help, help, Helps, helps, HELP, HELP!!!!, and helping all as completely distinct words.

 

In addition to making all characters lowercase and stripping punctuation, we will be applying a collocation (phrases) model (featured in a recent data science blog post), conducting lemmatization, and filtering out frequently used words.

 

The bigram model is trained on the input data set using the Phrases function from the Gensim package. In this application, I loaded a Phrases model that I had previously trained on the Community corpus.

 

# phrases model and function
bigram = gensim.models.Phrases.load('bigram.model')
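
If you don't have a previously trained model to load, one can be fit directly on the sentence stream and saved for reuse. The min_count and threshold values below are placeholders, not the settings used for the Community model.

# train and save a phrases model from scratch - parameter values are illustrative
bigram = gensim.models.Phrases(sentences, min_count=20, threshold=10)
bigram.save('bigram.model')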

 

Lemmatization is the process of grouping the inflected forms of a word so that they can be analyzed as a single word or concept, identified by the word's lemma. It is a normalization process that allows words like dog, dogs, dog's, and dogs' to all be represented as dog, or the words is, are, and am to be represented as be. This is preferable to the similar process of stemming because lemmatization tends to produce more readable results. Because the words in the corpus will be used to define the topics, it is important to have interpretable and readable words. Note that spaCy will replace any personal pronouns (e.g., I or you) it identifies with -PRON-, and that this function converts the sentence stream input into a list of lists.

 

 

# Lemmatization with spaCy

# spacy 'en' model, keeping only tagger component (improves processing time)
lemmatizer = spacy.load('en', disable=['parser', 'ner'])
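
A quick sanity check on a throwaway sentence shows what the lemmatizer does; the exact output depends on the spaCy model version, so treat it as illustrative.

# lemmatize a sample sentence
print([token.lemma_ for token in lemmatizer("The dogs are helping")])
# expect something like: ['the', 'dog', 'be', 'help']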

 

After defining each of the data pre-processing models, we can apply them to our corpus, resulting in a clean, pre-processed text dataset. 

 

# apply bigrams
bigramstream = bigram[sentences]

# inception list comprehension to apply spaCy lemmatization to documents.
texts = [[token.lemma_ for token in lemmatizer(" ".join(document))] for document in bigramstream]

 

Now we can create a dictionary representation of the documents. A Dictionary object maps the text tokens (words) to their numerical IDs. This is a necessary step for implementation because most algorithms rely on numerical libraries that work with vectors indexed by integers, with known vector/matrix dimensionality.

 

 

# create dictionary of words in corpus
communitydict = corpora.Dictionary(texts) 
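
A couple of quick ways to peek inside the resulting dictionary (token2id and dfs are standard attributes of a Gensim Dictionary; 'workflow' is just an example lookup):

# number of unique tokens, the ID assigned to one word, and document frequencies
print(len(communitydict))
print(communitydict.token2id.get('workflow'))  # integer ID, or None if the word isn't present
print(communitydict.dfs)                       # maps token ID -> number of documents containing it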

 

The dictionary will contain all of the words that appear in the corpus, along with how many times they appeared. Once the corpus is in a dictionary, we can filter very rare or common words. If a word occurs in 80-90% of the documents, or in a very small subset of documents, it is probably not helpful for identifying a topic. In this code, we are filtering words that occur in fewer than 30 documents, or that occur in more than 10% of documents. We do this to ensure we are only getting meaningful, relevant words for topic definitions.

 

# filter words - remove rare and common tokens
communitydict.filter_extremes(no_below=30, no_above=0.1)

 

After all of our filtering, we end up with just under 4000 words to create topics with.

 

Dictionary(3982 unique tokens: ['accept', 'additional', 'address', 'alteryx_designer', 'aren_t']...)

 

As the last step before training the LDA model, we need to transform the documents into a vectorized, bag of words (bow) format, using the Gensim doc2bow() function and another list comprehension:

 

# term document frequency 
corpus = [communitydict.doc2bow(text) for text in texts]
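
Each entry in corpus is just a list of (token ID, count) pairs. A quick way to translate one back into readable words (the output shown in the comment is illustrative):

# inspect the bag-of-words representation of the first document
print(corpus[0][:5])                                      # e.g., [(0, 2), (3, 1), ...]
print([(communitydict[i], n) for i, n in corpus[0][:5]])  # the same pairs, with the words spelled out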

 

Now that we have finished data pre-processing, we can train our LDA model! The blog post from Mining the Details, Gensim LDA: Tips and Tricks is really helpful for understanding a few of these arguments.

 

# build LDA model
eval_every = None  # skip perplexity evaluation during training (it slows things down)
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=communitydict, num_topics=25, random_state=42, 
                                            update_every=1, chunksize=100, passes=50, alpha='auto', 
                                            per_word_topics=True, eval_every=eval_every, dtype=np.float64)

 

Now let's examine the topics that the LDA model found:

 

# gather topics and top words
top_topics = lda_model.top_topics(corpus, topn=20)

# print results in an attractive format
pprint(top_topics)

 

The printed top 20 words for each topic:

 

[([(0.020695580423620193, 'row'),
(0.020011242784347297, 'formula'),
(0.018599296583477785, 'column'),
(0.013588141693289426, 'regex'),
(0.01033067729605757, 'parse'),
(0.010259747704317868, 'fun'),
(0.009542918811638176, 'solution_to_read'),
(0.009325463434261069, 'multi_row'),
(0.00893292850468382, 'formula_tool'),
(0.008684688598017362, 'solve'),
(0.008338891165488293, 'approach'),
(0.008097284494434675, 'filter'),
(0.00780559544745668, 'here_is_my_solution'),
(0.007406627197261226, 'nice'),
(0.0070931028525079536, 'bit'),
(0.006705337772992282, 'match'),
(0.0066906038412741205, 'read_to_read'),
(0.006120161533681971, 'heres_my_solution'),
(0.006071555498838694, 'learn'),
(0.005686126404121368, 'iterative_macro')],
-1.4108212097779056),

([(0.023529394935676337, 'analyst'),
(0.023293772962139647, 'organization'),
(0.02074950702865459, 'big_data'),
(0.013074900906929829, 'insight'),
(0.012674802447542257, 'analysis'),
(0.012103346278365243, 'data_blend'),
(0.012068491149929726, 'deliver'),
(0.011564303190761377, 'predictive_analytic'),
(0.0094988304433296, 'advanced_analytic'),
(0.008610070755130156, 'today'),
(0.00819131421714101, 'tableau'),
(0.008087252822162536, 'visualization'),
(0.00733170057763311, 'blend'),
(0.0068817112025207645, 'self_service'),
(0.006730189006778379, 'market'),
(0.006678768115331216, 'line_of_business'),
(0.006631729356219946, 'webinar'),
(0.006357341376114067, 'platform'),
(0.006309106114531908, 'enable'),
(0.006306421730745754, 'data_scientist')],
-1.721910743907523),

([(0.028397373495646576, 'inspire'),
(0.01753733515424845, 'session'),
(0.009493662577449507, 'conference'),
(0.0074649721965511125, 'event'),
(0.007114559996494287, 'people'),
(0.006712520834152984, 'learn'),
(0.006039972992989308, 'user_group'),
(0.005917436693343427, 'ace'),
(0.005698164530144839, 'love'),
(0.0055362773389305816, 'pm'),
(0.004814769643636668, 'opportunity'),
(0.004803084409792049, 'attendee'),
(0.004776088161133497, 'meet'),
(0.004694738527392643, 'hear'),
(0.004447416654816952, 'track'),
(0.004423394767311015, 'story'),
(0.004388328760054483, 'world'),
(0.004260335477287717, 'blog'),
(0.004216587514543972, 'attend'),
(0.004140876625267237, 'partner')],
-1.781551485736102),

([(0.04582962916837095, 'model'),
(0.017801121912602684, 'r'),
(0.0166821773075764, 'variable'),
(0.009665793275730968, 'predict'),
(0.008406022738666993, 'sample'),
(0.008090882943418256, 'dataset'),
(0.008083387569849778, 'score'),
(0.007792287042981745, 'algorithm'),
(0.007045401610432731, 'county'),
(0.006704513469866865, 'plot'),
(0.006655225993652604, 'analysis'),
(0.006461535952912632, 'predictive'),
(0.006436212075801692, 'estimate'),
(0.0057658976121342555, 'measure'),
(0.005606005268008011, 'prediction'),
(0.005597889110967721, 'compare'),
(0.005439905883753477, 'figure'),
(0.005316314897221247, 'method'),
(0.005182810234879585, 'function'),
(0.005119507617503412, 'fit')],
-1.8039660246184777),

([(0.016435416683044362, 'platform'),
(0.011939336990657867, 'capability'),
(0.01182527739193118, 'project'),
(0.01137750863295518, 'system'),
(0.010172089523477552, 'spark'),
(0.007726167805147196, 'enterprise'),
(0.00760682833419796, 'data_source'),
(0.007526056029785542, 'application'),
(0.007498725999310362, 'release'),
(0.007461574875805607, 'integration'),
(0.0073297643148498215, 'software'),
(0.007235374944897183, 'scale'),
(0.006716181728931385, 'develop'),
(0.006400847988724821, 'enable'),
(0.006381861096018541, 'environment'),
(0.0058660035327072636, 'benefit'),
(0.005836886387815695, 'development'),
(0.005793268613330379, 'alteryx_designer'),
(0.005653141584067736, 'addition'),
(0.005529361747751699, 'analyst')],
-1.8231757853187545),

([(0.012847527264999938, 'achieve'),
(0.01242290347576571, 'analysis'),
(0.012023107606175467, 'month'),
(0.009944791535134894, 'project'),
(0.009455050549353035, 'dashboard'),
(0.009441557862967754, 'week'),
(0.009334705908028765, 'employee'),
(0.00882888310107117, 'describe_the_benefit'),
(0.008816891219323915, 'excel'),
(0.008672456183038807, 'hour'),
(0.007560416660977143, 'daily'),
(0.007021011232387925, 'system'),
(0.006395068490289318, 'forecast'),
(0.0061956780656895775, 'identify'),
(0.006119358626399816, 'tableau'),
(0.0058087875387583835, 'metric'),
(0.0057460452372290315, 'visualization'),
(0.005389743293058159, 'needed_to_solve'),
(0.005389737583023549, 'describe_the_work'),
(0.0052557832159785415, 'level')],
-1.8406182817019319),

([(0.019288731425265434, 'exam'),
(0.019233528455285892, 'training'),
(0.015022157539702112, 'student'),
(0.014328955342406169, 'video'),
(0.013779646291934091, 'learn'),
(0.013529429896529171, 'certification'),
(0.012711780634151357, 'course'),
(0.012433822408172989, 'review'),
(0.011349657095873635, 'program'),
(0.010671828902499335, 'pass'),
(0.010129205244743308, 'partner'),
(0.009885398897719236, 'resource'),
(0.008603037213350363, 'designer'),
(0.008566844865150722, 'entry'),
(0.008478492741067089, 'alteryx_designer'),
(0.0065736806320999385, 'guide'),
(0.006572841959007066, 'section'),
(0.006347083619906077, 'content'),
(0.0062897452072069185, 'receive'),
(0.006056886990451579, 'project')],
-1.916534307762919),

([(0.022061384916288795, 'column'),
(0.01506781065568456, 'filter'),
(0.010901909596177942, 'null'),
(0.010184214446298063, 'row'),
(0.009825100861470537, 'property'),
(0.0096752280391286, 'box'),
(0.009039938646575745, 'return'),
(0.00870333746467137, 'stream'),
(0.008113928146452686, 'match'),
(0.007238700996815709, 'specify'),
(0.007175289201779777, 'sort'),
(0.0068436021829196845, 'count'),
(0.006643137058434104, 'remove'),
(0.006609407558053242, 'rename'),
(0.006370051182194785, 'choose'),
(0.006369594234147002, 'drop_down'),
(0.006309660514046302, 'default'),
(0.005837265779177482, 'append'),
(0.00565126404674956, 'replace'),
(0.005613422932795999, 'apply')],
-1.9480479919961944),

([(0.013555887514475881, 'insight'),
(0.012030134358668891, 'sale'),
(0.009743029947684622, 'marketing'),
(0.00846782553470163, 'drive'),
(0.008359318644964689, 'store'),
(0.007673901230956347, 'service'),
(0.007602726307965364, 'improve'),
(0.006852587734199537, 'retail'),
(0.0063246859047269375, 'lead'),
(0.005869905240850228, 'tableau'),
(0.005794704264613193, 'analysis'),
(0.005790765584248087, 'industry'),
(0.005786696319230313, 'survey'),
(0.005377876827751532, 'qlik'),
(0.005324660387428519, 'decision'),
(0.005196028980445149, 'retailer'),
(0.005078400860568244, 'target'),
(0.004994194365520417, 'hour'),
(0.004965243436498029, 'impact'),
(0.004914220593379361, 'identify')],
-2.005099755226644),

([(0.027895288990154428, 'rank'),
(0.020017709062860975, 'weekly_challenge'),
(0.015426679699583229, 'week'),
(0.014881599576766805, 'welcome'),
(0.011939536765824732, 'love'),
(0.011620230044398001, 'seanadam'),
(0.011456735159730128, 'awesome'),
(0.011183575166816954, 'badge'),
(0.010509118841686451, 'top'),
(0.010110667462712825, 'patrick_digan'),
(0.009032136645339724, 'intermediate'),
(0.008790114318587821, 'ranking'),
(0.00808137117998476, 'lordneillord'),
(0.007633131862798772, 'everyone'),
(0.007493915379577319, 'analysis'),
(0.007282216928886164, 'range'),
(0.0072166391600903945, 'solve'),
(0.0069630130481704265, 'tableau'),
(0.0069372701948944146, 'generate_row'),
(0.006628733935085583, 'learn')],
-2.009784958498937),

([(0.02282930627361656, 'log'),
(0.021290932339173358, 'credential'),
(0.018666786982149318, 'schedule'),
(0.016722711097457068, 'password'),
(0.015931181013013055, 'service'),
(0.015082916244396406, 'designer'),
(0.01375384352871289, 'account'),
(0.010572146355710677, 'window'),
(0.010556573243958955, 'setting'),
(0.010314391908109817, 'admin'),
(0.010026541153535297, 'permission'),
(0.00970915856077299, 'default'),
(0.009202573402148926, 'authentication'),
(0.008927864917945658, 'security'),
(0.008845819783475515, 'collection'),
(0.008602824336617541, 'port'),
(0.008571780912760712, 'machine'),
(0.00814380320750843, 'administrator'),
(0.007836514858975643, 'system'),
(0.007248994014152426, 'scheduler')],
-2.0495103296716133),

([(0.01802969309662748, 'word'),
(0.017863542201158964, 'module'),
(0.016606304535601107, 'excel'),
(0.008113860546990611, 'sheet'),
(0.007714503263499146, 'letter'),
(0.006281641923357433, 'actually'),
(0.006192337984885741, 'cell'),
(0.005895214074471618, 'browse_tool'),
(0.00589002249826699, 'always'),
(0.0057877700740173385, 'wizard'),
(0.0054475115528324165, 'little'),
(0.005433975323120903, 'step'),
(0.005103032953959196, 'browse'),
(0.004470599360101434, 'excel_file'),
(0.004394869429962214, 'love'),
(0.004184357069916334, 'big'),
(0.004141542195552999, 'import'),
(0.004135197847252612, 'count'),
(0.0041286903438157805, 'next'),
(0.00403341054881167, 'line')],
-2.1534038598868848),

([(0.03904195664475774, 'connector'),
(0.027253597087080347, 'release'),
(0.025221246928646968, 'page'),
(0.016437391025590206, 'feedback'),
(0.012711546769212946, 'profile'),
(0.011181633281273603, 'beta'),
(0.010297662448368114, 'ill'),
(0.009107671858798487, 'link'),
(0.008319855635100333, 'content'),
(0.007495004190944284, 'message'),
(0.007378890245776566, 'search'),
(0.007110820252288945, 'site'),
(0.006955607854199844, 'fix'),
(0.00691905899633175, 'request'),
(0.006399952284884274, 'suggestion'),
(0.005694276081227593, 'next'),
(0.005574388781942174, 'iteration'),
(0.0053908470928987155, 'loop'),
(0.005064950034618061, 'load'),
(0.005002775014107961, 'send')],
-2.253050772121795),

([(0.024545496709857575, 'install'),
(0.01786678939815506, 'python'),
(0.017745668124866384, 'self'),
(0.016724787229990948, 'package'),
(0.015618180606871198, 'folder'),
(0.01390659649233459, 'instal'),
(0.013335132307843339, 'r'),
(0.013210480218332963, 'c'),
(0.012548883568317423, 'xml'),
(0.01216600462702795, 'directory'),
(0.01015407214083796, 'path'),
(0.008122259217342347, 'engine'),
(0.007787437600078699, 'sdk'),
(0.007244828683763851, 'html'),
(0.007080322950206309, 'zip'),
(0.006816460772571588, 'class'),
(0.006728831708004009, 'plugin'),
(0.006392274476091252, 'c_program'),
(0.00635033805846674, 'object'),
(0.006013830355238557, 'script')],
-2.2585019080653064),

([(0.024659284232081496, 'driver'),
(0.011270367782498587, 'fix'),
(0.008946172622370054, 'hive'),
(0.008903154249793094, 'query'),
(0.008558420786005948, 'sql_server'),
(0.007875338213237493, 'defect'),
(0.007745099306329093, 'behavior'),
(0.007504870201803241, 'odbc'),
(0.007414758641639832, 'prospect'),
(0.007203854366529731, 'oracle'),
(0.007057695452217622, 'alexp'),
(0.007045935826742067, 'appear'),
(0.006927480811936405, 'detail'),
(0.006905649861485216, 'confirm'),
(0.006828666667092499, 'hdfs'),
(0.006801562482748226, 'error_message'),
(0.006728591389620027, 'documentation'),
(0.006652813878068282, 'bug'),
(0.006223445392748984, 'fail'),
(0.00584415838450598, 'log')],
-2.3810551929036263),

([(0.031758965528068554, 'asset'),
(0.02120560563107027, 'search'),
(0.018082370300320283, 'metadata'),
(0.017181671858312594, 'promote'),
(0.01566719795269047, 'demo'),
(0.015293621646370026, 'loader'),
(0.014495601238535879, 'environment'),
(0.011481341473037828, 'source'),
(0.010053640175506261, 'reference'),
(0.009645354502835565, 'canvas'),
(0.009056033699112547, 'owner'),
(0.008628383771088686, 'description'),
(0.008024030837530904, 'functionality'),
(0.006933964969924857, 'tag'),
(0.006659074331679673, 'document'),
(0.006098293215902485, 'system'),
(0.006048104039401445, 'content'),
(0.005870741728806124, 'admin'),
(0.005788075425153498, 'publish'),
(0.005703387759056851, 'prod')],
-2.3865825428335468),

([(0.04942393810148589, 'grand_prix'),
(0.02491047581651823, 'image'),
(0.017455811192708934, 'race'),
(0.01615115890870063, 'final'),
(0.015906125843627637, 'city'),
(0.015896683142603542, 'win'),
(0.015703110588676208, 'event'),
(0.015342057665039235, 'compete'),
(0.015339381367534849, 'round'),
(0.013914600476887884, 'ticket'),
(0.013435255630428742, 'user_group'),
(0.011449937536932535, 'trial'),
(0.010579003085890262, 'driver'),
(0.007662748356969642, 'match'),
(0.007585118765414722, 'inspire'),
(0.007532944860979533, 'stage'),
(0.007391580670059886, 'contestant'),
(0.006776712987148872, 'solve'),
(0.006707066534228501, 'winner'),
(0.006590240187135858, 'competitor')],
-2.4185218305971214),

([(0.02047134108296696, 'default'),
(0.018989227768861354, 'window'),
(0.0170840958703854, 'tab'),
(0.015814793591120604, 'display'),
(0.015174418579969474, 'text'),
(0.014756840136650124, 'expression'),
(0.01440625692612525, 'formula'),
(0.012951323990154933, 'column'),
(0.011737907631913678, 'interface'),
(0.010818014853072988, 'setting'),
(0.010644297031936097, 'size'),
(0.010151009800716898, 'screen'),
(0.00985912936931363, 'color'),
(0.008615780310039454, 'formula_tool'),
(0.008487271348873235, 'icon'),
(0.008427522996014084, 'map'),
(0.008192985857073098, 'annotation'),
(0.008102375512627807, 'render'),
(0.008057803428295417, 'button'),
(0.007711310691824729, 'label')],
-2.4233215372355676),

([(0.024413632545300868, 'query'),
(0.01995603374894747, 'sql'),
(0.01664329495406438, 'cache'),
(0.014910256733614461, 'functionality'),
(0.013741918852647837, 'https_community'),
(0.010511589974902836, 'load'),
(0.010273205924103402, 'db'),
(0.009946224455696909, 'input_tool'),
(0.008456770623640487, 'batch_macro'),
(0.008423846126333416, 'size'),
(0.00811928466606048, 'best_alex'),
(0.008038399584775742, 'limit'),
(0.007914329598344289, 'request'),
(0.007864903883069664, 'product_idea'),
(0.0073961131032875965, 'memory'),
(0.007024018465519146, 'agree'),
(0.006890446344038307, 'roadmap'),
(0.006761656707479195, 'plan'),
(0.00659176740486827, 'idi_p'),
(0.0064387761790198495, 'comment')],
-2.4274838035584283),

([(0.02320856572858142, 'email'),
(0.020989964339043227, 'job'),
(0.016200339673233276, 'flow'),
(0.01398790470586787, 'container'),
(0.011043461988997435, 'send'),
(0.009778731618173326, 'control'),
(0.009542223144129763, 'message'),
(0.009472813641874973, 'scheduler'),
(0.008948035119530842, 'disable'),
(0.008546722865525866, 'canvas'),
(0.008390033075692398, 'event'),
(0.007810812320858166, 'schedule'),
(0.007444832809070567, 'log'),
(0.007073789312437564, 'module'),
(0.007019597769235246, 'tool_container'),
(0.006559405832579771, 'comment'),
(0.006074241503789276, 'execution'),
(0.005211194455515212, 'execute'),
(0.00467787101837695, 'path'),
(0.004364314612224897, 'attachment')],
-2.450978675787028),

([(0.06665800962342187, 'tableau'),
(0.057376892621773895, 'api'),
(0.018353050984508554, 'download_tool'),
(0.017995436978036915, 'request'),
(0.01701764340446542, 'url'),
(0.01577482094421849, 'tableau_server'),
(0.015383426437480996, 'salesforce'),
(0.012123611951174831, 'extract'),
(0.011203206336149093, 'publish'),
(0.010675425349768139, 'tde'),
(0.010575267728114852, 'connector'),
(0.010256876853115162, 'web'),
(0.008434725136922219, 'data_source'),
(0.008296321308519447, 'response'),
(0.007720550903913831, 'article'),
(0.007714263049381092, 'header'),
(0.007558523695571444, 'publish_to_tableau'),
(0.007535466127962625, 'pull'),
(0.007380845761257233, 'apis'),
(0.007174917619766898, 'application')],
-2.583902322391514),

([(0.06369381967122531, 'com'),
(0.03152075330584901, 'https'),
(0.03119725010171633, 'link'),
(0.026943384962905097, 'html'),
(0.02234204179077005, 'http'),
(0.01554187173938344, 'http_www'),
(0.015062105871916816, 'blog'),
(0.014217935940797113, 'https_www'),
(0.012483849683606342, 'microsoft'),
(0.012128598879016987, 'text'),
(0.011920192742918905, 'page'),
(0.009712201768922292, 'kit'),
(0.008985614236123048, 'sa'),
(0.008951883357857227, 'htm'),
(0.008096327439667511, 'site'),
(0.007938546730115492, 'software'),
(0.007625413654930265, 'sap'),
(0.007409083464900269, 'nice'),
(0.007164831816411496, 'twitter'),
(0.006957077460469511, 'guide')],
-2.6087174082506945),

([(0.0706921447402962, 'date'),
(0.05201292202050065, 'function'),
(0.039226998401143874, 'string'),
(0.02106683226390238, 'character'),
(0.020954387118662175, 'formula'),
(0.017645875273058227, 'convert'),
(0.01297307442273291, 'return'),
(0.011485481010413813, 'json'),
(0.011470698839382974, 'datetime'),
(0.010912881177127244, 'true'),
(0.010246369044114965, 'b'),
(0.010047080499148755, 'text'),
(0.0097305907238132, 'false'),
(0.008813046389350548, 'date_time'),
(0.008748469794279162, 'label'),
(0.008607225527480984, 'sharepoint'),
(0.008380575999835726, 'month'),
(0.008189512957736365, 'delimiter'),
(0.007511947118864771, 'c'),
(0.007100720793213718, 'formula_tool')],
-3.0002983767192415),

([(0.05308197722717732, 'spatial'),
(0.03574348000224324, 'map'),
(0.027175970447702754, 'location'),
(0.02294647018279071, 'distance'),
(0.018036792284405058, 'polygon'),
(0.01577737622313117, 'area'),
(0.014789113213357059, 'spatial_object'),
(0.012250661440205642, 'line'),
(0.011913185239509499, 'spatial_tool'),
(0.01062605325303253, 'store'),
(0.010070021021823854, 'trade_area'),
(0.008825127591926803, 'segment'),
(0.00823172841914915, 'mile'),
(0.00813278539150798, 'calculate'),
(0.00799272522992367, 'layer'),
(0.007124261409059375, 'fuzzy_match'),
(0.006655820338436589, 'grid'),
(0.006186494652726671, 'coordinate'),
(0.006123266625077015, 'centroid'),
(0.0061017099573550644, 'determine')],
-3.102858446713829),

([(0.03721718230896204, 'license'),
(0.02648351944300455, 'address'),
(0.013315170047721132, 'install'),
(0.01162167847634791, 'machine'),
(0.010676755247336793, 'cass'),
(0.010653426567294764, 'system'),
(0.00999560547856418, 'node'),
(0.00976910335386901, 'instal'),
(0.009740523527010787, 'mongodb'),
(0.008505446726042622, 'scale'),
(0.008344881667224788, 'core'),
(0.0074009993168286355, 'deployment'),
(0.007283976747644404, 'q'),
(0.006626330850762665, 'instance'),
(0.006175828207319153, 'designer'),
(0.006123121926560843, 'upgrade'),
(0.006062543451688333, 'geocoder'),
(0.005988684649067714, 'worker'),
(0.005629863956467686, 'mongo'),
(0.005581365691623821, 'engine')],
-3.272487325100349)]

Visualization

 

The following code uses the pyLDAvis package to create the really slick interactive visualization included below:

 

# Visualize the topics
vis = pyLDAvis.gensim.prepare(lda_model, corpus, communitydict, sort_topics = False)

pyLDAvis.save_html(vis, 'LDA_Visualization.html')
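
If you are working in a Jupyter notebook, the same prepared object can also be rendered inline rather than (or in addition to) being written out to an HTML file:

# render the visualization inline in a Jupyter notebook
pyLDAvis.enable_notebook()
pyLDAvis.display(vis)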


Evaluation and Interpretation

 

Because LDA is an unsupervised algorithm, there is no inherent way to evaluate the model. Often, the best bet is to evaluate how the unsupervised model improves the task it is actually going to be applied to. Another option for assessing topic coherence is to add an extra word from a different topic to a group of words from a topic, or an extra (low-probability) topic to a document, and play "one of these things is not like the other." If you would like to read more about this method for evaluating topic models, there is a really great (and fabulously titled) article called Reading Tea Leaves: How Humans Interpret Topic Models.
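
If you do want a quantitative signal, Gensim also ships a CoherenceModel that scores the topics against the pre-processed texts. The 'c_v' measure below is just one commonly used option; higher scores generally line up with more interpretable topics.

from gensim.models import CoherenceModel

# score topic coherence against the lemmatized, bigrammed texts
coherence_model = CoherenceModel(model=lda_model, texts=texts,
                                 dictionary=communitydict, coherence='c_v')
print(coherence_model.get_coherence())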

 

To wrap up this blog post, we will run the text of the blog (that I've written up to this point) through the LDA model and see what its topic composition is.

 

# import article to evaluate
with open('blogtext.txt', 'r') as myfile:
    data=myfile.read().replace('\n', '')

# match pre-processing for LDA model
stripped_tokens = [x for x in re.sub(r'\s_\s', " ", re.sub(r'[\.]', " ",
    re.sub(r'[\s+]', " ", re.sub(r'[-]', "_", re.sub(r'[\[\]\\\"\'\|,:@#?!$"%&()*+=><^~`{};/@]', "",
    data.lower()))))).split() if x != " "]

# apply the bigram (phrases) model
data_bigrams = bigram[stripped_tokens]

# lemmatize with spaCy
data_lem = [token.lemma_ for token in lemmatizer(" ".join(data_bigrams))]

# convert to the bag-of-words vector format
bow_vector = communitydict.doc2bow(data_lem)

# apply the LDA model
lda_vector = lda_model[bow_vector]

 

The return: 

 

[(1, 0.22018786655297065), 
(2, 0.23611089509765562),
(10, 0.09713624298964861),
(11, 0.04807226742568623),
(13, 0.05680959617218571),
(15, 0.16072888140435992),
(21, 0.013551096319205215),
(25, 0.1670280249108196)]

 

As humans, we tend to want to apply labels to things. It's part of what helps create context for the world around us. Given this is a topic model, it might feel really intuitive and, well, obvious to immediately apply labels to the topics the LDA model has found. However, this is not always the most productive thing to do because the topics created by the LDA model are not necessarily human-readable, and they may mean more than what we interpret them to mean. Many applications of LDA models don't require topic labels, and rather compare articles based on similar compositions of the (unlabeled) topics.

 

With all of that in mind, I will leave the interpretation of these topics up to you. In the return from the LDA model for this article, the first number indicates the topic label (which corresponds to the topic numbers in the interactive visualization), and the second number is the relative proportion of words in this post that belong to that topic (e.g., Topic 2, which I have interpreted as a predictive topic, makes up ~24% of this post).
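
If you want to see what a given topic number actually contains before deciding whether (or how) to name it, you can pull its highest-probability words straight from the trained model:

# top ten words for one of the topics assigned to this post
print(lda_model.show_topic(2, topn=10))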

 

What do you think? Does the topic distribution of this article make sense to you? Please remember to interpret and apply labels with care 🙂 

 

If you would like additional code-based examples of LDA, there are a couple of really cool step-by-step examples of LDA in Python with Gensim that you can work through here and here. If you’re more of an R soul, there are a couple of great resources for LDA topic modeling here and here.

Sydney Firmin

A geographer by training and a data geek at heart, Sydney joined the Alteryx team as a Customer Support Engineer in 2017. She strongly believes that data and knowledge are most valuable when they can be clearly communicated and understood. She currently manages a team of data scientists that bring new innovations to the Alteryx Platform.


Comments
suli
9 - Comet

Hi Sydney,

 

Great post. I have been looking for an LDA solution in Alteryx. Do you know when it will be available as a standalone Alteryx tool? 

 

Regards,

 

PS: You can vote for an LDA tool in the Ideas section here: https://community.alteryx.com/t5/Alteryx-Designer-Ideas/Text-mining-topic-modeling-Latent-Dirichlet-...

SydneyF
Alteryx Alumni (Retired)

Hi @suli,

 

Posting to the ideas board is a great way to get your requests added to the product roadmap! I went ahead and starred your post - the more stars your idea gets, the more likely it is to be considered by a product manager. In the meantime, I would suggest leveraging the Python tool (or R tool) and some custom code. There are a bunch of great resources across the internet on LDA with the Gensim python package to help you get started. 

 

Thanks,

 

Sydney

 

Topic Modeling for Fun and Profit

Pre-processing and training LDA

LDA Tutorial in Gensim Documentation