
Data Science Blog

Machine learning & data science for beginners and experts alike.
Community Content Engineer

Understanding the topic of a piece of writing is typically an easy task for people. Based on the title of this article, or the context of what type of stuff I have been writing lately, you may have already deduced this blog post is going to be about Natural Language Processing on the many posts and articles of the Alteryx Community. Part of what makes topic deduction easier for people is context. By living our rich and thrilling lives, we gain context on the ways of the world and the types of things that get written about. Understanding the topic of something is a little trickier for computers, which tend to live in boxes and lack the rich context and connections that people have.


However, there are times where we need to train our computers to find topics in a collection of documents. There might be too many documents for you, a single human, to read through, or you may be interested in discovering underlying themes in a large set of texts.




LDA Overview


Enter Latent Dirichlet (pronounced something like “Deer-ish Sleigh”) Allocation, a popular model for Topic Modeling. Latent Dirichlet Allocation (LDA) is a Bayesian network that models how documents in a corpus are topically related. LDA is a way to cluster discrete data where each observation can belong to more than one cluster. It is an unsupervised machine learning algorithm. 


Before we get into how the model works, let's frame this article with the following definitions. Documents (articles, posts, etc.) are made up of topics. These topics are made up of words. Documents can be made up of any combination of topics, where each topic is represented as a probability distribution over a set of words.


In the mind of an LDA model, documents are written by first determining what topics the article is going to be written about as a percentage break-down (e.g., 20% Python, 40% NLP, 10% Puppies, and 30% Alteryx Community), and then filling up the document with words (until the specified length of the document is reached) that belong to each topic. For an LDA model, context doesn’t matter, only the distribution of words. Each document in a corpus is effectively a bag of words.
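
To make that generative story concrete, here is a minimal sketch of how an LDA model imagines a document gets written. The topic names and proportions come from the example above, but the words and their probabilities are made up for illustration, and `generate_document` is a hypothetical helper, not part of any library.

```python
import random

# Hypothetical toy topics: each maps a few words to made-up probabilities.
topics = {
    "python": {"import": 0.5, "function": 0.3, "package": 0.2},
    "nlp": {"token": 0.4, "corpus": 0.4, "lemma": 0.2},
    "puppies": {"fetch": 0.6, "bark": 0.4},
    "community": {"post": 0.5, "badge": 0.3, "reply": 0.2},
}

# The per-document topic breakdown from the example in the text.
doc_topic_mix = {"python": 0.2, "nlp": 0.4, "puppies": 0.1, "community": 0.3}


def generate_document(topic_mix, topics, length, seed=0):
    """Generate a bag of words the way LDA assumes documents are written."""
    rng = random.Random(seed)
    words = []
    for _ in range(length):
        # 1. pick a topic according to the document's topic proportions
        topic = rng.choices(list(topic_mix), weights=list(topic_mix.values()))[0]
        # 2. pick a word according to that topic's word distribution
        word_probs = topics[topic]
        word = rng.choices(list(word_probs), weights=list(word_probs.values()))[0]
        words.append(word)
    return words


doc = generate_document(doc_topic_mix, topics, length=12)
print(doc)
```

Notice that word order carries no meaning here: each word is drawn independently, which is exactly the bag-of-words assumption.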


Given how an LDA model thinks a document is written, we can think about how it creates topic models. LDA attempts to work backwards based on this generative model to identify the topics that were used to generate the corpus. 


To enable the LDA model to “solve backwards,” we need to give it a few parameters to go on. We need to supply it with the number of topics it is creating, as well as a beta value and an alpha value, where the beta value is the parameter of the uniform Dirichlet prior on the per-topic word distribution, and the alpha value is the parameter of the uniform Dirichlet prior on the per-document topic distribution. If that all seemed like gibberish, don’t worry too much. What you need to know is that a high alpha makes documents appear more similar to one another (each document will be a mixture of most of the topics), and a low alpha makes each document more homogeneous (containing high proportions of just a few topics). Similarly, a high beta makes topics appear more similar to each other by making each topic a mixture of most of the words in the corpus, whereas a low beta makes each topic a mixture of just a few of the words. If you would like to understand the math a little better, there are nice explanations of Dirichlet distributions here and here.
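
If you would like to see the effect of alpha for yourself, the sketch below draws a few per-document topic mixtures from a symmetric Dirichlet prior using only the standard library. The alpha values (0.1 and 10.0) and the five topics are arbitrary choices for illustration, and `sample_dirichlet` is a hypothetical helper.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics.

    Uses the standard construction: draw k Gamma(alpha, 1) values and normalize.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(42)
k = 5  # number of topics (arbitrary for this sketch)
for alpha in (0.1, 10.0):
    print(f"alpha = {alpha}")
    for _ in range(3):
        mix = sample_dirichlet(alpha, k, rng)
        print("  ", [round(p, 2) for p in mix])
```

With the low alpha, most of each draw tends to pile onto one or two topics; with the high alpha, the draws tend to sit close to an even split. Beta plays the same role for the words within a topic.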


The structure of LDA is often expressed in Plate Notation, which is a way to represent a graphical model as an illustration, where groups of variables are repeated together.


In this visualization, the arrows indicate dependencies (e.g., the word used is dependent on beta and the topic of the word). The shaded circle indicates that a variable is observable, and the empty circles indicate a latent variable.




Once we have provided the necessary parameters, the LDA model kicks off by randomly assigning each word in each document to one of k topics (however many you specified). It then works through the words one at a time, assuming that all of the other topic assignments in the model are correct except for the one it is currently refining. For that word, it calculates the proportion of words in the document that are currently assigned to each topic (the probability of topic t in document d) and the proportion of assignments to each topic, over all documents, that come from that word (the probability that topic t generated word w), and it reassigns the word based on the product of the two. It keeps shifting words between topics in this way until a stable state is reached where the assignments make sense.
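
The reassignment loop described above is usually called collapsed Gibbs sampling. Here is a toy, standard-library-only sketch of it, meant to show the mechanics rather than to be fast or production-ready. The function name `gibbs_lda`, the default alpha and beta values, and the tiny corpus are all assumptions for illustration. Note that Gensim's LdaModel is based on online variational Bayes rather than this sampler, so this is not what runs under the hood later in this post.

```python
import random


def gibbs_lda(docs, k, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    v = len(vocab)

    # count tables
    doc_topic = [[0] * k for _ in docs]  # words in doc d assigned to topic t
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(k)]  # word w in topic t
    topic_total = [0] * k  # total words assigned to topic t

    # random initial assignment of every word to a topic
    assignments = []
    for d, doc in enumerate(docs):
        doc_assign = []
        for w in doc:
            t = rng.randrange(k)
            doc_assign.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assignments.append(doc_assign)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # remove this word's assignment; treat all others as correct
                t = assignments[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1

                # weight per topic: p(topic | document) * p(word | topic)
                weights = [
                    (doc_topic[d][t2] + alpha)
                    * (topic_word[t2][w] + beta) / (topic_total[t2] + v * beta)
                    for t2 in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]

                # record the new assignment
                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1

    return doc_topic, topic_word


docs = [
    ["row", "column", "formula", "row", "column"],
    ["formula", "column", "row", "formula"],
    ["map", "spatial", "distance", "map", "spatial"],
    ["spatial", "distance", "map", "distance"],
]
doc_topic, topic_word = gibbs_lda(docs, k=2)
for d, counts in enumerate(doc_topic):
    print(d, counts)  # how many words of document d landed in each topic
```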


If this overview isn’t doing it for you, there is a collection of helpful resources scattered across the internet. There is this overview of the LDA Algorithm, a slightly longer video with an applied example, and an even longer lecture from a professor at CU Boulder (Sko Buffs!). If you’re in the mood for a written document, there is a fun article that describes LDA with emojis from Medium.


Topic Modeling on the Community


For Topic Modeling on the Community, we will be using the Python Gensim module. In addition to Gensim, we will be using spaCy for its lemmatization feature, nltk for a list of stopwords to remove from our texts, numpy, and a really neat package called pyLDAvis to create a visualization at the end.


The first step is to import all the required packages.


import os
import re
import numpy as np

import gensim
import gensim.corpora as corpora

#spacy for lemmatization
import spacy

#nltk stopwords
from nltk.corpus import stopwords

# plotting tools
import pyLDAvis.gensim  # newer pyLDAvis releases rename this module to pyLDAvis.gensim_models
from pprint import pprint


We can read in the corpus as a Sentence stream, reading in all of the .txt files from a directory, where each new line is a new document. With this code, each document is tokenized (i.e., split into a list of words), stripping punctuation and making all letters lowercase. 


class Sentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            # open each file in a context manager so it is closed after reading
            with open(os.path.join(self.dirname, fname), encoding="latin-1") as f:
                for line in f:
                    yield [x for x in re.sub(r'\s_\s', " ", re.sub(r'[\.]', " ", re.sub(r'[\s+]', " ", re.sub(r'[-]', "_", re.sub(r'[\"\'\|,:@#?!$"%&()*+=><^~`{};/@]', "", line.lower()))))).split() if x != " "]

sentences = Sentences("C:\\Users\\CommunityPosts\\")


Pre-processing for LDA is particularly important because each document is considered to be a collection of words and each word an individual data point. LDA does not consider the order or grammar of words. Without preprocessing, LDA will recognize Help, help, Helps, helps, HELP, HELP!!!!, and helping all as completely distinct words.
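
As a quick illustration of that last point, the sketch below counts how many distinct "words" a model would see before and after a simple cleanup. The `normalize` helper is hypothetical and deliberately simpler than the regular expressions used in this post.

```python
import re

raw = ["Help", "help", "Helps", "helps", "HELP", "HELP!!!!", "helping"]


def normalize(token):
    """Lowercase a token and strip everything except letters, digits, and underscores."""
    return re.sub(r"[^a-z0-9_]", "", token.lower())


normalized = [normalize(t) for t in raw]
print(len(set(raw)))            # distinct surface forms before cleaning
print(sorted(set(normalized)))  # distinct forms after cleaning
```

Lowercasing and stripping punctuation still leave a few inflected forms behind, which is where lemmatization comes in.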


In addition to making all characters lowercase and stripping punctuation, we will be applying a collocation (phrases) model (featured in a recent data science blog post), conducting lemmatization, and filtering out frequently used words. 


The bigram model is trained on the input data set using the Phrases function from the Gensim package. In this application, I loaded a Phrases model that I had previously trained on the Community corpus and then defined a function that applies the model to the tokenized documents. 


# phrases model and function
bigram = gensim.models.Phrases.load('bigram.model')

def make_bigrams(texts):
    return [bigram[doc] for doc in texts]
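
If you are curious what a phrases model is doing conceptually, here is a rough standard-library sketch: count how often two words appear side by side, score the pair relative to how common each word is on its own, and join pairs that score above a threshold. The scoring formula is written in the spirit of Gensim's default, but the thresholds, helper names, and toy corpus are assumptions for illustration, not the trained Community model.

```python
from collections import Counter


def find_phrases(docs, min_count=2, threshold=1.0):
    """Return adjacent word pairs that co-occur often enough to be merged."""
    unigrams = Counter(w for doc in docs for w in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)
    phrases = set()
    for (a, b), n_ab in bigrams.items():
        # pairs seen more often than their individual words would suggest score high
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            phrases.add((a, b))
    return phrases


def apply_phrases(doc, phrases):
    """Join detected pairs with an underscore, scanning left to right."""
    out, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and (doc[i], doc[i + 1]) in phrases:
            out.append(doc[i] + "_" + doc[i + 1])
            i += 2
        else:
            out.append(doc[i])
            i += 1
    return out


docs = [
    ["formula", "tool", "is", "great"],
    ["use", "a", "formula", "tool", "here"],
    ["the", "formula", "tool", "works"],
    ["this", "macro", "is", "new"],
]
phrases = find_phrases(docs)
print(apply_phrases(docs[1], phrases))
```

This is how tokens such as formula_tool end up being treated as a single word in the topics later on.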


Lemmatization is the process of grouping the inflected forms of a word so that they can be analyzed as one word or concept, identified by the word's lemma. This is preferable to the similar process of stemming because lemmatization tends to produce more readable results. Because the words in the corpus will be used to define the topics, it is important to have interpretable and readable words. 


# Lemmatization with spaCy
def lemmatization(texts, allowed_postages=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """Lemmatize each document, keeping only tokens whose part of speech is allowed."""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postages])
    return texts_out


Stop words should be removed because they don't carry any important information about documents. In addition to the default stopwords from nltk, I added a short list of words relevant to the Community.


# nltk stopwords
stop_words = stopwords.words('english')
stop_words.extend(['use', 'thank', 'get', 'see', 'look', 'know', 'hi', 'hello', 'thanks', 'be', 'have', 'help', 'make',
                   'not', 'anyone_seen', '-PRON-', 'nbsp', 'alteryx', '_', '|'])

def remove_stopwords(texts):
    return [[word for word in doc if word not in stop_words] for doc in texts]


After defining each of the preprocessing functions, we can apply them to our corpus, resulting in a clean, pre-processed text dataset.  


# make bigrams
data_bigrams = make_bigrams(sentences)

# spacy 'en' model, keeping only tagger component
# (spaCy 3+ dropped the 'en' shortcut; load a named model such as 'en_core_web_sm' there)
nlp = spacy.load('en', disable= ['parser', 'ner'])

# lemmatize model
data_lem = lemmatization(data_bigrams, allowed_postages=['NOUN', 'ADJ', 'VERB', 'ADV'])

# remove stopwords
data_final = remove_stopwords(data_lem)

# remove empty documents
texts = [x for x in data_final if x != []]


Now we can create a dictionary representation of the documents. A Dictionary object maps the text tokens (words) to their numerical IDs. This is a necessary step for implementation because most algorithms rely on numerical libraries that work with vectors indexed by integers, with known vector/matrix dimensionality.


# create dictionary
id2word = corpora.Dictionary(texts) 


The dictionary will contain all of the words that appear in the corpus, along with how many times they appeared. Once the corpus is in a dictionary, we can filter very rare or common words. If a word occurs in 80-90% of the documents, or in a very small subset of documents, it is probably not helpful for identifying a topic. In this code, we are filtering words that occur in fewer than 20 documents, or that occur in more than 10% of documents. We do this in addition to filtering stopwords to ensure we are only getting meaningful, relevant words for topic definitions.


# filter words - remove rare and common tokens
id2word.filter_extremes(no_below=20, no_above=0.1)
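
To show the idea behind this filter, here is a small standard-library sketch that applies the same two rules (a minimum number of documents and a maximum fraction of documents) to a toy corpus. The function name `filter_by_doc_frequency` and the data are hypothetical, and the real Gensim method has additional options that are not reproduced here.

```python
from collections import Counter


def filter_by_doc_frequency(texts, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in no more than
    `no_above` (a fraction) of all documents."""
    # count each token once per document it appears in
    doc_freq = Counter(token for doc in texts for token in set(doc))
    max_docs = no_above * len(texts)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}


texts = [
    ["workflow", "formula", "date"],
    ["workflow", "formula", "map"],
    ["workflow", "map", "spatial"],
    ["workflow", "regex"],
]
kept = filter_by_doc_frequency(texts, no_below=2, no_above=0.5)
print(sorted(kept))
```

Here "workflow" is dropped for appearing everywhere, and the one-off words are dropped for being too rare to define a topic.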


After all of our filtering, we end up with just under 4000 words to create topics with.


Dictionary(3982 unique tokens: ['accept', 'additional', 'address', 'alteryx_designer', 'aren_t']...)


As the last step before training the LDA model, we need to transform the documents into a vectorized, bag of words (bow) format, using the Gensim doc2bow() function and a list comprehension:


# term document frequency 
corpus = [id2word.doc2bow(text) for text in texts]
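
For intuition about what the dictionary and the bag-of-words conversion produce, here is a plain-Python sketch of both steps. The helpers `build_vocab` and `to_bow` are hypothetical stand-ins, not Gensim functions, and the ids Gensim assigns may differ.

```python
from collections import Counter


def build_vocab(texts):
    """Assign each distinct token an integer id, in order of first appearance."""
    token2id = {}
    for doc in texts:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Convert a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


texts = [["formula", "date", "formula"], ["map", "spatial", "map", "date"]]
token2id = build_vocab(texts)
bows = [to_bow(doc, token2id) for doc in texts]
print(token2id)
print(bows)
```

Each document becomes a short list of (word id, count) pairs, and any word that is not in the dictionary is simply ignored.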


Now that we have finished data pre-processing, we can train our LDA model!


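# build LDA model
num_topics = 25
eval_every = None  # assumption: the original value is not shown; None skips perplexity evaluation during training

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=num_topics, random_state=42, 
                                            update_every=1, chunksize=100, passes=50, alpha='auto', 
                                            per_word_topics=True, eval_every=eval_every, dtype=np.float64)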


Now let's examine the topics that the LDA model found, as well as topic coherence:


# gather topics and top words
top_topics = lda_model.top_topics(corpus, topn=20)

# calculate average topic coherence (top_topics holds one entry per topic)
avg_topic_coherence = sum([t[1] for t in top_topics]) / len(top_topics)
print('Average topic coherence: %.4f' % avg_topic_coherence)

# print results in an attractive format
from pprint import pprint
pprint(top_topics)


The printed top 20 words for each topic:


[([(0.020695580423620193, 'row'),
(0.020011242784347297, 'formula'),
(0.018599296583477785, 'column'),
(0.013588141693289426, 'regex'),
(0.01033067729605757, 'parse'),
(0.010259747704317868, 'fun'),
(0.009542918811638176, 'solution_to_read'),
(0.009325463434261069, 'multi_row'),
(0.00893292850468382, 'formula_tool'),
(0.008684688598017362, 'solve'),
(0.008338891165488293, 'approach'),
(0.008097284494434675, 'filter'),
(0.00780559544745668, 'here_is_my_solution'),
(0.007406627197261226, 'nice'),
(0.0070931028525079536, 'bit'),
(0.006705337772992282, 'match'),
(0.0066906038412741205, 'read_to_read'),
(0.006120161533681971, 'heres_my_solution'),
(0.006071555498838694, 'learn'),
(0.005686126404121368, 'iterative_macro')],

([(0.023529394935676337, 'analyst'),
(0.023293772962139647, 'organization'),
(0.02074950702865459, 'big_data'),
(0.013074900906929829, 'insight'),
(0.012674802447542257, 'analysis'),
(0.012103346278365243, 'data_blend'),
(0.012068491149929726, 'deliver'),
(0.011564303190761377, 'predictive_analytic'),
(0.0094988304433296, 'advanced_analytic'),
(0.008610070755130156, 'today'),
(0.00819131421714101, 'tableau'),
(0.008087252822162536, 'visualization'),
(0.00733170057763311, 'blend'),
(0.0068817112025207645, 'self_service'),
(0.006730189006778379, 'market'),
(0.006678768115331216, 'line_of_business'),
(0.006631729356219946, 'webinar'),
(0.006357341376114067, 'platform'),
(0.006309106114531908, 'enable'),
(0.006306421730745754, 'data_scientist')],

([(0.028397373495646576, 'inspire'),
(0.01753733515424845, 'session'),
(0.009493662577449507, 'conference'),
(0.0074649721965511125, 'event'),
(0.007114559996494287, 'people'),
(0.006712520834152984, 'learn'),
(0.006039972992989308, 'user_group'),
(0.005917436693343427, 'ace'),
(0.005698164530144839, 'love'),
(0.0055362773389305816, 'pm'),
(0.004814769643636668, 'opportunity'),
(0.004803084409792049, 'attendee'),
(0.004776088161133497, 'meet'),
(0.004694738527392643, 'hear'),
(0.004447416654816952, 'track'),
(0.004423394767311015, 'story'),
(0.004388328760054483, 'world'),
(0.004260335477287717, 'blog'),
(0.004216587514543972, 'attend'),
(0.004140876625267237, 'partner')],

([(0.04582962916837095, 'model'),
(0.017801121912602684, 'r'),
(0.0166821773075764, 'variable'),
(0.009665793275730968, 'predict'),
(0.008406022738666993, 'sample'),
(0.008090882943418256, 'dataset'),
(0.008083387569849778, 'score'),
(0.007792287042981745, 'algorithm'),
(0.007045401610432731, 'county'),
(0.006704513469866865, 'plot'),
(0.006655225993652604, 'analysis'),
(0.006461535952912632, 'predictive'),
(0.006436212075801692, 'estimate'),
(0.0057658976121342555, 'measure'),
(0.005606005268008011, 'prediction'),
(0.005597889110967721, 'compare'),
(0.005439905883753477, 'figure'),
(0.005316314897221247, 'method'),
(0.005182810234879585, 'function'),
(0.005119507617503412, 'fit')],

([(0.016435416683044362, 'platform'),
(0.011939336990657867, 'capability'),
(0.01182527739193118, 'project'),
(0.01137750863295518, 'system'),
(0.010172089523477552, 'spark'),
(0.007726167805147196, 'enterprise'),
(0.00760682833419796, 'data_source'),
(0.007526056029785542, 'application'),
(0.007498725999310362, 'release'),
(0.007461574875805607, 'integration'),
(0.0073297643148498215, 'software'),
(0.007235374944897183, 'scale'),
(0.006716181728931385, 'develop'),
(0.006400847988724821, 'enable'),
(0.006381861096018541, 'environment'),
(0.0058660035327072636, 'benefit'),
(0.005836886387815695, 'development'),
(0.005793268613330379, 'alteryx_designer'),
(0.005653141584067736, 'addition'),
(0.005529361747751699, 'analyst')],

([(0.012847527264999938, 'achieve'),
(0.01242290347576571, 'analysis'),
(0.012023107606175467, 'month'),
(0.009944791535134894, 'project'),
(0.009455050549353035, 'dashboard'),
(0.009441557862967754, 'week'),
(0.009334705908028765, 'employee'),
(0.00882888310107117, 'describe_the_benefit'),
(0.008816891219323915, 'excel'),
(0.008672456183038807, 'hour'),
(0.007560416660977143, 'daily'),
(0.007021011232387925, 'system'),
(0.006395068490289318, 'forecast'),
(0.0061956780656895775, 'identify'),
(0.006119358626399816, 'tableau'),
(0.0058087875387583835, 'metric'),
(0.0057460452372290315, 'visualization'),
(0.005389743293058159, 'needed_to_solve'),
(0.005389737583023549, 'describe_the_work'),
(0.0052557832159785415, 'level')],

([(0.019288731425265434, 'exam'),
(0.019233528455285892, 'training'),
(0.015022157539702112, 'student'),
(0.014328955342406169, 'video'),
(0.013779646291934091, 'learn'),
(0.013529429896529171, 'certification'),
(0.012711780634151357, 'course'),
(0.012433822408172989, 'review'),
(0.011349657095873635, 'program'),
(0.010671828902499335, 'pass'),
(0.010129205244743308, 'partner'),
(0.009885398897719236, 'resource'),
(0.008603037213350363, 'designer'),
(0.008566844865150722, 'entry'),
(0.008478492741067089, 'alteryx_designer'),
(0.0065736806320999385, 'guide'),
(0.006572841959007066, 'section'),
(0.006347083619906077, 'content'),
(0.0062897452072069185, 'receive'),
(0.006056886990451579, 'project')],

([(0.022061384916288795, 'column'),
(0.01506781065568456, 'filter'),
(0.010901909596177942, 'null'),
(0.010184214446298063, 'row'),
(0.009825100861470537, 'property'),
(0.0096752280391286, 'box'),
(0.009039938646575745, 'return'),
(0.00870333746467137, 'stream'),
(0.008113928146452686, 'match'),
(0.007238700996815709, 'specify'),
(0.007175289201779777, 'sort'),
(0.0068436021829196845, 'count'),
(0.006643137058434104, 'remove'),
(0.006609407558053242, 'rename'),
(0.006370051182194785, 'choose'),
(0.006369594234147002, 'drop_down'),
(0.006309660514046302, 'default'),
(0.005837265779177482, 'append'),
(0.00565126404674956, 'replace'),
(0.005613422932795999, 'apply')],

([(0.013555887514475881, 'insight'),
(0.012030134358668891, 'sale'),
(0.009743029947684622, 'marketing'),
(0.00846782553470163, 'drive'),
(0.008359318644964689, 'store'),
(0.007673901230956347, 'service'),
(0.007602726307965364, 'improve'),
(0.006852587734199537, 'retail'),
(0.0063246859047269375, 'lead'),
(0.005869905240850228, 'tableau'),
(0.005794704264613193, 'analysis'),
(0.005790765584248087, 'industry'),
(0.005786696319230313, 'survey'),
(0.005377876827751532, 'qlik'),
(0.005324660387428519, 'decision'),
(0.005196028980445149, 'retailer'),
(0.005078400860568244, 'target'),
(0.004994194365520417, 'hour'),
(0.004965243436498029, 'impact'),
(0.004914220593379361, 'identify')],

([(0.027895288990154428, 'rank'),
(0.020017709062860975, 'weekly_challenge'),
(0.015426679699583229, 'week'),
(0.014881599576766805, 'welcome'),
(0.011939536765824732, 'love'),
(0.011620230044398001, 'seanadam'),
(0.011456735159730128, 'awesome'),
(0.011183575166816954, 'badge'),
(0.010509118841686451, 'top'),
(0.010110667462712825, 'patrick_digan'),
(0.009032136645339724, 'intermediate'),
(0.008790114318587821, 'ranking'),
(0.00808137117998476, 'lordneillord'),
(0.007633131862798772, 'everyone'),
(0.007493915379577319, 'analysis'),
(0.007282216928886164, 'range'),
(0.0072166391600903945, 'solve'),
(0.0069630130481704265, 'tableau'),
(0.0069372701948944146, 'generate_row'),
(0.006628733935085583, 'learn')],

([(0.02282930627361656, 'log'),
(0.021290932339173358, 'credential'),
(0.018666786982149318, 'schedule'),
(0.016722711097457068, 'password'),
(0.015931181013013055, 'service'),
(0.015082916244396406, 'designer'),
(0.01375384352871289, 'account'),
(0.010572146355710677, 'window'),
(0.010556573243958955, 'setting'),
(0.010314391908109817, 'admin'),
(0.010026541153535297, 'permission'),
(0.00970915856077299, 'default'),
(0.009202573402148926, 'authentication'),
(0.008927864917945658, 'security'),
(0.008845819783475515, 'collection'),
(0.008602824336617541, 'port'),
(0.008571780912760712, 'machine'),
(0.00814380320750843, 'administrator'),
(0.007836514858975643, 'system'),
(0.007248994014152426, 'scheduler')],

([(0.01802969309662748, 'word'),
(0.017863542201158964, 'module'),
(0.016606304535601107, 'excel'),
(0.008113860546990611, 'sheet'),
(0.007714503263499146, 'letter'),
(0.006281641923357433, 'actually'),
(0.006192337984885741, 'cell'),
(0.005895214074471618, 'browse_tool'),
(0.00589002249826699, 'always'),
(0.0057877700740173385, 'wizard'),
(0.0054475115528324165, 'little'),
(0.005433975323120903, 'step'),
(0.005103032953959196, 'browse'),
(0.004470599360101434, 'excel_file'),
(0.004394869429962214, 'love'),
(0.004184357069916334, 'big'),
(0.004141542195552999, 'import'),
(0.004135197847252612, 'count'),
(0.0041286903438157805, 'next'),
(0.00403341054881167, 'line')],

([(0.03904195664475774, 'connector'),
(0.027253597087080347, 'release'),
(0.025221246928646968, 'page'),
(0.016437391025590206, 'feedback'),
(0.012711546769212946, 'profile'),
(0.011181633281273603, 'beta'),
(0.010297662448368114, 'ill'),
(0.009107671858798487, 'link'),
(0.008319855635100333, 'content'),
(0.007495004190944284, 'message'),
(0.007378890245776566, 'search'),
(0.007110820252288945, 'site'),
(0.006955607854199844, 'fix'),
(0.00691905899633175, 'request'),
(0.006399952284884274, 'suggestion'),
(0.005694276081227593, 'next'),
(0.005574388781942174, 'iteration'),
(0.0053908470928987155, 'loop'),
(0.005064950034618061, 'load'),
(0.005002775014107961, 'send')],

([(0.024545496709857575, 'install'),
(0.01786678939815506, 'python'),
(0.017745668124866384, 'self'),
(0.016724787229990948, 'package'),
(0.015618180606871198, 'folder'),
(0.01390659649233459, 'instal'),
(0.013335132307843339, 'r'),
(0.013210480218332963, 'c'),
(0.012548883568317423, 'xml'),
(0.01216600462702795, 'directory'),
(0.01015407214083796, 'path'),
(0.008122259217342347, 'engine'),
(0.007787437600078699, 'sdk'),
(0.007244828683763851, 'html'),
(0.007080322950206309, 'zip'),
(0.006816460772571588, 'class'),
(0.006728831708004009, 'plugin'),
(0.006392274476091252, 'c_program'),
(0.00635033805846674, 'object'),
(0.006013830355238557, 'script')],

([(0.024659284232081496, 'driver'),
(0.011270367782498587, 'fix'),
(0.008946172622370054, 'hive'),
(0.008903154249793094, 'query'),
(0.008558420786005948, 'sql_server'),
(0.007875338213237493, 'defect'),
(0.007745099306329093, 'behavior'),
(0.007504870201803241, 'odbc'),
(0.007414758641639832, 'prospect'),
(0.007203854366529731, 'oracle'),
(0.007057695452217622, 'alexp'),
(0.007045935826742067, 'appear'),
(0.006927480811936405, 'detail'),
(0.006905649861485216, 'confirm'),
(0.006828666667092499, 'hdfs'),
(0.006801562482748226, 'error_message'),
(0.006728591389620027, 'documentation'),
(0.006652813878068282, 'bug'),
(0.006223445392748984, 'fail'),
(0.00584415838450598, 'log')],

([(0.031758965528068554, 'asset'),
(0.02120560563107027, 'search'),
(0.018082370300320283, 'metadata'),
(0.017181671858312594, 'promote'),
(0.01566719795269047, 'demo'),
(0.015293621646370026, 'loader'),
(0.014495601238535879, 'environment'),
(0.011481341473037828, 'source'),
(0.010053640175506261, 'reference'),
(0.009645354502835565, 'canvas'),
(0.009056033699112547, 'owner'),
(0.008628383771088686, 'description'),
(0.008024030837530904, 'functionality'),
(0.006933964969924857, 'tag'),
(0.006659074331679673, 'document'),
(0.006098293215902485, 'system'),
(0.006048104039401445, 'content'),
(0.005870741728806124, 'admin'),
(0.005788075425153498, 'publish'),
(0.005703387759056851, 'prod')],

([(0.04942393810148589, 'grand_prix'),
(0.02491047581651823, 'image'),
(0.017455811192708934, 'race'),
(0.01615115890870063, 'final'),
(0.015906125843627637, 'city'),
(0.015896683142603542, 'win'),
(0.015703110588676208, 'event'),
(0.015342057665039235, 'compete'),
(0.015339381367534849, 'round'),
(0.013914600476887884, 'ticket'),
(0.013435255630428742, 'user_group'),
(0.011449937536932535, 'trial'),
(0.010579003085890262, 'driver'),
(0.007662748356969642, 'match'),
(0.007585118765414722, 'inspire'),
(0.007532944860979533, 'stage'),
(0.007391580670059886, 'contestant'),
(0.006776712987148872, 'solve'),
(0.006707066534228501, 'winner'),
(0.006590240187135858, 'competitor')],

([(0.02047134108296696, 'default'),
(0.018989227768861354, 'window'),
(0.0170840958703854, 'tab'),
(0.015814793591120604, 'display'),
(0.015174418579969474, 'text'),
(0.014756840136650124, 'expression'),
(0.01440625692612525, 'formula'),
(0.012951323990154933, 'column'),
(0.011737907631913678, 'interface'),
(0.010818014853072988, 'setting'),
(0.010644297031936097, 'size'),
(0.010151009800716898, 'screen'),
(0.00985912936931363, 'color'),
(0.008615780310039454, 'formula_tool'),
(0.008487271348873235, 'icon'),
(0.008427522996014084, 'map'),
(0.008192985857073098, 'annotation'),
(0.008102375512627807, 'render'),
(0.008057803428295417, 'button'),
(0.007711310691824729, 'label')],

([(0.024413632545300868, 'query'),
(0.01995603374894747, 'sql'),
(0.01664329495406438, 'cache'),
(0.014910256733614461, 'functionality'),
(0.013741918852647837, 'https_community'),
(0.010511589974902836, 'load'),
(0.010273205924103402, 'db'),
(0.009946224455696909, 'input_tool'),
(0.008456770623640487, 'batch_macro'),
(0.008423846126333416, 'size'),
(0.00811928466606048, 'best_alex'),
(0.008038399584775742, 'limit'),
(0.007914329598344289, 'request'),
(0.007864903883069664, 'product_idea'),
(0.0073961131032875965, 'memory'),
(0.007024018465519146, 'agree'),
(0.006890446344038307, 'roadmap'),
(0.006761656707479195, 'plan'),
(0.00659176740486827, 'idi_p'),
(0.0064387761790198495, 'comment')],

([(0.02320856572858142, 'email'),
(0.020989964339043227, 'job'),
(0.016200339673233276, 'flow'),
(0.01398790470586787, 'container'),
(0.011043461988997435, 'send'),
(0.009778731618173326, 'control'),
(0.009542223144129763, 'message'),
(0.009472813641874973, 'scheduler'),
(0.008948035119530842, 'disable'),
(0.008546722865525866, 'canvas'),
(0.008390033075692398, 'event'),
(0.007810812320858166, 'schedule'),
(0.007444832809070567, 'log'),
(0.007073789312437564, 'module'),
(0.007019597769235246, 'tool_container'),
(0.006559405832579771, 'comment'),
(0.006074241503789276, 'execution'),
(0.005211194455515212, 'execute'),
(0.00467787101837695, 'path'),
(0.004364314612224897, 'attachment')],

([(0.06665800962342187, 'tableau'),
(0.057376892621773895, 'api'),
(0.018353050984508554, 'download_tool'),
(0.017995436978036915, 'request'),
(0.01701764340446542, 'url'),
(0.01577482094421849, 'tableau_server'),
(0.015383426437480996, 'salesforce'),
(0.012123611951174831, 'extract'),
(0.011203206336149093, 'publish'),
(0.010675425349768139, 'tde'),
(0.010575267728114852, 'connector'),
(0.010256876853115162, 'web'),
(0.008434725136922219, 'data_source'),
(0.008296321308519447, 'response'),
(0.007720550903913831, 'article'),
(0.007714263049381092, 'header'),
(0.007558523695571444, 'publish_to_tableau'),
(0.007535466127962625, 'pull'),
(0.007380845761257233, 'apis'),
(0.007174917619766898, 'application')],

([(0.06369381967122531, 'com'),
(0.03152075330584901, 'https'),
(0.03119725010171633, 'link'),
(0.026943384962905097, 'html'),
(0.02234204179077005, 'http'),
(0.01554187173938344, 'http_www'),
(0.015062105871916816, 'blog'),
(0.014217935940797113, 'https_www'),
(0.012483849683606342, 'microsoft'),
(0.012128598879016987, 'text'),
(0.011920192742918905, 'page'),
(0.009712201768922292, 'kit'),
(0.008985614236123048, 'sa'),
(0.008951883357857227, 'htm'),
(0.008096327439667511, 'site'),
(0.007938546730115492, 'software'),
(0.007625413654930265, 'sap'),
(0.007409083464900269, 'nice'),
(0.007164831816411496, 'twitter'),
(0.006957077460469511, 'guide')],

([(0.0706921447402962, 'date'),
(0.05201292202050065, 'function'),
(0.039226998401143874, 'string'),
(0.02106683226390238, 'character'),
(0.020954387118662175, 'formula'),
(0.017645875273058227, 'convert'),
(0.01297307442273291, 'return'),
(0.011485481010413813, 'json'),
(0.011470698839382974, 'datetime'),
(0.010912881177127244, 'true'),
(0.010246369044114965, 'b'),
(0.010047080499148755, 'text'),
(0.0097305907238132, 'false'),
(0.008813046389350548, 'date_time'),
(0.008748469794279162, 'label'),
(0.008607225527480984, 'sharepoint'),
(0.008380575999835726, 'month'),
(0.008189512957736365, 'delimiter'),
(0.007511947118864771, 'c'),
(0.007100720793213718, 'formula_tool')],

([(0.05308197722717732, 'spatial'),
(0.03574348000224324, 'map'),
(0.027175970447702754, 'location'),
(0.02294647018279071, 'distance'),
(0.018036792284405058, 'polygon'),
(0.01577737622313117, 'area'),
(0.014789113213357059, 'spatial_object'),
(0.012250661440205642, 'line'),
(0.011913185239509499, 'spatial_tool'),
(0.01062605325303253, 'store'),
(0.010070021021823854, 'trade_area'),
(0.008825127591926803, 'segment'),
(0.00823172841914915, 'mile'),
(0.00813278539150798, 'calculate'),
(0.00799272522992367, 'layer'),
(0.007124261409059375, 'fuzzy_match'),
(0.006655820338436589, 'grid'),
(0.006186494652726671, 'coordinate'),
(0.006123266625077015, 'centroid'),
(0.0061017099573550644, 'determine')],

([(0.03721718230896204, 'license'),
(0.02648351944300455, 'address'),
(0.013315170047721132, 'install'),
(0.01162167847634791, 'machine'),
(0.010676755247336793, 'cass'),
(0.010653426567294764, 'system'),
(0.00999560547856418, 'node'),
(0.00976910335386901, 'instal'),
(0.009740523527010787, 'mongodb'),
(0.008505446726042622, 'scale'),
(0.008344881667224788, 'core'),
(0.0074009993168286355, 'deployment'),
(0.007283976747644404, 'q'),
(0.006626330850762665, 'instance'),
(0.006175828207319153, 'designer'),
(0.006123121926560843, 'upgrade'),
(0.006062543451688333, 'geocoder'),
(0.005988684649067714, 'worker'),
(0.005629863956467686, 'mongo'),
(0.005581365691623821, 'engine')],



The following code creates the really slick interactive visualization included below it:


# Visualize the topics
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)

pyLDAvis.save_html(vis, 'LDA_Visualization.html')

Evaluation and Interpretation


Because LDA is an unsupervised algorithm, there is not an inherent way to evaluate the model. Often, the best bet is to evaluate how the unsupervised model improves the task it is actually going to be applied to. Another option to assess topic coherence includes adding an extra word from a different topic to a group of words from a topic, or adding an extra (low probability) topic to a document, and playing "one of these things is not like the other." If you would like to read more about this method for evaluating topic models, there is a really great (and fabulously titled) article called Reading Tea Leaves: How Humans Interpret Topic Models.
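
The word-intrusion version of that game is easy to set up in code. The sketch below takes the top words from one topic, slips in a word from another topic, and shuffles the result so a person can try to spot the intruder. The words are taken from two of the topics printed above, while the helper `make_intrusion_task` and its parameters are hypothetical.

```python
import random


def make_intrusion_task(topic_words, other_topic_words, n_words=5, seed=0):
    """Build one word-intrusion question: top words from a topic plus one intruder."""
    rng = random.Random(seed)
    top = topic_words[:n_words]
    # the intruder must come from the other topic and not appear in this one
    candidates = [w for w in other_topic_words if w not in topic_words]
    intruder = rng.choice(candidates)
    shown = top + [intruder]
    rng.shuffle(shown)
    return shown, intruder


spatial_topic = ["spatial", "map", "location", "distance", "polygon", "area"]
security_topic = ["log", "credential", "schedule", "password", "service"]
shown, intruder = make_intrusion_task(spatial_topic, security_topic)
print(shown)  # ask a person to pick the word that does not belong
```

If people reliably pick out the intruder, the topic is probably coherent; if they guess at random, the topic may not mean much to a human reader.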


To wrap up this blog post, we will run the text of the blog (that I've written up to this point) through the LDA model, and see what its topic composition is.


# import article to evaluate
with open('blogtext.txt', 'r') as myfile:
    data = myfile.read().replace('\n', '')

# match pre-processing for LDA model
stripped_tokens = [x for x in re.sub(r'\s_\s', " ", re.sub(r'[\.]', " ",
                   re.sub(r'[\s+]', " ", re.sub(r'[-]', "_", re.sub(r'[\[\]\\\"\'\|,:@#?!$"%&()*+=><^~`{};/@]', "",
                   data.lower()))))).split() if x != " "]

# make bigrams (wrapped in a list because lemmatization() expects a list of documents)
data_bigrams = [bigram[stripped_tokens]]

# lemmatize model
data_lem = lemmatization(data_bigrams, allowed_postages=['NOUN', 'ADJ', 'VERB', 'ADV'])

# flatten document
flat_list = [item for sublist in data_lem for item in sublist]

# convert to vector format
bow_vector = id2word.doc2bow(flat_list)

# apply LDA model (get_document_topics returns only the (topic, proportion) pairs)
lda_vector = lda_model.get_document_topics(bow_vector)


The return: 


[(2, 0.10312072882784949), 
(3, 0.3854467963314416),
(5, 0.3340693473637106),
(12, 0.09771771437274225),
(16, 0.05396271026296574),
(22, 0.01582212076466559)]


As humans, we tend to want to apply labels to things. It's part of what helps create context for the world around us. Given this is a topic model, it might feel really intuitive and, well, obvious to immediately apply labels to the topics the LDA model has found. However, this is not always the most productive thing to do because the topics created by the LDA model are not necessarily human-readable, and they may mean more than what we interpret them to mean. Many applications of LDA models don't require topic labels, and rather compare articles based on similar compositions of the (unlabeled) topics.
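
As a small illustration of comparing articles without labeling anything, the sketch below measures how far apart two topic compositions are using the Hellinger distance, one common choice for comparing probability distributions. The three example compositions are invented for this sketch, and the helper names are hypothetical.

```python
import math


def to_dense(sparse, num_topics):
    """Expand (topic_id, proportion) pairs into a full-length list of proportions."""
    vec = [0.0] * num_topics
    for topic_id, prob in sparse:
        vec[topic_id] = prob
    return vec


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 means identical)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)


num_topics = 25
# made-up topic compositions for three hypothetical articles
doc_a = to_dense([(3, 0.6), (5, 0.4)], num_topics)
doc_b = to_dense([(3, 0.5), (5, 0.5)], num_topics)
doc_c = to_dense([(22, 1.0)], num_topics)

print(round(hellinger(doc_a, doc_b), 3))  # similar compositions, small distance
print(round(hellinger(doc_a, doc_c), 3))  # no shared topics, large distance
```

No topic ever needs a name for this to work: two articles are "about the same thing" if they draw on the same topics in similar proportions.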


With all of that in mind, I will leave the interpretation of these topics up to you. In the return for the LDA model of this article, the first number indicates the topic label (which corresponds to the topic numbers in the interactive visualization), and the second number is the relative proportion of words that belong to the topic in this post (e.g., Topic 3 (which I have interpreted as a predictive topic), makes up ~39% of this post).
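
If the raw list of tuples is hard to scan, a few lines of Python turn it into a ranked, percentage-based summary. The values below are copied from the return shown above; the formatting choices are just one way to do it.

```python
lda_vector = [
    (2, 0.10312072882784949),
    (3, 0.3854467963314416),
    (5, 0.3340693473637106),
    (12, 0.09771771437274225),
    (16, 0.05396271026296574),
    (22, 0.01582212076466559),
]

# rank topics from largest to smallest share of the post
ranked = sorted(lda_vector, key=lambda pair: pair[1], reverse=True)
for topic_id, share in ranked:
    print(f"Topic {topic_id:>2}: {share:.1%}")

# the listed shares do not quite reach 100% because very small topics are not listed
covered = sum(share for _, share in lda_vector)
print(f"Listed topics cover {covered:.1%} of the post")
```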


What do you think? Does the topic distribution of this article make sense to you? Please remember to interpret and apply labels with care :)


If you would like additional code-based examples of LDA, there are a couple of really cool step-by-step examples of LDA in Python with Gensim that you can work through here and here. If you’re more of an R soul, there are a couple of great resources for LDA topic modeling here and here.

Sydney Firmin

A geographer by training and a data geek at heart, Sydney joined the Alteryx team as a Customer Support Engineer in 2017. She strongly believes that data and knowledge are most valuable when they can be clearly communicated and understood. In her current role as a Community Content Engineer, she gets to spend her days doing what she loves best: transforming technical knowledge and research into engaging, creative, and fun content for the Alteryx Community.
