Data Science

Machine learning & data science for beginners and experts alike.


Naive Bayes classification is a simple, yet effective algorithm. It's commonly used in things like text analytics and works well on both small datasets and massively scaled out, distributed systems.


How does it work?


Naive Bayes is based on, you guessed it, Bayes' theorem. Think back to your first statistics class. Bayes' theorem was that seemingly counterintuitive lecture on conditional probability.



 I've tried to buy this on multiple occasions, but to no avail :(. Please email me if you know where you can buy it


Bayes' Theorem
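In symbols, Bayes' theorem reads:

```latex
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
```

In words: the probability of A given B equals the probability of B given A, times the probability of A, divided by the probability of B.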


The neon formula above might look intimidating, but it's actually not that complicated. To explain it, instead of using "events A and B", I'm going to use something a little more familiar. Let's say the two events in question are:


A) I watched The Lego Movie today
B) I sat on the couch today




So for my 2 events, let's break it down into its Bayesian components:


I've seen The Lego Movie 10 times in the past 2 months--this is a lot, I know. I've been lucky enough that it's been playing on almost every plane I've been on (as it should be! it's a great movie for both adults and kids). Since I've watched The Lego Movie 10 out of the last 60 days, we'll say that:


P(A) = P(I watched The Lego Movie today) = 10 / 60, or ~0.17


I sit on the couch most days I'm at my apartment. I've traveled 14 days in the past 2 months, so to keep it simple, we'll assume I sat on my couch at least once on every other day (hey, it's pretty comfy).


P(B) = P(I sat on the couch today) = (60 - 14) / 60, or ~0.76


I've seen The Lego Movie 10 times and 4 of those times have been on a plane. I think it's pretty safe to assume the rest of those times I was seated comfortably in my living room. So given that I've had 46 days of couchtime in the past 2 months, we can say that I watched The Lego Movie from my couch 6 / 10 times.


P(B|A) = P(I sat on the couch given that I watched The Lego Movie) = 6 / 10 = 0.60


Ok, ready for the magic! Using Bayes' theorem, I now have everything I need to calculate the probability that I watched The Lego Movie today given that I sat on the couch.


P(A|B) = P(B|A) * P(A) / P(B) = (0.60 * 0.17) / 0.76


P(I watched The Lego Movie given that I sat on the couch) = 0.13


And voilà! Given that I sat on the couch today, there is a 13% chance that I also watched The Lego Movie (wow, that's a lot of Lego time).
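The arithmetic above is easy to double-check in a few lines of Python, using the same numbers from the example:

```python
# probabilities from the example above
p_a = 10 / 60          # P(I watched The Lego Movie today)
p_b = 46 / 60          # P(I sat on the couch today)
p_b_given_a = 6 / 10   # P(couch | Lego Movie)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 2))  # 0.13
```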


Now I wonder what the probability of me watching The Lego Movie from a double decker couch would be?




Why should I use it?


Where you see Naive Bayes classifiers pop up a lot is in document classification. Naive Bayes is a great choice for this because it's pretty fast, it can handle a large number of features (i.e. words), and it's actually really effective. Take a look at what happens when you do some basic benchmarking between Naive Bayes and other methods like SVM and Random Forest against the 20 Newsgroups dataset. Naive Bayes wins! Granted, this is a relatively simple approach without much in the way of feature engineering, but in my opinion that's part of the beauty of Naive Bayes!


Code for benchmarking is available here.


Document Classification


For our example we're going to be attempting to classify whether a wikipedia page is referring to a dinosaur or a cryptid (an animal from cryptozoology--think Loch Ness Monster or Bigfoot).

 My favorite Yeti, The Bumble, from the stop-motion holiday classic Rudolph the Red-nosed Reindeer


We'll be using the text from each wikipedia article as features. What we'd expect is that certain words like "sighting" or "hoax" would be more commonly found in articles about cryptozoology, while words like "fossil" would be more commonly found in articles about dinosaurs.


We'll do some basic word-tokenization to count the occurrences of each word and then calculate conditional probabilities for each word as it pertains to our 2 categories.


You can find the sample documents I used and the corresponding code here.


Tokenizing and counting


First things first. We need to turn our files full of text into something a little more mathy. The simplest way to do this is to take the bag of words approach. That just means we'll be counting how many times each word appears in each document. We'll also perform a little text normalization by removing punctuation and lowercasing the text (this means "Hello," and "hello" will now be considered the same word).


Once we've cleaned the text, we need a way to delineate words. A simple approach is to just use a good 'ole regex that splits on whitespace and punctuation: \W+.

import re
import string

def remove_punctuation(s):
    # strip all punctuation characters from the string
    table = str.maketrans("", "", string.punctuation)
    return s.translate(table)

def tokenize(text):
    text = remove_punctuation(text)
    text = text.lower()
    return re.split(r"\W+", text)

def count_words(words):
    wc = {}
    for word in words:
        wc[word] = wc.get(word, 0.0) + 1.0
    return wc

s = "Hello my name, is Greg. My favorite food is pizza."
print(count_words(tokenize(s)))
# {'hello': 1.0, 'my': 2.0, 'name': 1.0, 'is': 2.0, 'greg': 1.0, 'favorite': 1.0, 'food': 1.0, 'pizza': 1.0}

Calculating our probabilities


So now that we can count words, let's get cooking. The code below is going to do the following:


  • open each document
  • label it as either "crypto" or "dino" and keep track of how many of each label there are (priors)
  • count the words for the document
  • add those counts to the vocab, or a corpus level word count
  • add those counts to the word_counts, for a category level word count
from sh import find

# setup some structures to store our data
vocab = {}
word_counts = {
    "crypto": {},
    "dino": {}
}
priors = {
    "crypto": 0.,
    "dino": 0.
}
docs = []
for f in find("sample-data"):
    f = f.strip()
    if not f.endswith(".txt"):
        # skip non .txt files
        continue
    elif "cryptid" in f:
        category = "crypto"
    else:
        category = "dino"
    docs.append((category, f))
    # ok time to start counting stuff...
    priors[category] += 1
    text = open(f).read()
    words = tokenize(text)
    counts = count_words(words)
    for word, count in counts.items():
        # if we haven't seen a word yet, let's add it to our dictionaries with a count of 0
        if word not in vocab:
            vocab[word] = 0.0 # use 0.0 here so the counts stay floats
        if word not in word_counts[category]:
            word_counts[category][word] = 0.0
        vocab[word] += count
        word_counts[category][word] += count

Classifying a new page


And finally it's time for the math. We're going to use the word counts we calculated in the previous step to calculate the following:


Prior Probability for each category, or for the layman, the percentage of documents that belong to each category. We have 9 crypto docs and 8 dino docs, so that gives us the following:


Prior Prob(crypto) = 9 / (8 + 9) = 0.53


Prior Prob(dino) = 8 / (8 + 9) = 0.47
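As a minimal sketch, the priors are just the per-category document counts normalized by the total:

```python
# document counts per category, as given above
doc_counts = {"crypto": 9, "dino": 8}
total = sum(doc_counts.values())

prior_crypto = doc_counts["crypto"] / total
prior_dino = doc_counts["dino"] / total
print(round(prior_crypto, 2), round(prior_dino, 2))  # 0.53 0.47
```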


Ok priors, check. The next thing we need are conditional probabilities for the words in the document we're trying to classify. How do we do that? Well we start by doing a word count on a new document. We'll use the Yeti page as our new document.

new_doc = open("Yeti.txt").read()
words = tokenize(new_doc)
counts = count_words(words)

Alright, we've got our counts. Now we'll calculate P(word|category) for each word and multiply each of these conditional probabilities together to calculate the P(category|set of words). To prevent computational errors (multiplying lots of tiny probabilities underflows to zero in floating point), we're going to perform the operations in logspace. All this means is we're going to add log(probability) values instead of multiplying the raw probabilities. More on the mystical properties of logs here and here.
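To see why logspace helps, here's a quick, made-up demonstration of the underflow problem:

```python
import math

# multiplying many tiny probabilities underflows to zero in floating point...
probs = [1e-20] * 20
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 -- the true value (1e-400) is too small to represent

# ...but summing their logs keeps the number comfortably in range
log_total = sum(math.log(p) for p in probs)
print(log_total)  # about -921.03
```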

import math

prior_dino = priors["dino"] / sum(priors.values())
prior_crypto = priors["crypto"] / sum(priors.values())

log_prob_crypto = 0.0
log_prob_dino = 0.0
for w, cnt in counts.items():
    # skip words that we haven't seen before, or words less than 3 letters long
    if w not in vocab or len(w) <= 3:
        continue
    # calculate the probability that the word occurs at all
    p_word = vocab[w] / sum(vocab.values())
    # for both categories, calculate P(word|category), or the probability a 
    # word will appear, given that we know that the document is <category>
    p_w_given_dino = word_counts["dino"].get(w, 0.0) / sum(word_counts["dino"].values())
    p_w_given_crypto = word_counts["crypto"].get(w, 0.0) / sum(word_counts["crypto"].values())
    # add the new probability to our running total: log_prob_<category>. if the
    # probability is 0 (i.e. the word never appears for the category), then skip it
    if p_w_given_dino > 0:
        log_prob_dino += math.log(cnt * p_w_given_dino / p_word)
    if p_w_given_crypto > 0:
        log_prob_crypto += math.log(cnt * p_w_given_crypto / p_word)

# print out the results; we need to go from logspace back to "regular" space,
# so we take the exp of the log_prob (don't believe me? try this: math.exp(math.log(10) + math.log(3)))
print("Score(dino)  :", math.exp(log_prob_dino + math.log(prior_dino)))
print("Score(crypto):", math.exp(log_prob_crypto + math.log(prior_crypto)))
# Score(dino)  : 2601.76647058
# Score(crypto): 25239.089932

Since we're slightly bending the rules of Bayes' Theorem, the results are not actual probabilities, but rather "scores". All you really need to know is which one is bigger. So our suspicions are confirmed: the "Yeti.txt" file is being classified overwhelmingly in favor of crypto (as we would hope).



 Bringing it home, the LEGO Yeti!


Final Thoughts


You can find all the code and documents used in this post on GitHub.


Naive Bayes is great because it's fairly easy to see what's going on under the hood. It's a great way to start any text analysis and it can easily scale out of core to work in a distributed environment. There are some excellent implementations in the Python community you can use as well, so if you don't want to roll your own, have no fear! The scikit-learn and nltk versions are great places to start.
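For example, a minimal scikit-learn version of the same idea might look like the sketch below. The tiny corpus here is made up just to show the API; in practice you'd feed in the article text and labels from the previous steps.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# hypothetical stand-ins for the wikipedia articles
docs = [
    "fossil bones found at the dig site",
    "fossil skeleton displayed in the museum",
    "blurry photo of a sighting near the lake",
    "the famous footage was revealed as a hoax sighting",
]
labels = ["dino", "dino", "crypto", "crypto"]

# CountVectorizer gives us the bag-of-words counts; MultinomialNB does the Bayes math
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)

print(model.predict(["a new sighting was reported, likely a hoax"]))  # ['crypto']
```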