Data Science from Scratch: First Principles with Python (2015)

Chapter 13. Naive Bayes

It is well for the heart to be naive and for the mind not to be.

Anatole France

A social network isn’t much good if people can’t network. Accordingly, DataSciencester has a popular feature that allows members to send messages to other members. And while most of your members are responsible citizens who send only well-received “how’s it going?” messages, a few miscreants persistently spam other members about get-rich schemes, no-prescription-required pharmaceuticals, and for-profit data science credentialing programs. Your users have begun to complain, and so the VP of Messaging has asked you to use data science to figure out a way to filter out these spam messages.

A Really Dumb Spam Filter

Imagine a “universe” that consists of receiving a message chosen randomly from all possible messages. Let S be the event “the message is spam” and V be the event “the message contains the word viagra.” Then Bayes’s Theorem tells us that the probability that the message is spam conditional on containing the word viagra is:

The numerator is the probability that a message is spam and contains viagra, while the denominator is just the probability that a message contains viagra. Hence you can think of this calculation as simply representing the proportion of viagra messages that are spam.

If we have a large collection of messages we know are spam, and a large collection of messages we know are not spam, then we can easily estimate  and . If we further assume that any message is equally likely to be spam or not-spam (so that ), then:

For example, if 50% of spam messages have the word viagra, but only 1% of nonspam messages do, then the probability that any given viagra-containing email is spam is:

 

A More Sophisticated Spam Filter

Imagine now that we have a vocabulary of many words . To move this into the realm of probability theory, we’ll write  for the event “a message contains the word .” Also imagine that (through some unspecified-at-this-point process) we’ve come up with an estimate  for the probability that a spam message contains the ith word, and a similar estimate  for the probability that a nonspam message contains the ith word.

The key to Naive Bayes is making the (big) assumption that the presences (or absences) of each word are independent of one another, conditional on a message being spam or not. Intuitively, this assumption means that knowing whether a certain spam message contains the word “viagra” gives you no information about whether that same message contains the word “rolex.” In math terms, this means that:

This is an extreme assumption. (There’s a reason the technique has “naive” in its name.) Imagine that our vocabulary consists only of the words “viagra” and “rolex,” and that half of all spam messages are for “cheap viagra” and that the other half are for “authentic rolex.” In this case, the Naive Bayes estimate that a spam message contains both “viagra” and “rolex” is:

since we’ve assumed away the knowledge that “viagra” and “rolex” actually never occur together. Despite the unrealisticness of this assumption, this model often performs well and is used in actual spam filters.

The same Bayes’s Theorem reasoning we used for our “viagra-only” spam filter tells us that we can calculate the probability a message is spam using the equation:

The Naive Bayes assumption allows us to compute each of the probabilities on the right simply by multiplying together the individual probability estimates for each vocabulary word.

In practice, you usually want to avoid multiplying lots of probabilities together, to avoid a problem called underflow, in which computers don’t deal well with floating-point numbers that are too close to zero. Recalling from algebra that  and that, we usually compute  as the equivalent (but floating-point-friendlier):

The only challenge left is coming up with estimates for  and , the probabilities that a spam message (or nonspam message) contains the word . If we have a fair number of “training” messages labeled as spam and not-spam, an obvious first try is to estimate  simply as the fraction of spam messages containing word .

This causes a big problem, though. Imagine that in our training set the vocabulary word “data” only occurs in nonspam messages. Then we’d estimate . The result is that our Naive Bayes classifier would always assign spam probability 0 to any message containing the word “data,” even a message like “data on cheap viagra and authentic rolex watches.” To avoid this problem, we usually use some kind of smoothing.

In particular, we’ll choose a pseudocount  k — and estimate the probability of seeing the ith word in a spam as:

Similarly for . That is, when computing the spam probabilities for the ith word, we assume we also saw k additional spams containing the word and k additional spams not containing the word.

For example, if “data” occurs in 0/98 spam documents, and if k is 1, we estimate  as 1/100 = 0.01, which allows our classifier to still assign some nonzero spam probability to messages that contain the word “data.”

 

Implementation

Now we have all the pieces we need to build our classifier. First, let’s create a simple function to tokenize messages into distinct words. We’ll first convert each message to lowercase; use re.findall() to extract “words” consisting of letters, numbers, and apostrophes; and finally use set()to get just the distinct words:

def tokenize(message):
    message = message.lower()                       # convert to lowercase
    all_words = re.findall("[a-z0-9']+", message)   # extract the words
    return set(all_words)                           # remove duplicates

Our second function will count the words in a labeled training set of messages. We’ll have it return a dictionary whose keys are words, and whose values are two-element lists [spam_count, non_spam_count] corresponding to how many times we saw that word in both spam and nonspam messages:

def count_words(training_set):
    """training set consists of pairs (message, is_spam)"""
    counts = defaultdict(lambda: [0, 0])
    for message, is_spam in training_set:
        for word in tokenize(message):
            counts[word][0 if is_spam else 1] += 1
    return counts

Our next step is to turn these counts into estimated probabilities using the smoothing we described before. Our function will return a list of triplets containing each word, the probability of seeing that word in a spam message, and the probability of seeing that word in a nonspam message:

def word_probabilities(counts, total_spams, total_non_spams, k=0.5):
    """turn the word_counts into a list of triplets
    w, p(w | spam) and p(w | ~spam)"""
    return [(w,
             (spam + k) / (total_spams + 2 * k),
             (non_spam + k) / (total_non_spams + 2 * k))
             for w, (spam, non_spam) in counts.iteritems()]

The last piece is to use these word probabilities (and our Naive Bayes assumptions) to assign probabilities to messages:

def spam_probability(word_probs, message):
    message_words = tokenize(message)
    log_prob_if_spam = log_prob_if_not_spam = 0.0
 
    # iterate through each word in our vocabulary
    for word, prob_if_spam, prob_if_not_spam in word_probs:
 
        # if *word* appears in the message,
        # add the log probability of seeing it
        if word in message_words:
            log_prob_if_spam += math.log(prob_if_spam)
            log_prob_if_not_spam += math.log(prob_if_not_spam)
 
        # if *word* doesn't appear in the message
        # add the log probability of _not_ seeing it
        # which is log(1 - probability of seeing it)
        else:
            log_prob_if_spam += math.log(1.0 - prob_if_spam)
            log_prob_if_not_spam += math.log(1.0 - prob_if_not_spam)
 
    prob_if_spam = math.exp(log_prob_if_spam)
    prob_if_not_spam = math.exp(log_prob_if_not_spam)
    return prob_if_spam / (prob_if_spam + prob_if_not_spam)

We can put this all together into our Naive Bayes Classifier:

class NaiveBayesClassifier:
 
    def __init__(self, k=0.5):
        self.k = k
        self.word_probs = []
 
    def train(self, training_set):
 
        # count spam and non-spam messages
        num_spams = len([is_spam
                         for message, is_spam in training_set
                         if is_spam])
        num_non_spams = len(training_set) - num_spams
 
        # run training data through our "pipeline"
        word_counts = count_words(training_set)
        self.word_probs = word_probabilities(word_counts,
                                             num_spams,
                                             num_non_spams,
                                             self.k)
 
    def classify(self, message):
        return spam_probability(self.word_probs, message)

Testing Our Model

A good (if somewhat old) data set is the SpamAssassin public corpus. We’ll look at the files prefixed with 20021010. (On Windows, you might need a program like 7-Zip to decompress and extract them.)

After extracting the data (to, say, C:\spam) you should have three folders: spam, easy_ham, and hard_ham. Each folder contains many emails, each contained in a single file. To keep things really simple, we’ll just look at the subject lines of each email.

How do we identify the subject line? Looking through the files, they all seem to start with “Subject:”. So we’ll look for that:

import glob, re
 
# modify the path with wherever you've put the files
path = r"C:\spam\*\*"
 
data = []
 
# glob.glob returns every filename that matches the wildcarded path
for fn in glob.glob(path):
    is_spam = "ham" not in fn
 
    with open(fn,'r') as file:
        for line in file:
            if line.startswith("Subject:"):
                # remove the leading "Subject: " and keep what's left
                subject = re.sub(r"^Subject: ", "", line).strip()
                data.append((subject, is_spam))

Now we can split the data into training data and test data, and then we’re ready to build a classifier:

random.seed(0)      # just so you get the same answers as me
train_data, test_data = split_data(data, 0.75)
 
classifier = NaiveBayesClassifier()
classifier.train(train_data)

And then we can check how our model does:

# triplets (subject, actual is_spam, predicted spam probability)
classified = [(subject, is_spam, classifier.classify(subject))
              for subject, is_spam in test_data]
 
# assume that spam_probability > 0.5 corresponds to spam prediction
# and count the combinations of (actual is_spam, predicted is_spam)
counts = Counter((is_spam, spam_probability > 0.5)
                 for _, is_spam, spam_probability in classified)

This gives 101 true positives (spam classified as “spam”), 33 false positives (ham classified as “spam”), 704 true negatives (ham classified as “ham”), and 38 false negatives (spam classified as “ham”). This means our precision is 101 / (101 + 33) = 75%, and our recall is 101 / (101 + 38) = 73%, which are not bad numbers for such a simple model.

It’s also interesting to look at the most misclassified:

# sort by spam_probability from smallest to largest
classified.sort(key=lambda row: row[2])
 
# the highest predicted spam probabilities among the non-spams
spammiest_hams = filter(lambda row: not row[1], classified)[-5:]
 
# the lowest predicted spam probabilities among the actual spams
hammiest_spams = filter(lambda row: row[1], classified)[:5]

The two spammiest hams both have the words “needed” (77 times more likely to appear in spam), “insurance” (30 times more likely to appear in spam), and “important” (10 times more likely to appear in spam).

The hammiest spam is too short (“Re: girls”) to make much of a judgment, and the second-hammiest is a credit card solicitation most of whose words weren’t in the training set.

We can similarly look at the spammiest words:

def p_spam_given_word(word_prob):
    """uses bayes's theorem to compute p(spam | message contains word)"""
 
    # word_prob is one of the triplets produced by word_probabilities
    word, prob_if_spam, prob_if_not_spam = word_prob
    return prob_if_spam / (prob_if_spam + prob_if_not_spam)
 
words = sorted(classifier.word_probs, key=p_spam_given_word)
 
spammiest_words = words[-5:]
hammiest_words = words[:5]

The spammiest words are “money,” “systemworks,” “rates,” “sale,” and “year,” all of which seem related to trying to get people to buy things. And the hammiest words are “spambayes,” “users,” “razor,” “zzzzteana,” and “sadev,” most of which seem related to spam prevention, oddly enough.

How could we get better performance? One obvious way would be to get more data to train on. There are a number of ways to improve the model as well. Here are some possibilities that you might try:

§  Look at the message content, not just the subject line. You’ll have to be careful how you deal with the message headers.

§  Our classifier takes into account every word that appears in the training set, even words that appear only once. Modify the classifier to accept an optional min_count threshhold and ignore tokens that don’t appear at least that many times.

§  The tokenizer has no notion of similar words (e.g., “cheap” and “cheapest”). Modify the classifier to take an optional stemmer function that converts words to equivalence classes of words. For example, a really simple stemmer function might be:

§  def drop_final_s(word):
    return re.sub("s$", "", word)


Creating a good stemmer function is hard. People frequently use the Porter Stemmer.

§  Although our features are all of the form “message contains word ,” there’s no reason why this has to be the case. In our implementation, we could add extra features like “message contains a number” by creating phony tokens like contains:number and modifying the tokenizer to emit them when appropriate.

For Further Exploration

§  Paul Graham’s articles “A Plan for Spam” and “Better Bayesian Filtering” (are interesting and) give more insight into the ideas behind building spam filters.

§  scikit-learn contains a BernoulliNB model that implements the same Naive Bayes algorithm we implemented here, as well as other variations on the model.