Clojure Data Analysis Cookbook, Second Edition (2015)

Chapter 10. Working with Unstructured and Textual Data

In this chapter, we will cover the following recipes:

·        Tokenizing text

·        Finding sentences

·        Focusing on content words with stoplists

·        Getting document frequencies

·        Scaling document frequencies by document size

·        Scaling document frequencies with TF-IDF

·        Finding people, places, and things with Named Entity Recognition

·        Mapping documents to a sparse vector space representation

·        Performing topic modeling with MALLET

·        Performing naïve Bayesian classification with MALLET

Introduction

We've been talking about all of the data that's out there in the world. However, structured or semistructured data—the kind you'd find in spreadsheets or in tables on web pages—is vastly overshadowed by the unstructured data that's being produced. This includes news articles, blog posts, tweets, Hacker News discussions, Stack Overflow questions and answers, and any other natural-language text, which seems to be generated by the petabyte daily.

This unstructured content contains information. It has rich, subtle, and nuanced data, but getting at it is difficult. In this chapter, we'll explore some ways to get some of the information out of unstructured data. It won't be fully nuanced and it will be very rough, but it's a start. We've already looked at how to acquire textual data: in Chapter 1, Importing Data for Analysis, we covered this in the Scraping textual data from web pages recipe. Still, the Web is going to be your best source for data.

Tokenizing text

Before we can do any real analysis of a text or a corpus of texts, we have to identify the words in it. This process is called tokenization. Its output is a list of the words, and possibly the punctuation, in a text. This is different from tokenizing formal languages such as programming languages: it is meant to work with natural languages, and its results are less structured.

It's easy to write your own tokenizer, but there are a lot of edge and corner cases to take into consideration. It's also easy to pull in a natural language processing (NLP) library that provides one or more tokenizers. In this recipe, we'll use OpenNLP (http://opennlp.apache.org/) and its Clojure wrapper, clojure-opennlp (https://clojars.org/clojure-opennlp).

Getting ready

We'll need to include clojure-opennlp in our project.clj file:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"

  :dependencies [[org.clojure/clojure "1.6.0"]

                 [clojure-opennlp "0.3.2"]])

We will also need to require it into the current namespace, as follows:

(require '[opennlp.nlp :as nlp])

Finally, we'll download a model for a statistical tokenizer. I downloaded all of the files from http://opennlp.sourceforge.net/models-1.5/. I then saved them into models/.

How to do it…

In order to tokenize a document, we'll need to first create the tokenizer. We can do this by loading the model:

(def tokenize (nlp/make-tokenizer "models/en-token.bin"))

Then, we use it by passing a string to this tokenizer object:

user=> (tokenize "This is a string.")

["This" "is" "a" "string" "."]

user=> (tokenize "This isn't a string.")

["This" "is" "n't" "a" "string" "."]

How it works…

In OpenNLP, tokenizers are statistically trained to identify the tokens of a particular language, based on the texts they were trained on. The en-token.bin file contains the information for a tokenizer trained on English. In the second example of the previous section, we can see that it correctly pulls the contracted not away from the base word, is.

Once we load this data back into a tokenizer, we can use it again to pull the tokens out.

The main catch is that the language used to generate the model data has to match the language in the input string that we're attempting to tokenize.
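For example, if you have also downloaded the German model from the same page (de-token.bin, saved alongside the English one), you can build a second tokenizer for German text. This is just a sketch; the exact tokens you get back depend on the model:

(def tokenize-de (nlp/make-tokenizer "models/de-token.bin"))

(tokenize-de "Das ist kein englischer Satz.")
;; => something like ["Das" "ist" "kein" "englischer" "Satz" "."]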

Finding sentences

Words (tokens) aren't the only structures that we're interested in, however. Another interesting and useful grammatical structure is the sentence. In this recipe, we'll use a process similar to the one in the previous recipe, Tokenizing text, to create a function that pulls sentences from a string, the same way tokenize pulled tokens.

Getting ready

We'll need to include clojure-opennlp in our project.clj file:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"

  :dependencies [[org.clojure/clojure "1.6.0"]

                 [clojure-opennlp "0.3.2"]])

We will also need to require it into the current namespace:

(require '[opennlp.nlp :as nlp])

Finally, we'll download a model for a statistical sentence splitter. I downloaded en-sent.bin from http://opennlp.sourceforge.net/models-1.5/. I then saved it into models/en-sent.bin.

How to do it…

As in the Tokenizing text recipe, we will start by loading the sentence identification model data, as shown here:

(def get-sentences

  (nlp/make-sentence-detector "models/en-sent.bin"))

Now, we use that data to split a text into a series of sentences, as follows:

user=> (get-sentences "I never saw a Purple Cow.

           I never hope to see one.

           But I can tell you, anyhow.

           I'd rather see than be one.")

 ["I never saw a Purple Cow."

  "I never hope to see one."

  "But I can tell you, anyhow."

  "I'd rather see than be one."]

How it works…

The data model in models/en-sent.bin contains the information that OpenNLP needs to recreate a previously trained sentence identification algorithm. Once we have reinstantiated this algorithm, we can use it to identify the sentences in a text, as we did by calling get-sentences.
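Since get-sentences and tokenize are both ordinary Clojure functions, they compose naturally. As a minimal sketch, this tokenizes each detected sentence separately:

user=> (map tokenize
            (get-sentences "I never saw a Purple Cow. I never hope to see one."))
(["I" "never" "saw" "a" "Purple" "Cow" "."]
 ["I" "never" "hope" "to" "see" "one" "."])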

Focusing on content words with stoplists

A stoplist (or stopword list) is a list of words that should not be included in further analysis, usually because they're so common that they don't add much information to the analysis.

These lists are usually dominated by what are known as function words: words that have a grammatical purpose in the sentence but which do not carry much meaning themselves. For example, the signals that a particular noun follows, but it does not have a meaning by itself. Other function words, such as the preposition after, have a meaning, but they are so common that they tend to get in the way.

On the other hand, chair has a meaning beyond what it's doing in the sentence, and in fact, its role in the sentence will vary (subject, direct object, and so on).

You won't always want to use a stoplist, since it throws away information. However, because function words are more frequent than content words, focusing on the content words can sometimes add clarity to your analysis and its output. It can also speed up processing.

Getting ready

This recipe will build on the work that we've done so far in this chapter. As such, it will use the same project.clj file that we used in the Tokenizing text and Finding sentences recipes:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"

  :dependencies [[org.clojure/clojure "1.6.0"]

                 [clojure-opennlp "0.3.2"]])

However, we'll use a slightly different set of requirements for this recipe:

(require '[opennlp.nlp :as nlp]

         '[clojure.java.io :as io])

We'll also need to have a list of stopwords. You can easily create your own list, but for the purpose of this recipe, we'll use the English stopword list included with the Natural Language Toolkit (http://www.nltk.org/). You can download this from http://nltk.github.com/nltk_data/packages/corpora/stopwords.zip. Unzip it into your project directory and make sure that the stopwords/english file exists.

We'll also use the tokenize and get-sentences functions that we created in the previous two recipes.

How to do it…

We'll need to create a function in order to process and normalize the tokens. Also, we'll need a utility function to load the stopword list. Once these are in place, we'll see how to use the stopwords. To do this, perform the following steps:

1.    The words in the stopword list have been lowercased. We can also do this with the tokens that we create. We'll use the normalize function to handle the lowercasing of each token:

(defn normalize [token-seq]
  (map #(.toLowerCase %) token-seq))

2.    The stoplist will actually be represented by a Clojure set. This will make filtering a lot easier. The load-stopwords function will read in the file, break it into lines, and fold them into a set, as follows:

(defn load-stopwords [filename]
  (with-open [r (io/reader filename)]
    (set (doall (line-seq r)))))

(def is-stopword (load-stopwords "stopwords/english"))

3.    Finally, we can load the tokens. This will break the input into sentences. Then, it will tokenize each sentence, normalize its tokens, and remove its stopwords, as follows:

(def tokens
  (map #(remove is-stopword (normalize (tokenize %)))
       (get-sentences
         "I never saw a Purple Cow.
         I never hope to see one.
         But I can tell you, anyhow.
         I'd rather see than be one.")))

Now, you can see that the tokens returned are more focused on the content and are missing all of the function words:

user=> (pprint tokens)

(("never" "saw" "purple" "cow" ".")

 ("never" "hope" "see" "one" ".")

 ("tell" "," "anyhow" ".")

 ("'d" "rather" "see" "one" "."))

Getting document frequencies

One common and useful metric when working with text corpora is the frequency of each token in the documents. These counts can be computed quite easily by leveraging standard Clojure functions.

Let's see how.

Getting ready

We'll continue building on the previous recipes in this chapter. Because of that, we'll use the same project.clj file:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"

  :dependencies [[org.clojure/clojure "1.6.0"]

                 [clojure-opennlp "0.3.2"]])

We'll also use tokenize, get-sentences, normalize, load-stopwords, and is-stopword from the earlier recipes.

We'll also use the value of the tokens that we saw in the Focusing on content words with stoplists recipe. Here it is again:

(def tokens

  (map #(remove is-stopword (normalize (tokenize %)))

       (get-sentences

         "I never saw a Purple Cow.

         I never hope to see one.

         But I can tell you, anyhow.

         I'd rather see than be one.")))

How to do it…

Of course, the standard function to count items in a sequence is frequencies. We can use this to get the token counts for each sentence, but then we'll also want to fold those into a frequency table using merge-with:

(def token-freqs

  (apply merge-with + (map frequencies tokens)))

We can print or query this table to get the count for any token or piece of punctuation, as follows:

user=> (pprint token-freqs)

{"see" 2,

 "purple" 1,

 "tell" 1,

 "cow" 1,

 "anyhow" 1,

 "hope" 1,

 "never" 2,

 "saw" 1,

 "'d" 1,

 "." 4,

 "one" 2,

 "," 1,

 "rather" 1}

Scaling document frequencies by document size

While raw token frequencies can be useful, they often have one major problem: comparing frequencies across documents is complicated if the document sizes are not the same. If the word customer appears 23 times in a 500-word document and 40 times in a 1,000-word document, which one do you think is more focused on that word? It's difficult to say.

To work around this, it's common to scale the token frequencies for each document by the size of the document. That's what we'll do in this recipe.

Getting ready

We'll continue building on the previous recipes in this chapter. Because of that, we'll use the same project.clj file:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"

  :dependencies [[org.clojure/clojure "1.6.0"]

                 [clojure-opennlp "0.3.2"]])

We'll use the token frequencies that we computed in the Getting document frequencies recipe. We'll keep them bound to the name token-freqs.

How to do it…

The function used to perform this scaling is fairly simple. It calculates the total number of tokens by adding the values from the frequency hashmap and then it walks over the hashmap again, scaling each frequency, as shown here:

(defn scale-by-total [freqs]

  (let [total (reduce + 0 (vals freqs))]

    (->> freqs

         (map #(vector (first %) (/ (second %) total)))

         (into {}))))

We can now use this on token-freqs from the last recipe:

user=> (pprint (scale-by-total token-freqs))

{"see" 2/19,

 "purple" 1/19,

 "tell" 1/19,

 "cow" 1/19,

 "anyhow" 1/19,

 "hope" 1/19,

 "never" 2/19,

 "saw" 1/19,

 "'d" 1/19,

 "." 4/19,

 "one" 2/19,

 "," 1/19,

 "rather" 1/19}

Now, we can easily compare these values to the frequencies generated from other documents.

How it works…

This works by changing all of the raw frequencies into ratios based on each document's size.

These numbers are comparable. In our example, from the introduction to this recipe, 0.046 (23/500) is obviously slightly more than 0.040 (40/1000). However, both of these numbers are ridiculously high. Words that typically occur this much in English are words such as the.
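A quick REPL check of the two numbers from the introduction:

user=> (float (/ 23 500))
0.046
user=> (float (/ 40 1000))
0.04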

Document-scaled frequencies do have problems with shorter texts. For example, take this tweet by the Twitter user @LegoAcademics:

"Dr Brown's random number algorithm is based on the baffling floor sequences chosen by the Uni library elevator".

In this tweet, let's see what the scaled frequency of random is:

(-> (str "Dr Brown's random number algorithm is based "

         "on the baffling floor seqeuences chosen by "

         "the Uni library elevator.")

    tokenize

    normalize

    frequencies

    scale-by-total

    (get "random")

    float)

This gives us 0.05. Again, this is ridiculously high. Most other tweets won't include the term random at all. Because of this, you can still only compare tweets with other tweets.

Scaling document frequencies with TF-IDF

In the last few recipes, we've seen how to generate term frequencies and scale them by the size of the document so that the frequencies from two different documents can be compared.

Term frequencies also have another problem. They don't tell you how important a term is, relative to all of the documents in the corpus.

To address this, we will use term frequency-inverse document frequency (TF-IDF). This metric scales a term's frequency within a document by the inverse of the number of documents in the corpus that contain it.

In this recipe, we'll assemble the parts needed to implement TF-IDF.

Getting ready

We'll continue building on the previous recipes in this chapter. Because of that, we'll use the same project.clj file:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"

  :dependencies [[org.clojure/clojure "1.6.0"]

                 [clojure-opennlp "0.3.2"]])

We'll also use two functions that we've created earlier in this chapter. From the Tokenizing text recipe, we'll use tokenize. From the Focusing on content words with stoplists recipe, we'll use normalize.

Aside from the imports required for these two functions, we'll also want to have this available in our source code or REPL:

(require '[clojure.set :as set])

For this recipe, we'll also need more data than we've been using. For this, we'll use a corpus of State of the Union (SOTU) addresses from United States presidents over time. These are yearly addresses that presidents make where they talk about the events of the past year and outline their priorities over the next twelve months. You can download these from http://www.ericrochester.com/clj-data-analysis/data/sotu.tar.gz. I've unpacked the data from this file into the sotu directory.

How to do it…

The two functions that we'll be coding in this recipe implement the following definitions, where f(t, d) is the raw frequency of the term t in the document d and N is the number of documents in the corpus:

tf(t, d) = 0.5 + (0.5 * f(t, d)) / max{ f(w, d) : w in d }

idf(t, D) = log( N / (1 + |{ d in D : t in d }|) )

So, in English, the function for tf is the frequency of the term t in the document d, scaled by the maximum term frequency in d (which, unless you're using a stoplist, will almost always be the frequency of the term the).

The function for idf is the log of the number of documents (N) divided by one more than the number of documents that contain the term t.
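To make the numbers concrete, here is a tiny worked example with made-up values: a term that occurs 3 times in a document whose most frequent token occurs 10 times, in a corpus of 226 documents, 10 of which contain the term:

;; tf  = 0.5 + (0.5 * 3) / 10
(+ 0.5 (/ (* 0.5 3) 10))      ; => 0.65
;; idf = log(226 / (1 + 10))
(Math/log (/ 226 (inc 10)))   ; => roughly 3.02
;; tf-idf is their product, roughly 1.96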

These equations break the problem down well. We can write a function for each one of these. We'll also create a number of other functions to help us along. Let's get started:

1.    For the first function, we'll implement the tf component of the equation. This is a transparent translation of the tf function from earlier. It takes a term's frequency and the maximum term frequency from the same document, as follows:

(defn tf [term-freq max-freq]
  (+ 0.5 (/ (* 0.5 term-freq) max-freq)))

2.    Now, we'll do the most basic implementation of idf. Like the tf function used earlier, it's a close match to the idf equation:

(defn idf [corpus term]
  (Math/log
    (/ (count corpus)
       (inc (count
              (filter #(contains? % term) corpus))))))

3.    Now, we'll take a short detour in order to optimize prematurely. The IDF value for a term is the same across the entire corpus, but if we're not careful, we'll code this so that it's recomputed for every document. For example, the IDF value for the will be the same, no matter how many times the actually occurs in the current document. We can precompute these values and cache them. However, before we can do that, we'll need to obtain the set of all the terms in the corpus. The get-corpus-terms function does this, as shown here:

(defn get-corpus-terms [corpus]
  (->> corpus
       (map #(set (keys %)))
       (reduce set/union #{})))

4.    The get-idf-cache function takes a corpus, extracts its term set, and returns a hashmap associating the terms with their IDF values, as follows:

(defn get-idf-cache [corpus]
  (reduce #(assoc %1 %2 (idf corpus %2)) {}
          (get-corpus-terms corpus)))

5.    Now, the tf-idf function is our lowest-level function that combines tf and idf. It just takes the raw parameters, including the cached IDF value, and performs the necessary calculations:

(defn tf-idf [idf-value freq max-freq]
  (* (tf freq max-freq) idf-value))

6.    The tf-idf-pair function sits immediately on top of tf-idf. It looks up the IDF value in the cache, and it takes a term/raw-frequency pair as one of its parameters. It returns the pair with the frequency replaced by the TF-IDF value for that term:

(defn tf-idf-pair [idf-cache max-freq pair]
  (let [[term freq] pair]
    [term (tf-idf (idf-cache term) freq max-freq)]))

7.    Finally, the tf-idf-freqs function controls the entire process. It takes an IDF cache and a frequency hashmap, and it scales the frequencies in the hashmap into their TF-IDF equivalents, as follows:

(defn tf-idf-freqs [idf-cache freqs]
  (let [max-freq (reduce max 0 (vals freqs))]
    (->> freqs
         (map #(tf-idf-pair idf-cache max-freq %))
         (into {}))))

Now, we have all of the pieces in place to use this.

1.    For this example, we'll read all of the State of the Union addresses into a sequence of raw frequency hashmaps. This will be bound to the name corpus:

(def corpus
  (->> "sotu"
       (java.io.File.)
       (.list)
       (map #(str "sotu/" %))
       (map slurp)
       (map tokenize)
       (map normalize)
       (map frequencies)))

2.    We'll use these frequencies to create the IDF cache and bind it to the name cache:

(def cache (get-idf-cache corpus))

3.    Now, actually calling tf-idf-freqs on these frequencies is straightforward, as shown here:

(def freqs (map #(tf-idf-freqs cache %) corpus))

How it works…

TF-IDF scales the raw token frequencies by the number of documents they occur in within the corpus. This identifies the distinguishing words for each document. After all, if a word occurs in almost every document, it won't distinguish any of them. However, if a word is only found in one document, it helps to distinguish that document.

For example, here are the 10 most distinguishing words from the first SOTU address:

user=> (doseq [[term idf-freq] (->> freqs

                                   first

                                   (sort-by second)

                                   reverse

                                   (take 10))]

         (println [term idf-freq ((first corpus) term)]))

[intimating 2.39029215473352 1]

[licentiousness 2.39029215473352 1]

[discern 2.185469574348983 1]

[inviolable 2.0401456408424132 1]

[specify 1.927423640693998 1]

[comprehending 1.8353230604578765 1]

[novelty 1.8353230604578765 1]

[well-digested 1.8353230604578765 1]

[cherishing 1.8353230604578765 1]

[cool 1.7574531294111173 1]

You can see that these words all occur once in this document, and in fact, intimating and licentiousness are only found in the first SOTU, and all 10 of these words are found in six or fewer addresses.

Finding people, places, and things with Named Entity Recognition

One thing that's fairly easy to pull out of documents is named items. This includes things such as people's names, organizations, locations, and dates. The algorithms that do this are called Named Entity Recognition (NER), and while they are not perfect, they're generally pretty good. Error rates under 0.1 are normal.

The OpenNLP library has classes to perform NER, and depending on what you train them with, they will identify people, locations, dates, or a number of other things. The clojure-opennlp library also exposes these classes in a good, Clojure-friendly way.

Getting ready

We'll continue building on the previous recipes in this chapter. Because of this, we'll use the same project.clj file:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"

  :dependencies [[org.clojure/clojure "1.6.0"]

                 [clojure-opennlp "0.3.2"]])

From the Tokenizing text recipe, we'll use tokenize, and from the Focusing on content words with stoplists recipe, we'll use normalize.

Pretrained models can be downloaded from http://opennlp.sourceforge.net/models-1.5/. I downloaded en-ner-person.bin, en-ner-organization.bin, en-ner-date.bin, en-ner-location.bin, and en-ner-money.bin. Then, I saved these models in models/.

How to do it…

To set things up, we have to load the models and bind them to function names. To load the models, we'll use the opennlp.nlp/make-name-finder function. We can use this to load each recognizer individually, as follows:

(def get-persons

  (nlp/make-name-finder "models/en-ner-person.bin"))

(def get-orgs

  (nlp/make-name-finder "models/en-ner-organization.bin"))

(def get-date

  (nlp/make-name-finder "models/en-ner-date.bin"))

(def get-location

  (nlp/make-name-finder "models/en-ner-location.bin"))

(def get-money

  (nlp/make-name-finder "models/en-ner-money.bin"))

Now, in order to test this out, let's load the latest SOTU address in our corpus. This is Barack Obama's 2013 State of the Union:

(def sotu (tokenize (slurp "sotu/2013-0.txt")))

We can call each of these functions on the tokenized text to see the results, as shown here:

user=> (get-persons sotu)

("John F. Kennedy" "Most Americans—Democrats" "Government" "John McCain" "Joe Lieberman" "So" "Tonight I" "Joe Biden" "Joe" "Tonight" "Al Qaida" "Russia" "And" "Michelle" "Hadiya Pendleton" "Gabby Giffords" "Menchu Sanchez" "Desiline Victor" "Brian Murphy" "Brian")

user=> (get-orgs sotu)

("Congress" "Union" "Nation" "America" "Tax" "Apple" "Department of Defense and Energy" "CEOs" "Siemens America—a" "New York Public Schools" "City University of New York" "IBM" "American" "Higher Education" "Federal" "Senate" "House" "CEO" "European Union" "It")

user=> (get-date sotu)

("this year" "18 months ago" "Last year" "Today" "last 15" "2007" "today" "tomorrow" "20 years ago" "This" "last year" "This spring" "next year" "2014" "next two decades" "next month" "a")

user=> (get-location sotu)

("Washington" "United States of America" "Earth" "Japan" "Mexico" "America" "Youngstown" "Ohio" "China" "North Carolina" "Georgia" "Oklahoma" "Germany" "Brooklyn" "Afghanistan" "Arabian Peninsula" "Africa" "Libya" "Mali" "And" "North Korea" "Iran" "Russia" "Asia" "Atlantic" "United States" "Rangoon" "Burma" "the Americas" "Europe" "Middle East" "Egypt" "Israel" "Chicago" "Oak Creek" "New York City" "Miami" "Wisconsin")

user=> (get-money sotu)

("$ 2.5 trillion" "$ 4 trillion" "$ 140 to" "$")

How it works…

When you glance at the results, you can see that the models appear to have performed well. To be certain, of course, we'd need to look into the document and see what they missed.

The process to use this is similar to the tokenizer or sentence chunker: load the model from a file and then call the result as a function.
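The finders work on any tokenized text, not just whole documents. As a quick, illustrative check (the sentence is made up, and the exact results will depend on the model's training data):

(get-persons
  (tokenize "Dwight Eisenhower delivered an address to Congress."))
;; Ideally something like ("Dwight Eisenhower"), though the model can
;; miss names or over-match, as we saw with the full address above.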

Mapping documents to a sparse vector space representation

Many text algorithms deal with vector space representations of the documents. This means that the documents are normalized into vectors. Each individual token type is assigned one position across all the documents' vectors. For instance, text might have position 42, so index 42 in all the document vectors will have the frequency (or other value) of the word text.

However, most documents won't have anything for most words. This makes them sparse vectors, and we can use more efficient formats for them.

The Colt library (http://acs.lbl.gov/ACSSoftware/colt/) contains implementations of sparse vectors. For this recipe, we'll see how to read a collection of documents into these.

Getting ready

For this recipe, we'll need the following in our project.clj file:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"

  :dependencies [[org.clojure/clojure "1.6.0"]

                 [clojure-opennlp "0.3.2"]

                 [colt/colt "1.2.0"]])

For our script or REPL, we'll need these libraries:

(require '[clojure.set :as set]

         '[opennlp.nlp :as nlp])

(import [cern.colt.matrix DoubleFactory2D])

From the previous recipes, we'll use several functions: tokenize from the Tokenizing text recipe, normalize from the Focusing on content words with stoplists recipe, and get-corpus-terms from the Scaling document frequencies with TF-IDF recipe.

For the data, we'll again use the State of the Union addresses that we first saw in the Scaling document frequencies with TF-IDF recipe. You can download these from http://www.ericrochester.com/clj-data-analysis/data/sotu.tar.gz. I've unpacked the data from this file into the sotu directory.

How to do it…

In order to create vectors of all the documents, we'll first need to create a token index that maps tokens to the indexes in the vector. We'll then use that to create a sequence of Colt vectors. Finally, we can load the SOTU addresses and generate sparse feature vectors of all the documents, as follows:

1.    Before we can create the feature vectors, we need to have a token index so that the vector indexes will be consistent across all of the documents. The build-index function takes care of this:

(defn build-index [corpus]
  (into {}
        (zipmap (get-corpus-terms corpus)
                (range))))

2.    Now, we can use this index to convert a sequence of token-frequency pairs into a feature vector. All of the tokens must be in the index:

(defn ->matrix [index pairs]
  (let [matrix (.make DoubleFactory2D/sparse
                      1 (count index) 0.0)
        inc-cell (fn [m p]
                   (let [[k v] p
                         i (index k)]
                     (.set m 0 i v)
                     m))]
    (reduce inc-cell matrix pairs)))

With these in place, let's make use of them by loading the token frequencies for a corpus and then creating the index from them:

(def corpus

  (->> "sotu"

       (java.io.File.)

       (.list)

       (map #(str "sotu/" %))

       (map slurp)

       (map tokenize)

       (map normalize)

       (map frequencies)))

(def index (build-index corpus))

With the index, we can finally move the information of the document frequencies into sparse vectors:

(def vecs (map #(->matrix index %) corpus))
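To sanity-check the result, we can poke at one of the vectors. These are illustrative REPL calls; the term government is just an example, and the exact numbers depend on your corpus:

;; The value stored for one term in the first document's vector.
(.get (first vecs) 0 (index "government"))

;; The number of non-zero cells, that is, distinct terms in that document.
(.cardinality (first vecs))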

Performing topic modeling with MALLET

Previously in this chapter, we looked at a number of ways to programmatically see what's present in documents. We saw how to identify people, places, dates, and other things in documents. We saw how to break things up into sentences.

Another, more sophisticated way to discover what's in a document is to use topic modeling. Topic modeling attempts to identify a set of topics that are contained in the document collection. Each topic is a cluster of words that are used together throughout the corpus. These clusters are found in individual documents to varying degrees, and a document is composed of several topics to varying extents. We'll take a look at this in more detail in the explanation for this recipe.

To perform topic modeling, we'll use MALLET (http://mallet.cs.umass.edu/). This is a library and utility that implements topic modeling in addition to several other document classification algorithms.

Getting ready

For this recipe, we'll need these lines in our project.clj file:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"

  :dependencies [[org.clojure/clojure "1.6.0"]

                 [cc.mallet/mallet "2.0.7"]])

Our imports and requirements for this are pretty extensive too, as shown here:

(require '[clojure.java.io :as io])

(import [cc.mallet.util.*]

        [cc.mallet.types InstanceList]

        [cc.mallet.pipe

         Input2CharSequence TokenSequenceLowercase

         CharSequence2TokenSequence SerialPipes

         TokenSequenceRemoveStopwords

         TokenSequence2FeatureSequence]

        [cc.mallet.pipe.iterator FileListIterator]

        [cc.mallet.topics ParallelTopicModel]

        [java.io FileFilter])

Again, we'll use the State of the Union addresses that we've already seen several times in this chapter. You can download these from http://www.ericrochester.com/clj-data-analysis/data/sotu.tar.gz. I've unpacked the data from this file into the sotu directory.

How to do it…

We'll need to work the documents through several phases to perform topic modeling, as follows:

1.    Before we can process any documents, we'll need to create a processing pipeline. This defines how the documents should be read, tokenized, normalized, and so on:

(defn make-pipe-list []
  (InstanceList.
    (SerialPipes.
      [(Input2CharSequence. "UTF-8")
       (CharSequence2TokenSequence.
         #"\p{L}[\p{L}\p{P}]+\p{L}")
       (TokenSequenceLowercase.)
       (TokenSequenceRemoveStopwords. false false)
       (TokenSequence2FeatureSequence.)])))

2.    Now, we'll create a function that takes the processing pipeline and a directory of data files, and it will run the files through the pipeline. This populates the InstanceList, which is a collection of documents along with their metadata:

(defn add-directory-files [instance-list corpus-dir]
  (.addThruPipe
    instance-list
    (FileListIterator.
      (.listFiles (io/file corpus-dir))
      (reify FileFilter
        (accept [this pathname] true))
      #"/([^/]*).txt$"
      true)))

3.    The last function takes the InstanceList and some other parameters and trains a topic model, which it returns:

(defn train-model
  ([instances] (train-model 100 4 50 instances))
  ([num-topics num-threads num-iterations instances]
   (doto (ParallelTopicModel. num-topics 1.0 0.01)
     (.addInstances instances)
     (.setNumThreads num-threads)
     (.setNumIterations num-iterations)
     (.estimate))))

Now, we can take these three functions and use them to train a topic model. While training, it will output some information about the process, and finally, it will list the top terms for each topic:

user=> (def pipe-list (make-pipe-list))

user=> (add-directory-files pipe-list "sotu/")

user=> (def tm (train-model 10 4 50 pipe-list))

INFO:

0       0.1     government federal year national congress war

1       0.1     world nation great power nations people

2       0.1     world security years programs congress program

3       0.1     law business men work people good

4       0.1     america people americans american work year

5       0.1     states government congress public people united

6       0.1     states public made commerce present session

7       0.1     government year department made service legislation

8       0.1     united states congress act government war

9       0.1     war peace nation great men people

How it works…

It's difficult to succinctly and clearly explain how topic modeling works. Conceptually, it assigns words from the documents to buckets (topics). This is done in such a way that randomly drawing words from the buckets will most probably recreate the documents.

Interpreting the topics is always interesting. Generally, it involves taking a look at the top words for each topic and cross-referencing them with the documents that scored most highly for this topic.

For example, take the fourth topic, with the top words law, business, men, and work. The top-scoring document for this topic was the 1908 SOTU, with a distribution of 0.378. It was given by Theodore Roosevelt, and in his speech, he talked a lot about labor issues and legislation to rein in corrupt corporations. All of these words were used a lot, but exactly what the topic is about isn't evident without actually taking a look at the document itself.
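If you want to track down those top-scoring documents yourself, the trained model can report each document's topic distribution. Here is a minimal sketch using the tm model trained above; the document and topic indexes are arbitrary examples:

;; The topic distribution for the first document in the instance list.
(def doc-topics (.getTopicProbabilities tm 0))

;; The proportion of that document assigned to topic 3.
(aget doc-topics 3)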

See also…

There are a number of good papers and tutorials on topic modeling. There's a good tutorial written by Shawn Graham, Scott Weingart, and Ian Milligan at http://programminghistorian.org/lessons/topic-modeling-and-mallet

For a more rigorous explanation, check out Mark Steyvers's introduction Probabilistic Topic Models, which you can see at http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted.pdf

For some information on how to evaluate the topics that you get, see http://homepages.inf.ed.ac.uk/imurray2/pub/09etm

Performing naïve Bayesian classification with MALLET

MALLET has gotten its reputation as a library for topic modeling. However, it also has a lot of other algorithms in it.

One popular algorithm that MALLET implements is naïve Bayesian classification. If you have documents that are already divided into categories, you can train a classifier to categorize new documents into those same categories. Often, this works surprisingly well.

One common use for this is in spam e-mail detection. We'll use this as our example here too.

Getting ready

We'll need to have MALLET included in our project.clj file:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"

  :dependencies [[org.clojure/clojure "1.6.0"]

                 [cc.mallet/mallet "2.0.7"]])

Just as in the Performing topic modeling with MALLET recipe, the list of classes to be included is a little long, but most of them are for the processing pipeline, as shown here:

(require '[clojure.java.io :as io])

(import [cc.mallet.util.*]

        [cc.mallet.types InstanceList]

        [cc.mallet.pipe

         Input2CharSequence TokenSequenceLowercase

         CharSequence2TokenSequence SerialPipes

         SaveDataInSource Target2Label

         TokenSequence2FeatureSequence

         TokenSequenceRemoveStopwords

         FeatureSequence2AugmentableFeatureVector]

        [cc.mallet.pipe.iterator FileIterator]

        [cc.mallet.classify NaiveBayesTrainer])

For data, we can get preclassified emails from the SpamAssassin website. Take a look at https://spamassassin.apache.org/publiccorpus/. From this directory, I downloaded 20050311_spam_2.tar.bz2, 20030228_easy_ham_2.tar.bz2, and 20030228_hard_ham.tar.bz2. I decompressed these into the training directory. This added three subdirectories: training/easy_ham_2, training/hard_ham, and training/spam_2.

I also downloaded two other archives: 20021010_hard_ham.tar.bz2 and 20021010_spam.tar.bz2. I decompressed these into the test-data directory in order to create the test-data/hard_ham and test-data/spam directories.

How to do it…

Now, we can define the functions to create the processing pipeline and a list of document instances, as well as to train the classifier and classify the documents:

1.    We'll create the processing pipeline separately. A single instance of this has to be used to process all of the training, test, and actual data. Hang on to this:

(defn make-pipe-list []
  (SerialPipes.
    [(Target2Label.)
     (SaveDataInSource.)
     (Input2CharSequence. "UTF-8")
     (CharSequence2TokenSequence.
       #"\p{L}[\p{L}\p{P}]+\p{L}")
     (TokenSequenceLowercase.)
     (TokenSequenceRemoveStopwords.)
     (TokenSequence2FeatureSequence.)
     (FeatureSequence2AugmentableFeatureVector.
       false)]))

2.    We can use that to create the instance list over the files in a directory. When we do, we'll use each document's parent directory name as its classification. This is what we'll train the classifier on:

(defn add-input-directory [dir-name pipe]
  (doto (InstanceList. pipe)
    (.addThruPipe
      (FileIterator. (io/file dir-name)
                     #".*/([^/]*?)/\d+\..*$"))))

3.    Finally, these two utility functions are relatively short and not strictly necessary, but they're good to have:

(defn train [instance-list]
  (.train (NaiveBayesTrainer.) instance-list))

(defn classify [bayes instance-list]
  (.classify bayes instance-list))

Now, we can use these functions to load the training documents from the training directory, train the classifier, and use it to classify the test files:

(def pipe (make-pipe-list))

(def instance-list (add-input-directory "training" pipe))

(def bayes (train instance-list))

Now we can use it to classify the test files.

(def test-list (add-input-directory "test-data" pipe))

(def classes (classify bayes test-list))

Finding the results just takes digging into the class structure:

user=> (.. (first (seq classes)) getLabeling getBestLabel

           toString)

"hard_ham"

We can use this to construct a matrix that shows how the classifier performs, as follows:

 

                 Expected ham   Expected spam
Actually ham     246            99
Actually spam    4              402

From this confusion matrix, you can see that it does pretty well. Moreover, it errs on the side of misclassifying spam as ham. This is good because it means that we'd only need to dig into our spam folder for four emails.
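For a single summary number, the overall accuracy implied by this table is:

;; (correct ham + correct spam) / all test messages
(double (/ (+ 246 402) (+ 246 99 4 402)))
;; => roughly 0.86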

How it works…

Naïve Bayesian classifiers work by starting with a reasonable guess about how likely a set of features are to be marked as spam. Often, this might be 50/50. Then, as it sees more and more documents and their classifications, it modifies this model, getting better results.

For example, it might notice that the word free is found in 100 ham emails but in 900 spam emails. This makes it a very strong indicator of spam, and the classifier will update its expectations accordingly. It then combines all of the relevant probabilities from the features it sees in a document in order to classify it one way or the other.
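As a rough, illustrative calculation using those counts (and ignoring priors and smoothing), the word free by itself would push the estimate strongly toward spam:

;; 900 of the 1,000 messages containing "free" were spam, so that
;; feature alone suggests spam with a probability of about 0.9,
;; all else being equal.
(double (/ 900 (+ 900 100)))
;; => 0.9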

There's more…

Alexandru Nedelcu has a good introduction to Bayesian modeling and classifiers at https://www.bionicspirit.com/blog/2012/02/09/howto-build-naive-bayes-classifier.html

See also…

We'll take a look at how to use the Weka machine learning library to train a naïve Bayesian classifier in order to sort edible and poisonous mushrooms in the Classifying data with the Naïve Bayesian classifier recipe in Chapter 9, Clustering, Classifying, and Working with Weka.