Mastering Machine Learning with Python in Six Steps: A Practical Implementation Guide to Predictive Data Analytics Using Python (2017)

5. Step 5 – Text Mining and Recommender Systems

Manohar Swamynathan1

(1)

Bangalore, Karnataka, India

One of the key areas of artificial intelligence is Natural Language Processing (NLP), commonly known as text mining, which deals with teaching computers how to extract meaning from text. Over the last two decades, with the explosion of the Internet and the rise of social media, a vast amount of valuable data has been generated in the form of text. The process of unearthing meaningful patterns from text data is called text mining. In this chapter you’ll get a high-level overview of the text mining process and learn its key concepts and common techniques.

Apart from scikit-learn, there are a number of established NLP-focused libraries available for Python, and the number has been growing over time. Refer to Table 5-1 for the most popular libraries, ranked by their number of contributors as of 2016.

Table 5-1.

Python popular text mining libraries

Package Name | # of Contributors (2016) | License | Description
NLTK | 187 | Apache | The most popular and widely used toolkit, predominantly built to support research and development in NLP.
Gensim | 154 | LGPL-2 | Mainly built for large-corpus topic modeling, document indexing, and similarity retrieval.
spaCy | 68 | MIT | Built using Python + Cython for efficient production implementation of NLP concepts.
Pattern | 20 | BSD-3 | A web mining module for Python, with capabilities for scraping, NLP, machine learning, and network analysis/visualization.
Polyglot | 13 | GPL-3 | A multilingual text processing toolkit that supports massive multilingual applications.
Textblob | 11 | MIT | A wrapper around the NLTK and Pattern libraries for easy access to their capabilities. Suitable for fast prototyping.

Note

Another well-known library is Stanford CoreNLP, a Java-based NLP toolkit suite. A number of Python wrappers are available for it; however, the number of contributors for these wrappers is on the lower side as of now.

Text Mining Process Overview


Figure 5-1.

Text Mining Process Overview

The overall text mining process can be broadly categorized into four phases:

1. Text Data Assemble

2. Text Data Preprocessing

3. Data Exploration or Visualization

4. Model Building

Data Assemble (Text)

It is observed that about 70% of the data available to any business is unstructured. The first step is collating unstructured data from different sources such as open-ended feedback, phone calls, email support, online chat, and social media networks like Twitter, LinkedIn, and Facebook. Assembling this data and applying mining/machine learning techniques to analyze it provides valuable opportunities for organizations to enrich the customer experience.

There are several libraries available for extracting text content from the different formats discussed above. By far the best library providing a simple, single interface for multiple formats is ‘textract’ (open source, MIT license). Note that as of now this library/package is available for Linux and Mac OS but not Windows. Table 5-2 shows the list of supported formats.

Table 5-2.

textract supported formats

Format | Supported Via | Additional Info
.csv / .eml / .json / .odt / .txt | Python built-ins |
.doc | Antiword | http://www.winfield.demon.nl/
.docx | Python-docx | https://python-docx.readthedocs.io/en/latest/
.epub | Ebooklib | https://github.com/aerkalov/ebooklib
.gif / .jpg / .jpeg / .png / .tiff / .tif | tesseract-ocr | https://github.com/tesseract-ocr
.html / .htm | Beautifulsoup4 | http://beautiful-soup-4.readthedocs.io/en/latest/
.mp3 / .ogg / .wav | SpeechRecognition and sox | https://pypi.python.org/pypi/SpeechRecognition/ and http://sox.sourceforge.net/
.msg | msg-extractor | https://github.com/mattgwwalker/msg-extractor
.pdf | pdftotext and pdfminer.six | https://poppler.freedesktop.org/ and https://github.com/pdfminer/pdfminer.six
.pptx | Python-pptx | https://python-pptx.readthedocs.io/en/latest/
.ps | ps2text | http://pages.cs.wisc.edu/~ghost/doc/pstotext.htm
.rtf | Unrtf | http://www.gnu.org/software/unrtf/
.xlsx / .xls | Xlrd | https://pypi.python.org/pypi/xlrd

Let’s look at the code for the most widespread formats in the business world: pdf, jpg, and audio files. Note that extracting text from other formats is also relatively simple. See Listing 5-1.

# You can read/learn more about the latest updates on textract at the official documentation site at http://textract.readthedocs.io/en/latest/
import textract

# Extracting text from normal pdf
text = textract.process('Data/PDF/raw_text.pdf', language='eng')

# Extracting text from two columned pdf
text = textract.process('Data/PDF/two_column.pdf', language='eng')

# Extracting text from scanned text pdf
text = textract.process('Data/PDF/ocr_text.pdf', method='tesseract', language='eng')

# Extracting text from jpg
text = textract.process('Data/jpg/raw_text.jpg', method='tesseract', language='eng')

# Extracting text from audio file
text = textract.process('Data/wav/raw_text.wav', language='eng')

Listing 5-1.

Example code for extracting data from pdf, jpg, audio

Social Media

Did you know that Twitter, the online news and social networking service, has 320 million users, with an average of 42 million active tweets every day? (Source: Global social media research summary 2016 by smartinsights.)

Let’s understand how to explore the rich information in social media (I’ll use Twitter as an example) to learn what is being said about a chosen topic. Most of these forums provide APIs for developers to access the posts. See Figure 5-2.


Figure 5-2.

Pulling Twitter posts for analysis

Step 1 – Get Access Key (One-Time Activity)

Follow the steps below to set up a new Twitter app and get the consumer/access key, secret, and token (do not share the key or token with unauthorized persons).

· Go to https://apps.twitter.com/

· Click on ‘Create New App’

· Fill the required information and click on ‘Create your Twitter Application’

· You’ll get the access details under ‘Keys and Access Tokens’ tab

Step 2 – Fetching Tweets

Once you have the authorization secret and access tokens, you can use the code in Listing 5-2 to establish a connection.

#Import the necessary methods from tweepy library
import tweepy
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

#provide your access details below
access_token = "Your token goes here"
access_token_secret = "Your token secret goes here"
consumer_key = "Your consumer key goes here"
consumer_secret = "Your consumer secret goes here"

# establish a connection
auth = tweepy.auth.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

Listing 5-2.

Twitter authentication

Let’s assume that you would like to understand what is being said about the iPhone 7 and its camera feature, so let’s pull the most recent 10 posts.

Note

You can pull historic user posts about a topic for a maximum of only 10 to 15 days, depending on the volume of the posts.

#fetch recent 10 tweets containing words iphone7, camera
fetched_tweets = api.search(q=['iPhone 7','iPhone7','camera'], result_type='recent', lang='en', count=10)
print "Number of tweets: ", len(fetched_tweets)
#----output----
Number of tweets: 10

# Print the tweet text
for tweet in fetched_tweets:
    print 'Tweet ID: ', tweet.id
    print 'Tweet Text: ', tweet.text, '\n'

#----output----

Tweet ID: 825155021390049281

Tweet Text: RT @volcanojulie: A Tau Emerald dragonfly. The iPhone 7 camera is exceptional!

#nature #insect #dragonfly #melbourne #australia #iphone7 #...

Tweet ID: 825086303318507520

Tweet Text: Fuzzy photos? Protect your camera lens instantly with #iPhone7 Full Metallic Case. Buy now! https://t.co/d0dX40BHL6 https://t.co/AInlBoreht

You can capture useful features in a dataframe for further analysis. See Listing 5-3.

import pandas as pd

# function to save required basic tweets info to a dataframe
def populate_tweet_df(tweets):
    # Create an empty dataframe
    df = pd.DataFrame()
    df['id'] = list(map(lambda tweet: tweet.id, tweets))
    df['text'] = list(map(lambda tweet: tweet.text, tweets))
    df['retweeted'] = list(map(lambda tweet: tweet.retweeted, tweets))
    df['place'] = list(map(lambda tweet: tweet.user.location, tweets))
    df['screen_name'] = list(map(lambda tweet: tweet.user.screen_name, tweets))
    df['verified_user'] = list(map(lambda tweet: tweet.user.verified, tweets))
    df['followers_count'] = list(map(lambda tweet: tweet.user.followers_count, tweets))
    df['friends_count'] = list(map(lambda tweet: tweet.user.friends_count, tweets))
    # A highly popular user's tweet could be seen by a large audience, so let's check the popularity of the user
    df['friendship_coeff'] = list(map(lambda tweet: float(tweet.user.followers_count)/float(tweet.user.friends_count), tweets))
    return df

df = populate_tweet_df(fetched_tweets)
print df.head(10)

#---output----

id text

0 825155021390049281 RT @volcanojulie: A Tau Emerald dragonfly. The...

1 825086303318507520 Fuzzy photos? Protect your camera lens instant...

2 825064476714098690 RT @volcanojulie: A Tau Emerald dragonfly. The...

3 825062644986023936 RT @volcanojulie: A Tau Emerald dragonfly. The...

4 824935025217040385 RT @volcanojulie: A Tau Emerald dragonfly. The...

5 824933631365779458 A Tau Emerald dragonfly. The iPhone 7 camera i...

6 824836880491483136 The camera on the IPhone 7 plus is fucking awe...

7 823805101999390720 'Romeo and Juliet' Ad Showcases Apple's iPhone...

8 823804251117850624 iPhone 7 Images Show Bigger Camera Lens - I ha...

9 823689306376196096 RT @computerworks5: Premium HD Selfie Stick &a...

retweeted place screen_name verified_user

0 False Melbourne, Victoria MonashEAE False

1 False California, USA ShopNCURV False

2 False West Islip, Long Island, NY FusionWestIslip False

3 False 6676 Fresh Pond Rd Queens, NY FusionRidgewood False

4 False Iphone7review False

5 False Melbourne; Monash University volcanojulie False

6 False Hollywood, FL Hbk_Mannyp False

7 False Toronto.NYC.the Universe AaronRFernandes False

8 False Lagos, Nigeria moyinoluwa_mm False

9 False Iphone7review False

followers_count friends_count friendship_coeff

0 322 388 0.829897

1 279 318 0.877358

2 13 193 0.067358

3 45 218 0.206422

4 199 1787 0.111360

5 398 551 0.722323

6 57 64 0.890625

7 18291 7 2613.000000

8 780 302 2.582781

9 199 1787 0.111360

Listing 5-3.

Save features to dataframe

Instead of a topic, you can also choose a screen name focused on a topic; let’s look at the posts by the screen name Iphone7review. See Listing 5-4.

# For help about api look here http://tweepy.readthedocs.org/en/v2.3.0/api.html
fetched_tweets = api.user_timeline(id='Iphone7review', count=5)

# Print the tweet text
for tweet in fetched_tweets:
    print 'Tweet ID: ', tweet.id
    print 'Tweet Text: ', tweet.text, '\n'

#----output----

Tweet ID: 825169063676608512

Tweet Text: RT @alicesttu: iPhone 7S to get Samsung OLED display next year #iPhone https://t.co/BylKbvXgAG #iphone

Tweet ID: 825169047138533376

Tweet Text: Nothing beats the Iphone7! Who agrees? #Iphone7 https://t.co/e03tXeLOao

Listing 5-4.

Example code for extracting tweets based on screen name

Glancing through the posts can quickly help you conclude that there are positive comments about the camera feature of the iPhone 7.

Data Preprocessing (Text)

This step deals with cleansing the consolidated text to remove noise, ensuring efficient syntactic and semantic text analysis for deriving meaningful insights. Some common cleaning steps are described below.

Convert to Lower Case and Tokenize

Here, all the data is converted to lowercase. This prevents words like “LIKE” and “Like” from being interpreted as different words. Python provides the lower() string method to convert text to lowercase.
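The lowercase step can be sketched in a couple of lines (shown here in Python 3 print syntax; the sample string is made up for illustration):

```python
# Lowercasing ensures "LIKE" and "Like" map to the same token
raw = "LIKE this product? I Like it!"
normalized = raw.lower()
print(normalized)  # like this product? i like it!
```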

Tokenizing is the process of breaking a large set of texts into smaller meaningful chunks such as sentences, words, and phrases.

Sentence Tokenizing

The NLTK library provides sent_tokenize for sentence-level tokenizing, which uses a pre-trained model, PunktSentenceTokenizer, to determine punctuation and characters marking the end of a sentence for European languages. See Listing 5-5.

import nltk
from nltk.tokenize import sent_tokenize

text = 'Statistics skills, and programming skills are equally important for analytics. Statistics skills, and domain knowledge are important for analytics. I like reading books and travelling.'
sent_tokenize_list = sent_tokenize(text)
print(sent_tokenize_list)
#----output----
['Statistics skills, and programming skills are equally important for analytics.', 'Statistics skills, and domain knowledge are important for analytics.', 'I like reading books and travelling.']

Listing 5-5.

Example code for sentence tokenization

NLTK supports sentence tokenization for a total of 17 European languages. You can load the tokenizer model for a specific language, saved as a pickle file, as part of nltk.data. See Listing 5-6.

import nltk.data

spanish_tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle')
spanish_tokenizer.tokenize('Hola. Esta es una frase espanola.')
#----output----
['Hola.', 'Esta es una frase espanola.']

Listing 5-6.

Sentence tokenization for European languages

Word Tokenizing

The word_tokenize function of NLTK is a wrapper that calls tokenize on an instance of TreebankWordTokenizer. See Listing 5-7.

from nltk.tokenize import word_tokenize

print word_tokenize(text)

# Another equivalent call method using TreebankWordTokenizer
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
print tokenizer.tokenize(text)
#----output----
['Statistics', 'skills', ',', 'and', 'programming', 'skills', 'are', 'equally', 'important', 'for', 'analytics', '.', 'Statistics', 'skills', ',', 'and', 'domain', 'knowledge', 'are', 'important', 'for', 'analytics', '.', 'I', 'like', 'reading', 'books', 'and', 'travelling', '.']

Listing 5-7.

Example code for word tokenizing

Removing Noise

You should remove all information that is not relevant to the text analysis; such elements can be seen as noise. The most common types of noise are numbers, punctuation, stop words, white space, etc.

Numbers: Numbers are removed, as they may not be relevant and may not hold valuable information. See Listing 5-8.

import re

# Function to remove numbers
def remove_numbers(text):
    return re.sub(r'\d+', '', text)

text = 'This is a sample English sentence, \n with whitespace and numbers 1234!'
print 'Removed numbers: ', remove_numbers(text)
#----output----
Removed numbers: This is a sample English sentence,
with whitespace and numbers !

Listing 5-8.

Example code for removing noise from text

Punctuation: Punctuation characters are removed from the dataset to better identify each word. For example, “like” and “like!” or “coca-cola” and “cocacola” would be interpreted as different words if the punctuation were not removed. See Listing 5-9.

import string

# Function to remove punctuations
def remove_punctuations(text):
    words = nltk.word_tokenize(text)
    punt_removed = [w for w in words if w.lower() not in string.punctuation]
    return " ".join(punt_removed)

print remove_punctuations('This is a sample English sentence, with punctuations!')
#----output----
This is a sample English sentence with punctuations

Listing 5-9.

Example code for removing punctuation from text

Stop words: Words like “the,” “and,” and “or” are uninformative and add unneeded noise to the analysis, so they are removed. See Listing 5-10.

from nltk.corpus import stopwords

# Function to remove stop words
def remove_stopwords(text, lang='english'):
    words = nltk.word_tokenize(text)
    lang_stopwords = stopwords.words(lang)
    stopwords_removed = [w for w in words if w.lower() not in lang_stopwords]
    return " ".join(stopwords_removed)

print remove_stopwords('This is a sample English sentence')
#----output----
sample English sentence

Listing 5-10.

Example code for removing stop words from text

Note

Remove your own stop words (if required): certain words may be very commonly used in a particular domain. Along with the English stop words, we could instead, or in addition, remove our own stop words. The choice of custom stop words depends on the domain of discourse and might not become apparent until you have done some analysis.
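As a sketch of this idea, the standard stop word list can be extended with domain terms. To keep the example self-contained, a small hard-coded set stands in for nltk.corpus.stopwords.words('english'), and the domain words ('iphone', 'camera') are hypothetical (written in Python 3 syntax):

```python
# Small stand-in for the standard English stop word list
english_stopwords = {'this', 'is', 'a', 'the', 'and', 'or'}
# Hypothetical domain-specific stop words discovered during analysis
domain_stopwords = {'iphone', 'camera'}
all_stopwords = english_stopwords | domain_stopwords

def remove_custom_stopwords(text):
    # drop any token found in the combined stop word set
    return " ".join(w for w in text.lower().split() if w not in all_stopwords)

print(remove_custom_stopwords('This is a sample iPhone camera review'))  # sample review
```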

Whitespace: In text analytics, extra whitespace (space, tab, carriage return, line feed) often gets identified as a word. This anomaly is avoided with a basic programming step. See Listing 5-11.

# Function to remove whitespace
def remove_whitespace(text):
    return " ".join(text.split())

text = 'This is a sample English sentence, \n with whitespace and numbers 1234!'
print 'Removed whitespace: ', remove_whitespace(text)
#----output----
Removed whitespace: This is a sample English sentence, with whitespace and numbers 1234!

Listing 5-11.

Example code for removing whitespace from text

Part of Speech (PoS) Tagging

PoS tagging is the process of assigning language-specific parts of speech, such as nouns, verbs, adjectives, and adverbs, to each word in the given text.

NLTK supports multiple PoS tagging models; the default tagger is maxent_treebank_pos_tagger, which uses the Penn Treebank corpus. The Penn Treebank has 36 possible part-of-speech tags. A sentence (S) is represented by the parser as a tree with three children: a noun phrase (NP), a verbal phrase (VP), and the full stop (.); the root of the tree is S. See Table 5-3 and Listings 5-12 and 5-13.

Table 5-3.

NLTK PoS taggers

PoS Tagger | Short Description
maxent_treebank_pos_tagger | Based on Maximum Entropy (ME) classification principles, trained on the Wall Street Journal subset of the Penn Treebank corpus.
BrillTagger | Brill's transformational rule-based tagger.
CRFTagger | Conditional Random Fields.
HiddenMarkovModelTagger | Hidden Markov Models (HMMs), largely used to assign the correct label sequence to sequential data or to assess the probability of a given label and data sequence.
HunposTagger | A module for interfacing with the HunPos open source POS tagger.
PerceptronTagger | Based on the averaged perceptron technique proposed by Matthew Honnibal.
SennaTagger | Semantic/syntactic extraction using a neural network architecture.
SequentialBackoffTagger | Classes for tagging sentences sequentially, left to right.
StanfordPOSTagger | Researched and developed at Stanford University.
TnT | Implementation of 'TnT - A Statistical Part of Speech Tagger' by Thorsten Brants.

from nltk import chunk

tagged_sent = nltk.pos_tag(nltk.word_tokenize('This is a sample English sentence'))
print tagged_sent

tree = chunk.ne_chunk(tagged_sent)
tree.draw() # this will draw the sentence tree
#----output----
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('sample', 'JJ'), ('English', 'JJ'), ('sentence', 'NN')]

Listing 5-12.

Example code for PoS, the sentence, and visualize sentence tree


# To use PerceptronTagger
from nltk.tag.perceptron import PerceptronTagger

PT = PerceptronTagger()
print PT.tag('This is a sample English sentence'.split())
#----output----
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('sample', 'JJ'), ('English', 'JJ'), ('sentence', 'NN')]

# To get help about tags
nltk.help.upenn_tagset('NNP')
#----output----
NNP: noun, proper, singular

Listing 5-13.

Example code for using perceptron tagger and getting help on tags

Stemming

Stemming is the process of transforming a word to its root form; it uses an algorithm that removes common word endings from English words, such as “ly,” “es,” “ed,” and “s.” For example, in a given analysis you may want to treat “carefully,” “cared,” “cares,” and “caringly” as “care” rather than as separate words. The three most widely used stemming algorithms are listed in Figure 5-3. See Listing 5-14.


Figure 5-3.

Most popular NLTK stemmers

from nltk import PorterStemmer, LancasterStemmer, SnowballStemmer

# Function to apply stemming to a list of words
def words_stemmer(words, type="PorterStemmer", lang="english", encoding="utf8"):
    supported_stemmers = ["PorterStemmer","LancasterStemmer","SnowballStemmer"]
    if type is False or type not in supported_stemmers:
        return words
    else:
        stem_words = []
        if type == "PorterStemmer":
            stemmer = PorterStemmer()
            for word in words:
                stem_words.append(stemmer.stem(word).encode(encoding))
        if type == "LancasterStemmer":
            stemmer = LancasterStemmer()
            for word in words:
                stem_words.append(stemmer.stem(word).encode(encoding))
        if type == "SnowballStemmer":
            stemmer = SnowballStemmer(lang)
            for word in words:
                stem_words.append(stemmer.stem(word).encode(encoding))
        return " ".join(stem_words)

words = 'caring cares cared caringly carefully'
print "Original: ", words
print "Porter: ", words_stemmer(nltk.word_tokenize(words), "PorterStemmer")
print "Lancaster: ", words_stemmer(nltk.word_tokenize(words), "LancasterStemmer")
print "Snowball: ", words_stemmer(nltk.word_tokenize(words), "SnowballStemmer")
#----output----
Original: caring cares cared caringly carefully
Porter: care care care caringli care
Lancaster: car car car car car
Snowball: care care care care care

Listing 5-14.

Example code for stemming

Lemmatization

Lemmatization is the process of transforming a word to its dictionary base form. For this you can use WordNet, a large lexical database of English words that are linked together by their semantic relationships. It works like a thesaurus, grouping words together based on their meanings. See Listing 5-15.

from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

# Function to apply lemmatization to a list of words
def words_lemmatizer(text, encoding="utf8"):
    words = nltk.word_tokenize(text)
    lemma_words = []
    wl = WordNetLemmatizer()
    for word in words:
        pos = find_pos(word)
        lemma_words.append(wl.lemmatize(word, pos).encode(encoding))
    return " ".join(lemma_words)

# Function to find part of speech tag for a word
def find_pos(word):
    # Part of Speech constants
    # ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
    # You can learn more about these at http://wordnet.princeton.edu/wordnet/man/wndb.5WN.html#sect3
    # You can learn more about all the penn tree tags at https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
    pos = nltk.pos_tag(nltk.word_tokenize(word))[0][1]
    # Adjective tags - 'JJ', 'JJR', 'JJS'
    if pos.lower()[0] == 'j':
        return 'a'
    # Adverb tags - 'RB', 'RBR', 'RBS'
    elif pos.lower()[0] == 'r':
        return 'r'
    # Verb tags - 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'
    elif pos.lower()[0] == 'v':
        return 'v'
    # Noun tags - 'NN', 'NNS', 'NNP', 'NNPS'
    else:
        return 'n'

print "Lemmatized: ", words_lemmatizer(words)
#----output----
Lemmatized: care care care caringly carefully

In the above case, ‘caringly’ and ‘carefully’ are inflected forms of “care,” but since they are entry words listed in the WordNet dictionary they are retained in their actual form.

Listing 5-15.

Example code for lemmatization

NLTK’s English WordNet includes approximately 155,287 words and 117,000 synonym sets. For a given word, WordNet provides the definition, examples, synonyms (groups of nouns, adjectives, and verbs that are similar), antonyms (words opposite in meaning), etc. See Listing 5-16.

from nltk.corpus import wordnet

syns = wordnet.synsets("good")
print "Definition: ", syns[0].definition()
print "Example: ", syns[0].examples()

synonyms = []
antonyms = []

# Print synonyms and antonyms (words with the opposite meaning)
for syn in wordnet.synsets("good"):
    for l in syn.lemmas():
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())

print "synonyms: \n", set(synonyms)
print "antonyms: \n", set(antonyms)
#----output----
Definition: benefit
Example: [u'for your own good', u"what's the good of worrying?"]
synonyms:
set([u'beneficial', u'right', u'secure', u'just', u'unspoilt', u'respectable', u'good', u'goodness', u'dear', u'salutary', u'ripe', u'expert', u'skillful', u'in_force', u'proficient', u'unspoiled', u'dependable', u'soundly', u'honorable', u'full', u'undecomposed', u'safe', u'adept', u'upright', u'trade_good', u'sound', u'in_effect', u'practiced', u'effective', u'commodity', u'estimable', u'well', u'honest', u'near', u'skilful', u'thoroughly', u'serious'])
antonyms:
set([u'bad', u'badness', u'ill', u'evil', u'evilness'])

Listing 5-16.

Example code for wordnet

N-grams

One of the important concepts in text mining is n-grams: sets of co-occurring or continuous sequences of n items drawn from a larger sequence of text. The items can be words, letters, or syllables. Let’s consider a sample sentence and try to extract n-grams for different values of n. See Listing 5-17.

from nltk.util import ngrams
from collections import Counter

# Function to extract n-grams from text
def get_ngrams(text, n):
    n_grams = ngrams(nltk.word_tokenize(text), n)
    return [' '.join(grams) for grams in n_grams]

text = 'This is a sample English sentence'
print "1-gram: ", get_ngrams(text, 1)
print "2-gram: ", get_ngrams(text, 2)
print "3-gram: ", get_ngrams(text, 3)
print "4-gram: ", get_ngrams(text, 4)
#----output----
1-gram: ['This', 'is', 'a', 'sample', 'English', 'sentence']
2-gram: ['This is', 'is a', 'a sample', 'sample English', 'English sentence']
3-gram: ['This is a', 'is a sample', 'a sample English', 'sample English sentence']
4-gram: ['This is a sample', 'is a sample English', 'a sample English sentence']

Listing 5-17.

Example code for extracting n-grams from sentence

Note

A 1-gram is also called a unigram; 2-grams and 3-grams are called bigrams and trigrams, respectively.
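Since the items need not be words, here is a minimal character-level n-gram sketch in Python 3 (char_ngrams is a hypothetical helper written for illustration, not part of NLTK):

```python
# Character-level n-grams: slide a window of size n over the string
def char_ngrams(text, n):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams('sample', 3))  # ['sam', 'amp', 'mpl', 'ple']
```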

The n-gram technique is relatively simple, and simply increasing the value of n gives more context. It is widely used in probabilistic language models for predicting the next item in a sequence; for example, search engines use this technique to predict/recommend the next character or word as users type. See Listing 5-18.

text = 'Statistics skills, and programming skills are equally important for analytics. Statistics skills, and domain knowledge are important for analytics'

# remove punctuations
text = remove_punctuations(text)

# Extracting bigrams
result = get_ngrams(text,2)

# Counting bigrams
result_count = Counter(result)

# Converting the result to a data frame
import pandas as pd
df = pd.DataFrame.from_dict(result_count, orient='index')
df = df.rename(columns={'index':'words', 0:'frequency'}) # Renaming index and column name
print df
#----output----
frequency
are equally 1
domain knowledge 1
skills are 1
knowledge are 1
programming skills 1
are important 1
skills and 2
for analytics 2
and domain 1
important for 2
and programming 1
Statistics skills 2
equally important 1
analytics Statistics 1

Listing 5-18.

Example code for extracting 2-grams from sentence and store it in a dataframe

Bag of Words (BoW)

Texts have to be represented as numbers before any algorithm can be applied. Bag of words is a method in which you count the occurrences of words in a document without giving importance to grammar or word order. This can be achieved by creating a Term Document Matrix (TDM): a matrix with terms as the rows, document names as the columns, and word frequency counts as the cells. Let’s learn about creating a TDM through an example: consider three text documents with some text in them. scikit-learn provides good functions under feature_extraction.text to convert a collection of text documents to a matrix of word counts. See Listing 5-19 and Figure 5-4.


Figure 5-4.

Document Term Matrix

import os
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Function to create a dictionary with key as file names and values as text for all files in a given folder
def CorpusFromDir(dir_path):
    result = dict(docs = [open(os.path.join(dir_path,f)).read() for f in os.listdir(dir_path)],
                  ColNames = map(lambda x: x, os.listdir(dir_path)))
    return result

docs = CorpusFromDir('Data/')

# Initialize
vectorizer = CountVectorizer()
doc_vec = vectorizer.fit_transform(docs.get('docs'))

#create dataFrame
df = pd.DataFrame(doc_vec.toarray().transpose(), index = vectorizer.get_feature_names())

# Change column headers to be file names
df.columns = docs.get('ColNames')
print df
#----output----
Doc_1.txt Doc_2.txt Doc_3.txt
analytics 1 1 0
and 1 1 1
are 1 1 0
books 0 0 1
domain 0 1 0
equally 1 0 0
for 1 1 0
important 1 1 0
knowledge 0 1 0
like 0 0 1
programming 1 0 0
reading 0 0 1
skills 2 1 0
statistics 1 1 0
travelling 0 0 1

Listing 5-19.

Creating document term matrix from corpus of sample documents

Note

The Document Term Matrix (DTM) is the transpose of the Term Document Matrix. In a DTM the rows are the document names and the column headers are the terms. Both are matrix formats useful for carrying out analysis; however, TDM is commonly used because the number of terms tends to be much larger than the document count, and in that case having more rows is better than having a large number of columns.
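The TDM/DTM relationship is just a matrix transpose, which can be sketched with a toy pandas frame (the terms, file names, and counts here are made up for illustration):

```python
import pandas as pd

# Toy Term Document Matrix: terms as rows, documents as columns
tdm = pd.DataFrame({'Doc_1.txt': [1, 2], 'Doc_2.txt': [1, 1]},
                   index=['analytics', 'skills'])

# The Document Term Matrix is simply its transpose
dtm = tdm.T
print(dtm)
```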

Term Frequency-Inverse Document Frequency (TF-IDF)

In the area of information retrieval, TF-IDF is a good statistical measure that reflects the relevance of a term to a document in a collection of documents or corpus. Let’s break TF-IDF down and apply an example to understand it better. See Listing 5-20.

Term frequency will tell you how frequently a given term appears.

$$ \mathrm{TF}(\mathrm{term}) = \frac{\text{Number of times the term appears in a document}}{\text{Total number of terms in the document}} $$

For example, consider a document containing 100 words wherein the word ‘ML’ appears 3 times, then TF (ML) = 3 / 100 =0.03

Document frequency will tell you how common a term is across the collection of documents.

$$ \mathrm{DF}(\mathrm{term}) = \frac{d\ (\text{number of documents containing a given term})}{D\ (\text{size of the collection of documents})} $$

Assume we have 10 million documents and the word ML appears in one thousand of these, then DF (ML) = 1000/10,000,000 = 0.0001

To normalize let’s take a log (d/D), that is, log (0.0001) = -4

Quite often D > d, so log(d/D) gives a negative value, as seen in the above example. To solve this problem, let’s invert the ratio inside the log expression; this is known as inverse document frequency (IDF). Essentially we are compressing the scale of values so that very large and very small quantities can be compared smoothly.

$$ \mathrm{IDF}(\mathrm{term}) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents with a given term in it}}\right) $$

Continuing with the above example, IDF(ML) = log(10,000,000 / 1,000) = 4

TF-IDF is the product of the two quantities; for the above example, TF-IDF (ML) = 0.03 * 4 = 0.12.
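The worked example above can be checked in Python 3 with a few lines (using log base 10, matching the log(0.0001) = -4 step):

```python
import math

tf = 3 / 100                          # 'ML' appears 3 times in a 100-word document
idf = math.log10(10_000_000 / 1_000)  # 10 million documents, 1,000 contain 'ML'
tf_idf = tf * idf
print(tf, idf, tf_idf)  # 0.03 4.0 0.12
```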

scikit-learn provides the TfidfVectorizer function to calculate TF-IDF for text; note that by default it normalizes the term vector using L2 normalization, and IDF is smoothed by adding one to the document frequency to prevent zero divisions.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
doc_vec = vectorizer.fit_transform(docs.get('docs'))

#create dataFrame
df = pd.DataFrame(doc_vec.toarray().transpose(), index = vectorizer.get_feature_names())

# Change column headers to be file names
df.columns = docs.get('ColNames')
print df
#----output----
Doc_1.txt Doc_2.txt Doc_3.txt
analytics 0.276703 0.315269 0.000000
and 0.214884 0.244835 0.283217
are 0.276703 0.315269 0.000000
books 0.000000 0.000000 0.479528
domain 0.000000 0.414541 0.000000
equally 0.363831 0.000000 0.000000
for 0.276703 0.315269 0.000000
important 0.276703 0.315269 0.000000
knowledge 0.000000 0.414541 0.000000
like 0.000000 0.000000 0.479528
programming 0.363831 0.000000 0.000000
reading 0.000000 0.000000 0.479528
skills 0.553405 0.315269 0.000000
statistics 0.276703 0.315269 0.000000
travelling 0.000000 0.000000 0.479528

Listing 5-20.

Create document term matrix with TF-IDF

Data Exploration (Text)

In this stage the corpus is explored to understand the common keywords, the content, relationships between terms, and the level of noise present. This can be achieved by computing basic statistics and using visualization techniques such as word frequency counts, word co-occurrence, or correlation plots, which help us discover any hidden patterns.

Frequency Chart

This visualization presents a bar chart in which each bar’s length corresponds to the frequency with which a particular word occurs. Let’s plot a frequency chart for the Doc_1.txt file. See Listing 5-21.

import numpy as np
import matplotlib.pyplot as plt

# Sort the first document's term weights in descending order
freq = df.iloc[:, 0].sort_values(ascending=False)
words = freq.index
pos = np.arange(len(words))
width = 1.0

ax = plt.axes(frameon=True)
ax.set_xticks(pos)
ax.set_xticklabels(words, rotation='vertical', fontsize=9)
ax.set_title('Word Frequency Chart')
ax.set_xlabel('Words')
ax.set_ylabel('Frequency')
plt.bar(pos, freq, width, color='b')
plt.show()

#----output----

Listing 5-21.

Example code for frequency chart

A434293_1_En_5_Figb_HTML

Word Cloud

This is a visual representation of text data, which is helpful for getting a high-level understanding of the important keywords in the data in terms of their occurrence. The ‘wordcloud’ package can be used to draw words whose font size reflects their frequency. See Listing 5-22.

from wordcloud import WordCloud

# Read the whole text.

text = open('Data/Text_Files/Doc_1.txt').read()

# Generate a word cloud image

wordcloud = WordCloud().generate(text)

# Display the generated image:

# the matplotlib way:

import matplotlib.pyplot as plt

plt.imshow(wordcloud.recolor(random_state=2017))

plt.title('Most Frequent Words')

plt.axis("off")

plt.show()

#----output----

Listing 5-22.

Example code for wordcloud

A434293_1_En_5_Figc_HTML

From the above chart we can see that ‘skills’ appears the most frequently.

Lexical Dispersion Plot

This plot is helpful to determine the location of a word in a sequence of text sentences. On the x-axis you’ll have word offset numbers and on the y-axis each row is a representation of the entire text and the marker indicates an instance of the word of interest. See Listing 5-23.

from nltk import word_tokenize
import matplotlib.pyplot as plt

def dispersion_plot(text, words):
    words_token = word_tokenize(text)
    # (x, y) pairs: x = token offset, y = index of the matched word of interest
    points = [(x, y) for x in range(len(words_token))
              for y in range(len(words)) if words_token[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "rx", scalex=.1)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

text = 'statistics skills, and programming skills are equally important for analytics. statistics skills, and domain knowledge are important for analytics'

dispersion_plot(text, ['statistics', 'skills', 'and', 'important'])

#----output----

Listing 5-23.

Example code for lexical dispersion plot

A434293_1_En_5_Figd_HTML

Co-occurrence Matrix

Calculating the co-occurrence between words in a sequence of text produces a matrix that helps explain the relationships between words. A co-occurrence matrix tells us how many times each word has co-occurred with every other word. Plotting this matrix as a heat map is a powerful visual tool for spotting relationships between words efficiently. See Listing 5-24.

import statsmodels.api as sm
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

# default unigram model
count_model = CountVectorizer(ngram_range=(1,1))
docs_unigram = count_model.fit_transform(docs.get('docs'))

# co-occurrence matrix in sparse csr format
docs_unigram_matrix = (docs_unigram.T * docs_unigram)

# the diagonal holds each word's total count; keep its reciprocal for normalization
docs_unigram_matrix_diags = sp.diags(1./docs_unigram_matrix.diagonal())

# fill same-word co-occurrence with 0
docs_unigram_matrix.setdiag(0)

# normalized co-occurrence matrix
docs_unigram_matrix_norm = docs_unigram_matrix_diags * docs_unigram_matrix

# Convert to a dataframe
df = pd.DataFrame(docs_unigram_matrix_norm.todense(), index = count_model.get_feature_names())
df.columns = count_model.get_feature_names()

# Plot
sm.graphics.plot_corr(df, title='Co-occurrence Matrix', xnames=list(df.index))
plt.show()

#----output----

Listing 5-24.

Example code for co-occurrence matrix

A434293_1_En_5_Fige_HTML

Model Building

As you might be familiar with by now, model building is the process of understanding and establishing relationships between variables. So far you have learned how to extract text content from various sources, preprocess to remove noise, and perform exploratory analysis to get basic understanding/statistics about the text data in hand. Now you’ll learn to apply machine learning techniques on the processed data to build models.

Text Similarity

A measure of how similar two objects are is described through a distance measure over dimensions represented by features of the objects (here, text). A smaller distance indicates a higher degree of similarity and vice versa. Note that similarity is highly subjective and dependent on the domain or application. For text similarity it is important to choose the right distance measure to get good results. Various distance measures are available, and the Euclidean metric, the straight-line distance between two points, is the most common. However, a significant amount of research in text mining has shown that cosine distance is better suited to text similarity.

Let’s look at a simple example to understand similarity better. Consider three documents containing certain simple text keywords and assume that the top two keywords are accident and New York. For the moment ignore other keywords and let’s calculate the similarity of document based on these two keywords frequency. See Table 5-4.

Table 5-4.

Sample document term matrix

Document #

Count of ‘Accident’

Count of ‘New York’

1

2

8

2

3

7

3

7

3

Plotting the document word-vector points on a two-dimensional chart is depicted in Figure 5-5. Notice that cosine similarity captures the angle between the two data points, whereas Euclidean distance is the square root of the sum of squared differences between the points. The cosine similarity equation results in a value between 0 and 1, and a smaller angle produces a larger cosine value, indicating higher similarity; two identical documents would have a cosine similarity of 1 and a Euclidean distance of 0. Let’s put the values into the formulas to find the similarity between documents 1 and 2.

A434293_1_En_5_Fig5_HTML

Figure 5-5. Euclidean vs. Cosine

Euclidean distance (doc1, doc2) = $$ \sqrt{{\left(2-3\right)}^2+{\left(8-7\right)}^2}=\sqrt{2}=1.41 $$

Cosine (doc1, doc2) = $$ \frac{62}{8.24\ *\ 7.61}\kern0.5em =\kern0.5em 0.98 $$, where

doc1 = (2,8)

doc2 = (3,7)

doc1 . doc2 = (2*3 + 8*7) = (6 + 56) = 62

||doc1|| = $$ \sqrt{\left(2*2\right)+\left(8*8\right)}=8.24 $$

||doc2|| = $$ \sqrt{\left(3*3\right)+\left(7*7\right)}=7.61 $$

Similarly let’s find the similarity between documents 1 and 3. See Figure 5-5.

Euclidean distance (doc1, doc3) = $$ \sqrt{{\left(2-7\right)}^2+{\left(8-3\right)}^2}=\sqrt{50}=7.07 $$

Cosine (doc1, doc3)= $$ \frac{38}{8.24\ *\ 7.61}\kern0.5em =\kern0.5em 0.60 $$

According to the cosine measure, documents 1 and 2 are 98% similar, which could mean that both documents are talking about ‘New York’. Document 3, by contrast, can be assumed to focus more on ‘Accident’; however, ‘New York’ is mentioned a couple of times, resulting in a similarity of 60% between documents 1 and 3.
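The hand calculations above can be verified with NumPy (a minimal sketch; the exact cosine values differ slightly from the rounded intermediate norms used above):

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

doc1, doc2, doc3 = [2, 8], [3, 7], [7, 3]

print(cosine(doc1, doc2))                       # documents 1 and 2: ~98% similar
print(cosine(doc1, doc3))                       # documents 1 and 3: ~60% similar
print(np.linalg.norm(np.subtract(doc1, doc2)))  # Euclidean distance ~1.41
```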

Let’s now compute cosine similarity on the TF-IDF document term matrix created in Listing 5-20. See Listing 5-25.

from sklearn.metrics.pairwise import cosine_similarity

print "Similarity b/w doc 1 & 2: ", cosine_similarity(df['Doc_1.txt'], df['Doc_2.txt'])

print "Similarity b/w doc 1 & 3: ", cosine_similarity(df['Doc_1.txt'], df['Doc_3.txt'])

print "Similarity b/w doc 2 & 3: ", cosine_similarity(df['Doc_2.txt'], df['Doc_3.txt'])

#----output----

Similarity b/w doc 1 & 2: [[ 0.76980036]]

Similarity b/w doc 1 & 3: [[ 0.12909944]]

Similarity b/w doc 2 & 3: [[ 0.1490712]]

Listing 5-25.

Example code for calculating cosine similarity for documents

Text Clustering

As an example we’ll be using the 20 newsgroups dataset consisting of 18,000+ newsgroup posts on 20 topics. You can learn more about the dataset at http://qwone.com/~jason/20Newsgroups/ . Let’s load the data and check the topic names. See Listing 5-26.

from sklearn.datasets import fetch_20newsgroups

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.preprocessing import Normalizer

from sklearn import metrics

from sklearn.cluster import KMeans, MiniBatchKMeans

import numpy as np

# load data and print topic names

newsgroups_train = fetch_20newsgroups(subset='train')

print(list(newsgroups_train.target_names))

#----output----

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

Listing 5-26.

Example code for text clustering

To keep it simple, let’s filter only three topics. Assume that we do not know the topics, so let’s run a clustering algorithm and examine the keywords of each cluster.

categories = ['alt.atheism', 'comp.graphics', 'rec.motorcycles']

dataset = fetch_20newsgroups(subset='all', categories=categories, shuffle=True, random_state=2017)

print("%d documents" % len(dataset.data))

print("%d categories" % len(dataset.target_names))

labels = dataset.target

print("Extracting features from the dataset using a sparse vectorizer")

vectorizer = TfidfVectorizer(stop_words='english')

X = vectorizer.fit_transform(dataset.data)

print("n_samples: %d, n_features: %d" % X.shape)

#----output----

2768 documents

3 categories

Extracting features from the dataset using a sparse vectorizer

n_samples: 2768, n_features: 35311

Latent Semantic Analysis (LSA)

LSA is a mathematical method that tries to bring out latent relationships within a collection of documents. Rather than looking at each document isolated from the others, it looks at all the documents as a whole and the terms within them to identify relationships. Let’s perform LSA by running SVD on the data to reduce the dimensionality.

SVD of matrix A: A = U * Σ * Vᵀ, where

r = rank of matrix A

U = column-orthonormal m * r matrix

Σ = diagonal r * r matrix with the singular values sorted in descending order

V = column-orthonormal n * r matrix (so Vᵀ is r * n)

In our case we have 3 topics, 2768 documents, and a 35,311-word vocabulary.

* Original matrix storage = 2768 * 35311 ≈ 10^8 entries

* Truncated SVD (r = 3) storage = 3*2768 + 3 + 3*35311 ≈ 10^5 entries

The resulting SVD takes roughly 850 times less space than the original matrix.
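The storage arithmetic can be checked directly (a sketch; with these dimensions the exact ratio works out to about 855):

```python
m, n, r = 2768, 35311, 3          # documents, vocabulary size, retained rank

full = m * n                      # entries in the dense document-term matrix
factored = r * m + r + r * n      # U (m x r) + singular values (r) + V^T (r x n)

print(full, factored, full // factored)   # 97740848 114240 855
```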

Note

“Latent Semantic Analysis (LSA)” and “Latent Semantic Indexing (LSI)” are the same thing; the latter name is sometimes used when referring specifically to indexing a collection of documents for search (“Information Retrieval”). See Figure 5-6 and Listing 5-27.

A434293_1_En_5_Fig6_HTML

Figure 5-6.

Singular Value Decomposition

from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

# Let's reduce the dimensionality to 2000
svd = TruncatedSVD(2000)
lsa = make_pipeline(svd, Normalizer(copy=False))

X = lsa.fit_transform(X)

explained_variance = svd.explained_variance_ratio_.sum()

print("Explained variance of the SVD step: {}%".format(int(explained_variance * 100)))

#----output----

Explained variance of the SVD step: 95%

Listing 5-27.

Example code for LSA through SVD

Let’s run k-means clustering on the SVD output as shown in Listing 5-28.

from __future__ import print_function

km = KMeans(n_clusters=3, init='k-means++', max_iter=100, n_init=1)

# Scikit learn provides MiniBatchKMeans to run k-means in batch mode suitable for a very large corpus

# km = MiniBatchKMeans(n_clusters=5, init='k-means++', n_init=1, init_size=1000, batch_size=1000)

print("Clustering sparse data with %s" % km)

km.fit(X)

print("Top terms per cluster:")

original_space_centroids = svd.inverse_transform(km.cluster_centers_)

order_centroids = original_space_centroids.argsort()[:, ::-1]

terms = vectorizer.get_feature_names()

for i in range(3):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()

#----output----

Top terms per cluster:

Cluster 0: edu graphics university god subject lines organization com posting uk

Cluster 1: com bike edu dod ca writes article sun like organization

Cluster 2: keith sgi livesey caltech com solntze wpd jon edu sandvik

Listing 5-28.

k-means clustering on SVD dataset

Topic Modeling

Topic modeling algorithms enable you to discover hidden topical patterns or thematic structure in a large collection of documents. The most popular topic modeling techniques are Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF).

Latent Dirichlet Allocation (LDA)

LDA was presented by David Blei, Andrew Ng, and Michael I. Jordan in 2003 as a graphical model. See Figure 5-7.

A434293_1_En_5_Fig7_HTML

Figure 5-7.

LDA graph model

LDA is given by $$ P\left(d,\ w\right) = P(d)*\ P\left({\uptheta}^{(d)}\Big|\alpha \right)*{\displaystyle \sum_z}P\left({\varPhi}^{(z)}\Big|\beta \right)*P\left(w\Big|z,{\varPhi}^{(z)}\ \right)*P\left(z\Big|{\uptheta}^{(d)}\right) $$

where Φ(z) = word distribution for topic z,

α = Dirichlet prior on the per-document topic distribution,

β = Dirichlet prior on the per-topic word distribution,

θ(d) = topic distribution for document d

LDA’s objective is to find, for each document, a mixture of topics, and for each topic, a distribution over words, such that coherent topics emerge. LDA defines each topic as a bag of words and is carried out in the three steps described below.

Step 1: Initialize k clusters and assign each word in the document to one of the k topics.

Step 2: Reassign each word to a new topic based on a) what proportion of the words in its document currently belong to that topic, and b) how widespread the topic is across all documents.

Step 3: Repeat step 2 until coherent topics result. See Figure 5-8 and Listing 5-29.
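To make the three steps concrete, here is a toy collapsed Gibbs sampler over a made-up corpus (an illustrative sketch only: the corpus, hyperparameters, and variable names are invented, and scikit-learn's LatentDirichletAllocation used in Listing 5-29 relies on variational inference rather than Gibbs sampling):

```python
import random
random.seed(0)

docs = [["apple", "banana", "apple", "fruit"],
        ["engine", "wheel", "engine", "car"],
        ["apple", "car", "fruit", "wheel"]]
K, alpha, beta = 2, 0.1, 0.01
vocab = sorted({w for d in docs for w in d})
V = len(vocab)

# Step 1: randomly assign each word to one of the K topics
z = [[random.randrange(K) for _ in d] for d in docs]
ndk = [[0] * K for _ in docs]                     # topic counts per document
nkw = [{w: 0 for w in vocab} for _ in range(K)]   # word counts per topic
nk = [0] * K                                      # total words per topic
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t = z[d][i]
        ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1

# Steps 2 and 3: repeatedly reassign each word with probability proportional to
# (topic share in its document) * (word share in the topic)
for _ in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
            weights = [(ndk[d][k] + alpha) * (nkw[k][w] + beta) / (nk[k] + V * beta)
                       for k in range(K)]
            t = random.choices(range(K), weights)[0]
            z[d][i] = t
            ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1

print([max(nkw[k], key=nkw[k].get) for k in range(K)])  # most probable word per topic
```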

A434293_1_En_5_Fig8_HTML

Figure 5-8.

Latent Dirichlet Allocation (LDA)

from sklearn.decomposition import LatentDirichletAllocation

# continuing with the 20 newsgroup dataset and 3 topics

total_topics = 3

lda = LatentDirichletAllocation(n_topics=total_topics,
                                max_iter=100,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=2017)

lda.fit(X)

feature_names = np.array(vectorizer.get_feature_names())

for topic_idx, topic in enumerate(lda.components_):
    print("Topic #%d:" % topic_idx)
    print(" ".join([feature_names[i] for i in topic.argsort()[:-20 - 1:-1]]))

#----output----

Topic #0:

edu com writes subject lines organization article posting university nntp host don like god uk ca just bike know graphics

Topic #1:

anl elliptical maier michael_maier qmgate separations imagesetter 5298 unscene appreshed linotronic l300 iici amnesia glued veiw halftone 708 252 dot

Topic #2:

hl7204 eehp22 raoul vrrend386 qedbbs choung qed daruwala ims kkt briarcliff kiat philabs col op_rows op_cols keeve 9327 lakewood gans

Listing 5-29.

Example code for LDA

Non-negative Matrix Factorization

NMF is a decomposition method for multivariate data, given by V ≈ W H, where the matrix V is approximated by the product of the matrices W and H; W contains the weights of the discovered features, and H is the coefficient matrix, with each row corresponding to a feature. All three matrices have no negative elements. See Listing 5-30.
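Before applying NMF to the newsgroup features, the factorization itself can be sketched on a tiny random non-negative matrix (the data here is made up purely for illustration):

```python
import numpy as np
from sklearn.decomposition import NMF

V = np.random.RandomState(2017).rand(6, 4)   # non-negative 6 x 4 data matrix
nmf = NMF(n_components=2, init='random', random_state=2017, max_iter=500)
W = nmf.fit_transform(V)                     # 6 x 2 weight matrix
H = nmf.components_                          # 2 x 4 feature (coefficient) matrix

print(W.shape, H.shape)                      # (6, 2) (2, 4)
print(float(np.abs(V - W.dot(H)).max()))     # elementwise reconstruction error
```

Note that W and H contain only non-negative entries, which is what makes the factors interpretable as additive parts (topics).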

from sklearn.decomposition import NMF

nmf = NMF(n_components=total_topics, random_state=2017, alpha=.1, l1_ratio=.5)

nmf.fit(X)

for topic_idx, topic in enumerate(nmf.components_):
    print("Topic #%d:" % topic_idx)
    print(" ".join([feature_names[i] for i in topic.argsort()[:-20 - 1:-1]]))

#----output----

Topic #0:

edu com god writes article don subject lines organization just university bike people posting like know uk ca think host

Topic #1:

sgi livesey keith solntze wpd jon caltech morality schneider cco moral com allan edu objective political cruel atheists gap writes

Topic #2:

sun east green ed egreen com cruncher microsystems ninjaite 8302 460 rtp 0111 nc 919 grateful drinking pixel biker showed

Listing 5-30.

Example code for Non-negative matrix factorization

Text Classification

The ability to represent text features as numbers opens up the opportunity to run classification machine learning algorithms. Let’s use a subset of the 20 newsgroups data to build a classification model and assess its accuracy. See Listing 5-31.

categories = ['alt.atheism', 'comp.graphics', 'rec.motorcycles', 'sci.space', 'talk.politics.guns']

newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=2017, remove=('headers', 'footers', 'quotes'))

newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=2017, remove=('headers', 'footers', 'quotes'))

y_train = newsgroups_train.target

y_test = newsgroups_test.target

vectorizer = TfidfVectorizer(sublinear_tf=True, smooth_idf = True, max_df=0.5, ngram_range=(1, 2), stop_words='english')

X_train = vectorizer.fit_transform(newsgroups_train.data)

X_test = vectorizer.transform(newsgroups_test.data)

print("Train Dataset")

print("%d documents" % len(newsgroups_train.data))

print("%d categories" % len(newsgroups_train.target_names))

print("n_samples: %d, n_features: %d" % X_train.shape)

print("Test Dataset")

print("%d documents" % len(newsgroups_test.data))

print("%d categories" % len(newsgroups_test.target_names))

print("n_samples: %d, n_features: %d" % X_test.shape)

#----output----

Train Dataset

2801 documents

5 categories

n_samples: 2801, n_features: 241036

Test Dataset

1864 documents

5 categories

n_samples: 1864, n_features: 241036

Listing 5-31.

Example code text classification on 20 news groups dataset

Let’s build a simple naïve Bayes classification model and assess the accuracy. Essentially we can replace naïve Bayes with any other classification algorithm or use an ensemble model to build an efficient model. See Listing 5-32.

from sklearn.naive_bayes import MultinomialNB

from sklearn import metrics

clf = MultinomialNB()

clf = clf.fit(X_train, y_train)

y_train_pred = clf.predict(X_train)

y_test_pred = clf.predict(X_test)

print 'Train accuracy_score: ', metrics.accuracy_score(y_train, y_train_pred)

print 'Test accuracy_score: ',metrics.accuracy_score(newsgroups_test.target, y_test_pred)

print "Train Metrics: ", metrics.classification_report(y_train, y_train_pred)

print "Test Metrics: ", metrics.classification_report(newsgroups_test.target, y_test_pred)

#----output----

Train accuracy_score: 0.976079971439

Test accuracy_score: 0.832081545064

Train Metrics: precision recall f1-score support

0 1.00 0.97 0.98 480

1 1.00 0.97 0.98 584

2 0.91 1.00 0.95 598

3 0.99 0.97 0.98 593

4 1.00 0.97 0.99 546

avg / total 0.98 0.98 0.98 2801

Test Metrics: precision recall f1-score support

0 0.91 0.62 0.74 319

1 0.90 0.90 0.90 389

2 0.81 0.90 0.86 398

3 0.80 0.84 0.82 394

4 0.78 0.86 0.82 364

avg / total 0.84 0.83 0.83 1864

Listing 5-32.

Example code text classification using Multinomial naïve Bayes

Sentiment Analysis

The procedure of discovering and classifying opinions expressed in a piece of text (such as comments or feedback) is called sentiment analysis. The intended output of this analysis is to determine whether the writer’s attitude toward a topic, product, service, etc., is neutral, positive, or negative. See Listing 5-33.

from nltk.sentiment.vader import SentimentIntensityAnalyzer

from nltk.sentiment.util import *

data = pd.read_csv('Data/customer_review.csv')

SIA = SentimentIntensityAnalyzer()

data['polarity_score']=data.Review.apply(lambda x:SIA.polarity_scores(x)['compound'])

data['neutral_score']=data.Review.apply(lambda x:SIA.polarity_scores(x)['neu'])

data['negative_score']=data.Review.apply(lambda x:SIA.polarity_scores(x)['neg'])

data['positive_score']=data.Review.apply(lambda x:SIA.polarity_scores(x)['pos'])

data['sentiment']=''

data.loc[data.polarity_score>0,'sentiment']='POSITIVE'

data.loc[data.polarity_score==0,'sentiment']='NEUTRAL'

data.loc[data.polarity_score<0,'sentiment']='NEGATIVE'

data.head()

data.sentiment.value_counts().plot(kind='bar',title="sentiment analysis")

plt.show()

#----output----

ID Review polarity_score

0 1 Excellent service my claim was dealt with very... 0.7346

1 2 Very sympathetically dealt within all aspects ... -0.8155

2 3 Having received yet another ludicrous quote fr... 0.9785

3 4 Very prompt and fair handling of claim. A mino... 0.1440

4 5 Very good and excellent value for money simple... 0.8610

neutral_score negative_score positive_score sentiment

0 0.618 0.000 0.382 POSITIVE

1 0.680 0.320 0.000 NEGATIVE

2 0.711 0.039 0.251 POSITIVE

3 0.651 0.135 0.214 POSITIVE

4 0.485 0.000 0.515 POSITIVE

Listing 5-33.

Example code for sentiment analysis

A434293_1_En_5_Figf_HTML

Deep Natural Language Processing (DNLP )

First, let me clarify that DNLP is not to be mistaken for deep learning NLP. A technique such as topic modeling is generally known as shallow NLP, where you try to extract knowledge from text through a semantic or syntactic analysis approach, that is, you try to form groups by retaining words that are similar and hold higher weight in a sentence/document. Shallow NLP is less noisy than n-grams; however, its key drawback is that it does not specify the role of items in the sentence. In contrast, DNLP focuses on a semantic approach: it detects relationships within sentences, which can further be represented as a complex construction such as subject:predicate:object (known as triples or triplets) extracted from syntactically parsed sentences to retain the context. Sentences are made up of any combination of actor, action, object, and named entities (persons, organizations, locations, dates, etc.). For example, consider the sentence "the flat tire was replaced by the driver." Here driver is the subject (actor), replaced is the predicate (action), and flat tire is the object. So the triple would be driver:replaced:tire, which captures the context of the sentence. Note that triples are only one of the widely used forms, and you can build similarly complex structures based on the domain or problem at hand.

For demonstration I’ll use the sopex package, which uses the Stanford CoreNLP tree parser. See Listing 5-34.

from chunker import PennTreebackChunker

from extractor import SOPExtractor

# Initialize chunker

chunker = PennTreebackChunker()

extractor = SOPExtractor(chunker)

# function to extract triples

def extract(sentence):
    sentence = sentence if sentence[-1] == '.' else sentence+'.'
    global extractor
    sop_triplet = extractor.extract(sentence)
    return sop_triplet

sentences = [

'The quick brown fox jumps over the lazy dog.',

'A rare black squirrel has become a regular visitor to a suburban garden',

'The driver did not change the flat tire',

"The driver crashed the bike white bumper"

]

#Loop over sentence and extract triples

for sentence in sentences:
    sop_triplet = extract(sentence)
    print sop_triplet.subject + ':' + sop_triplet.predicate + ':' + sop_triplet.object

#----output----

fox:jumps:dog

squirrel:become:visitor

driver:change:tire

driver:crashed:bumper

Listing 5-34.

Example code for Deep NLP

Word2Vec

A team led by Tomas Mikolov at Google created the Word2Vec (word to vector) model in 2013. It uses documents to train a neural network model that maximizes the conditional probability of a context given a word.

It utilizes two models: CBOW and skip-gram.

1. The continuous bag-of-words (CBOW) model predicts the current word from a window of surrounding context words; that is, given a set of context words, it predicts the missing word likely to appear in that context. CBOW is faster than skip-gram to train and gives better accuracy for frequently appearing words.

2. The continuous skip-gram model predicts the surrounding window of context words from the current word; that is, given a single word, it predicts the probability of other words likely to appear near it. Skip-gram is known to give good results for both frequent and rare words. Let’s look at an example sentence and create skip-grams for a window of 2 (refer to Figure 5-9). The word highlighted in yellow is the input word.
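The (input word, context word) training pairs for a given window can be generated with a short helper (a sketch; the helper name and example sentence are made up):

```python
def skipgram_pairs(tokens, window=2):
    """Return (input word, context word) pairs for a skip-gram window."""
    pairs = []
    for i, center in enumerate(tokens):
        # context positions within 'window' words of the center word
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs("the quick brown fox".split()))
```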

A434293_1_En_5_Fig9_HTML

Figure 5-9.

Skip-gram for window 2

You can download Google’s pre-trained word2vec model from the link below; it includes a vocabulary of 3 million words/phrases trained on about 100 billion words from a Google News dataset. See Listings 5-35 and 5-36.

URL: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit

import gensim

# Load Google's pre-trained Word2Vec model.

model = gensim.models.Word2Vec.load_word2vec_format('Data/GoogleNews-vectors-negative300.bin', binary=True)

model.most_similar(positive=['woman', 'king'], negative=['man'], topn=5)

#----output----

[(u'queen', 0.7118192911148071),

(u'monarch', 0.6189674139022827),

(u'princess', 0.5902431607246399),

(u'crown_prince', 0.5499460697174072),

(u'prince', 0.5377321243286133)]

model.most_similar(['girl', 'father'], ['boy'], topn=3)

#----output----

[(u'mother', 0.831214427947998),

(u'daughter', 0.8000643253326416),

(u'husband', 0.769158124923706)]

model.doesnt_match("breakfast cereal dinner lunch".split())

#----output----

'cereal'

You can train a word2vec model on your own data set. The key model parameters to be remembered are size, window, min_count and sg.

size: The dimensionality of the vectors; bigger values require more training data but can lead to more accurate models.

sg = 0 for CBOW model and 1 for skip-gram model

min_count: Ignore all words with total frequency lower than this.

window: The maximum distance between the current and predicted word within a sentence

Listing 5-35.

Example code for word2vec

sentences = [['cigarette','smoking','is','injurious', 'to', 'health'],['cigarette','smoking','causes','cancer'],['cigarette','are','not','to','be','sold','to','kids']]

# train word2vec on the two sentences

model = gensim.models.Word2Vec(sentences, min_count=1, sg=1, window = 3)

model.most_similar(positive=['cigarette', 'smoking'], negative=['kids'], topn=1)

#----output----

[('injurious', 0.16142114996910095)]

Listing 5-36.

Example code for training word2vec on your own dataset

Recommender Systems

Personalization of the user experience has been a high priority and has become the new mantra in consumer-focused industries. You might have observed e-commerce companies casting personalized ads suggesting what to buy, which news to read, which video to watch, where/what to eat, and who you might be interested in networking with (friends/professionals) on social media sites. Recommender systems are the core information filtering systems designed to predict user preferences and recommend the right items, creating a user-specific personalization experience. There are two types of recommendation systems: 1) content-based filtering and 2) collaborative filtering. See Figure 5-10.

A434293_1_En_5_Fig10_HTML

Figure 5-10.

Recommender systems

Content-Based Filtering

This type of system focuses on the similarity attributes of the items to give you recommendations. This is best understood with an example, so if a user has purchased items of a particular category then other similar items from the same category are recommended to the user (refer to Figure 5-10).

An item-based similarity recommendation algorithm can be represented as shown here:

$$ {\widehat{x}}_{k, m}=\frac{{\displaystyle \sum_{i_b} si{m}_i\left({i}_m,{i}_b\right)\left({x}_{k, b}\right)}}{{\displaystyle \sum_{i_b}\left| si{m}_i\left({i}_m,{i}_b\right)\right|}} $$

Collaborative Filtering (CF)

CF focuses on the similarity attributes of the users; that is, it finds people with similar tastes based on a similarity measure over a large group of users. There are two types of CF implementations in practice: memory-based and model-based.

The memory-based recommendation is mainly based on the similarity algorithm; the algorithm looks at items liked by similar people to create a ranked list of recommendations. You can then sort the ranked list to recommend the top n items to the user.
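The memory-based idea can be sketched on a toy rating matrix (the ratings, user indices, and helper names are invented for illustration; Listing 5-37 applies the same idea to the movie dataset):

```python
import numpy as np

# Toy ratings: rows = users, columns = items, 0 = not yet rated
R = np.array([[5., 4., 0., 1.],
              [4., 5., 1., 0.],
              [1., 0., 5., 4.]])

def cos_sim(u, v):
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

target = 0
# rank the other users by similarity to the target user
others = [u for u in range(R.shape[0]) if u != target]
neighbor = max(others, key=lambda u: cos_sim(R[target], R[u]))

# recommend the neighbor's highest-rated item the target has not rated yet
unrated = np.where(R[target] == 0)[0]
recommended = max(unrated, key=lambda i: R[neighbor, i])

print(neighbor, recommended)   # user 1 is most similar; item 2 is recommended
```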

The user-based similarity recommendation algorithm can be represented as:

$$ p{r}_{x, k}={m}_x+\frac{{\displaystyle \sum_{u_y\in {N}_x}\left({r}_{y, k}-{m}_y\right)\mathrm{sim}\left({u}_x,{u}_y\right)}}{{\displaystyle \sum_{u_y\in {N}_x}\left|\mathrm{sim}\left({u}_x,{u}_y\right)\right|}} $$

Let’s consider an example dataset of movie ratings (Figure 5-11) and apply item- and user-based recommendations to get a better understanding. See Listing 5-37.

A434293_1_En_5_Fig11_HTML

Figure 5-11.

Movie rating sample dataset

import numpy as np

import pandas as pd

df = pd.read_csv('Data/movie_rating.csv')

n_users = df.userID.unique().shape[0]

n_items = df.itemID.unique().shape[0]

print '\nNumber of users = ' + str(n_users) + ' | Number of movies = ' + str(n_items)

#----output----

Number of users = 7 | Number of movies = 6

# Create user-item similarity matrices

df_matrix = np.zeros((n_users, n_items))

for line in df.itertuples():
    df_matrix[line[1]-1, line[2]-1] = line[3]

from sklearn.metrics.pairwise import pairwise_distances

user_similarity = pairwise_distances(df_matrix, metric='cosine')

item_similarity = pairwise_distances(df_matrix.T, metric='cosine')

# Top 3 similar users for user id 7

print "Similar users for user id 7: \n", pd.DataFrame(user_similarity).loc[6,pd.DataFrame(user_similarity).loc[6,:] > 0].sort_values(ascending=False)[0:3]

#----output----

Similar users for user id 7:

3 8.000000

0 6.062178

5 5.873670

# Top 3 similar items for item id 6

print "Similar items for item id 6: \n", pd.DataFrame(item_similarity).loc[5,pd.DataFrame(item_similarity).loc[5,:] > 0].sort_values(ascending=False)[0:3]

#----output----

0 6.557439

2 5.522681

3 4.974937

Listing 5-37.

Example code for recommender system

Let’s implement the user-based and item-based prediction formulas as functions, apply them to predict ratings, and use root mean squared error (RMSE) to evaluate the model performance. See Listing 5-38.

# Function for item based rating prediction
def item_based_prediction(rating_matrix, similarity_matrix):
    return rating_matrix.dot(similarity_matrix) / np.array([np.abs(similarity_matrix).sum(axis=1)])

# Function for user based rating prediction
def user_based_prediction(rating_matrix, similarity_matrix):
    mean_user_rating = rating_matrix.mean(axis=1)
    ratings_diff = (rating_matrix - mean_user_rating[:, np.newaxis])
    return mean_user_rating[:, np.newaxis] + similarity_matrix.dot(ratings_diff) / np.array([np.abs(similarity_matrix).sum(axis=1)]).T

item_based_prediction = item_based_prediction(df_matrix, item_similarity)

user_based_prediction = user_based_prediction(df_matrix, user_similarity)

# Calculate the RMSE

from sklearn.metrics import mean_squared_error

from math import sqrt

def rmse(prediction, actual):

    prediction = prediction[actual.nonzero()].flatten()

    actual = actual[actual.nonzero()].flatten()

    return sqrt(mean_squared_error(prediction, actual))

print 'User-based CF RMSE: ' + str(rmse(user_based_prediction, df_matrix))

print 'Item-based CF RMSE: ' + str(rmse(item_based_prediction, df_matrix))

#----output----

User-based CF RMSE: 1.0705767849

Item-based CF RMSE: 1.37392288971

y_user_based = pd.DataFrame(user_based_prediction)

# Predictions for movies that the user 6 hasn't rated yet

predictions = y_user_based.loc[6,pd.DataFrame(df_matrix).loc[6,:] == 0]

top = predictions.sort_values(ascending=False).head(n=1)

recommendations = pd.DataFrame(data=top)

recommendations.columns = ['Predicted Rating']

print recommendations

#----output----

Predicted Rating

1 2.282415

y_item_based = pd.DataFrame(item_based_prediction)

# Predictions for movies that the user 6 hasn't rated yet

predictions = y_item_based.loc[6,pd.DataFrame(df_matrix).loc[6,:] == 0]

top = predictions.sort_values(ascending=False).head(n=1)

recommendations = pd.DataFrame(data=top)

recommendations.columns = ['Predicted Rating']

print recommendations

#----output----

Predicted Rating

5 2.262497

Listing 5-38.

Example code for recommender system & accuracy evaluation
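The rmse helper in Listing 5-38 scores only the observed (nonzero) entries of the rating matrix, so unrated cells do not inflate the error. A small sketch with made-up arrays, assuming the same masking logic:

```python
import numpy as np
from math import sqrt
from sklearn.metrics import mean_squared_error

def rmse(prediction, actual):
    # Evaluate only cells where an actual rating exists
    prediction = prediction[actual.nonzero()].flatten()
    actual = actual[actual.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, actual))

actual = np.array([[4.0, 0.0],
                   [0.0, 2.0]])    # zeros mean "not rated"
pred = np.array([[3.0, 1.0],
                 [5.0, 2.5]])

# Only the errors (4-3) and (2-2.5) are counted
print(round(rmse(pred, actual), 4))  # 0.7906
```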

Based on the user-based prediction, the recommended movie is Avengers Assemble (index 1); based on the item-based prediction, the recommended movie is Interstellar (index 5).

Model-based CF relies on matrix factorization (MF) techniques such as Singular Value Decomposition (SVD) and Non-negative Matrix Factorization (NMF). Let’s look at how to implement it using SVD. See Listing 5-39.

# calculate sparsity level

sparsity=round(1.0-len(df)/float(n_users*n_items),3)

print 'The sparsity level is ' + str(sparsity*100) + '%'

import scipy.sparse as sp

from scipy.sparse.linalg import svds

# Get SVD components from train matrix. Choose k.

u, s, vt = svds(df_matrix, k=5)

s_diag_matrix = np.diag(s)

X_pred = np.dot(np.dot(u, s_diag_matrix), vt)

print 'SVD-based CF RMSE: ' + str(rmse(X_pred, df_matrix))

#----output----

The sparsity level is 0.0%

SVD-based CF RMSE: 0.015742898995

Listing 5-39.

Example code for recommender system using SVD
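The choice of k in svds trades reconstruction error against compression: the rank-k SVD is the best rank-k approximation of the matrix, so the error can only shrink as k grows. A sketch on a toy matrix (made-up values, not the movie data):

```python
import numpy as np
from scipy.sparse.linalg import svds

# Toy ratings matrix; svds needs float input and k < min(shape)
R = np.array([[5., 3., 0., 1.],
              [4., 0., 0., 1.],
              [1., 1., 0., 5.],
              [1., 0., 0., 4.],
              [0., 1., 5., 4.]])

def reconstruct(R, k):
    # Rebuild the matrix from the top-k singular triplets
    u, s, vt = svds(R, k=k)
    return np.dot(np.dot(u, np.diag(s)), vt)

err = [np.linalg.norm(R - reconstruct(R, k)) for k in (1, 2, 3)]
# Frobenius reconstruction error is non-increasing in k
print(err[0] >= err[1] >= err[2])  # True
```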

Note that in our case the dataset is small, hence the sparsity level is 0%. I recommend trying this method on the MovieLens 100k dataset, which you can download from https://grouplens.org/datasets/movielens/100k/
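Besides SVD, NMF was mentioned above as another model-based CF technique; scikit-learn's NMF class can play the same role while keeping both factor matrices non-negative. A hedged sketch on a toy matrix (made-up values, not movie_rating.csv):

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy ratings matrix for illustration
R = np.array([[5., 3., 0., 1.],
              [4., 0., 0., 1.],
              [1., 1., 0., 5.],
              [1., 0., 0., 4.],
              [0., 1., 5., 4.]])

# W holds user factors, H item factors; both are constrained >= 0
model = NMF(n_components=2, init='random', random_state=0, max_iter=500)
W = model.fit_transform(R)
H = model.components_
R_hat = np.dot(W, H)   # reconstructed ratings
print(R_hat.shape)     # (5, 4)
```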

Endnotes

In this step you have learned the fundamentals of the text mining process and different tools/techniques to extract text from various file formats. You also learned the basic text preprocessing steps to remove noise from data and different visualization techniques to better understand the corpus at hand. Then you learned various models that can be built to understand the relationship between the data and gain insight from it.

We also learned two important recommender system methods: content-based filtering and collaborative filtering.

© Manohar Swamynathan 2017
