Pig Design Patterns (2014)

Chapter 7. Advanced Patterns and Future Work

In the previous chapter, you have studied various Big Data reduction techniques that aim to reduce the amount of data being analyzed or processed. We have explored design patterns that perform dimensionality reduction using the Principal Component Analysis technique and numerosity reduction using clustering, sampling, and histogram techniques.

In this chapter, we will start by discussing design patterns that primarily deal with text data and will explore a wide array of analytics pipelines that can be built using Pig as the key ingestion and processing engine.

We will be delving into the following patterns:

·        Clustering textual data

·        Topic discovery

·        Natural language processing

·        Classification

We will also speculate about what the future holds for Pig design patterns. These future trends analyze the kind of trends that are being followed now in the mainstream to modify Pig for specific use cases. These include where these trends will originate, what trends in data will affect current design patterns, and so on.

The clustering pattern

The clustering design pattern explores text clustering by calculating set similarities and clustering the results using Pig and Python.

Background

In the previous chapter, we examined how clustering can be used as a data reduction technique. We explored a clustering technique that deals with numeric data.

Clustering text automatically groups related documents into clusters that are similar to each other and separates documents that are different into different clusters. The primary reason clustering is performed is that if a corpus of documents is clustered, we divide the search space so that the search can be performed on the cluster containing the relevant documents. Clustering is one of the most important ways of improving search effectiveness and efficiency. Whether a group of documents is similar or different is not always clear and normally varies with the actual problem. For example, when clustering research articles, any two articles are considered similar if they share comparable thematic topics. When clustering websites, we are interested in clustering the pages according to the type of information they hold. For instance, to cluster university websites, we may want to separate professors' home pages from students' home pages and pages for courses from pages for research projects.

Text clustering works in situations where we need to organize multiple text documents into neatly tagged categories to make information retrieval easier. It can also be used for automated summarization of text corpus so that we can get a summary insight into the overall content of the corpus.

Text data has many unique properties, which require the design of specialized algorithms for classification. The following are a few distinctive characteristics of text representation:

·        The text representation is generally very high dimensional, but the underlying data is sparse. In other words, the dictionary from which the documents are drawn may contain a million words, but a given document may contain only a few hundred words.

·        These words are typically correlated with each other, implying that the principal components (or important concepts) are fewer than the words.

·        The number of words in each document varies widely, requiring the word vectors to be normalized in terms of their relative frequency of presence in the document and over the entire collection. This is typically achieved by calculating the term frequency–inverse document frequency (TF-IDF).

·        These problems have a greater impact when we need to cluster shorter sentences or tweets.

Clustering plays a key role in retrieving vital insights from social media conversations, which are predominantly unstructured and huge in volume. As social media content is generated and delivered by one customer to other customers, the velocity of content generation is also a factor to be considered when choosing a clustering mechanism. Social media is not confined to just the content created by the microblogging platform Twitter and the social networking platform Facebook alone; there are various other sources of content that are routinely created in wikis, forums, blogs, and other media sharing platforms. These platforms predominantly create text content, with the exception of Flickr and YouTube, which create image and video data. Clustering various types of social media content provides an inherent understanding of the similarity in the relationship between documents, images, videos, network links, and other contextual information.

In this design pattern, we limit ourselves to clustering text data gleaned from social media so that we can interpret from the data if there are groups of similar people in our own social network. This similarity could be owing to the same job titles, companies, or location.

Motivation

There is a wide variety of algorithms commonly used for text clustering. We have Hierarchical Agglomerative Clustering and distance-based clustering techniques, which use a similarity function to measure the closeness between any two text objects. Many clustering algorithms predominantly differ in the way the similarity measure is calculated. The following diagram depicts the most common clustering techniques for text data:

Motivation

Common text clustering algorithms

The following is a brief description of the most common text clustering techniques:

·        Hierarchical Agglomerative Clustering (HAC): This technique is useful for supporting a variety of problems that arise while searching because it creates a tree hierarchy that can be leveraged for the search process and improves the search effectiveness and efficiency.

The general concept of HAC algorithms is to combine documents into clusters based on their similarity to other documents. Hierarchical clustering algorithms successively combine groups based on the best pairwise similarity between these groups of documents. The similarity is computed between different sets of points in documents using the popular distance measures such as Euclidean, Manhattan, and Levenshtein. This corresponds to single-linkage, group-average linkage, and complete linkage clustering respectively. These algorithms are quite accurate, but they suffer from lack of efficiency.

·        Distance-based partitioning algorithms: These algorithms are generally used to create a cluster of objects where hierarchy does not play an important role. There are two widely used distance-based clustering algorithms: K-medoids and K-means. These algorithms are far less accurate than the HAC algorithms, but are a lot more efficient.

·        K-medoids clustering algorithm: This technique uses a set of k data points from the original data as central points (or medoids) around which the clusters are developed. The central aim of the algorithm is to figure out an ideal set of documents from the original set of documents around which the clusters are built. Each document is assigned to the document nearest to it in the collection. This creates an iterative set of clusters from the document set, which are successively improved by a random process. The key disadvantages of K-medoids clustering algorithms are that they are slow as they require a large number of iterations in order to achieve convergence and that they do not work well for sparse text data sets.

·        K-means clustering algorithm: This technique is quite similar to K-medoids as it also uses a set of k data points around which the clusters are built. However, unlike the K-medoids, this initial set of representative points is not obtained from the original data. The initial set of representative points is obtained from methods such as the hierarchical agglomerative clustering and partial supervision techniques. The K-means algorithm is faster than the K-medoids algorithm as it reaches the convergence in far fewer iterations. The disadvantages of using the K-means method are that it is quite dependent on the accuracy of the initial set of seeds and that the centroids for a given cluster of documents may contain a large number of words.

Pig is used to ingest the source data and apply several standard transformations to the term vector representation.

1.    Remove the stop words. These are words that are nondescriptive for the topic of a document (such as a, an, is, and the).

2.    Stem the words using Porter stemmer so that words with different endings are mapped into a single word.

3.    Measure the effect of containing rare terms in the document on the overall clustering ability and then decide to discard words that appear with less than a specified threshold frequency.

The following is a quick overview of some of the string similarity measures used to compute the closeness of strings for clustering purposes:

·        Edit distance or Levenshtein distance: This calculates the dissimilarity of strings by counting the minimum number of replacements required to transform one string to another

·        Jaccard similarity: This is calculated by dividing the number of items in common between the two sets by the total number of distinct items in the two sets

·        Measuring Agreement on Set-valued Items (MASI): This distance returns a shorter distance than the Jaccard similarity when there is partial overlapping between the sets

Use cases

The clustering design pattern can be used for the following purposes:

·        Clustering data after retrieval to present more organized results to the user

·        Creating hierarchical taxonomies of documents based on their similarity for browsing purposes

Pattern implementation

The use case depicted in this pattern clusters Outlook contacts with similar job titles. Conceptually, this pattern identifies which of your contacts are similar based on an arbitrary criterion such as job title. For example, this pattern can be extended to answer which of your connections have worked in companies you want to work for or where most of your connections reside geographically.

The Pig script loads the data and performs transformations on it by replacing abbreviations with their full forms and passes distinct job titles to the Python script via streaming. Passing distinct job titles ensures that the amount of data sent to the Python script is reduced.

The bulk of the clustering code is implemented in Python, which has ready-made support for clustering. The Python script is invoked via streaming in the Reduce phase. The job titles passed by Pig are read by the Python script from stdin and MASI distance is calculated. Clustering is done on the job title based on the distance and threshold and then the clustered job titles are written to stdout.

The Pig script reads the values written by Python to stdout and performs the name and job title association by fetching the job titles from the data available in the Pig relation.

We have explored several methods to calculate string similarity in order to cluster the job titles and zeroed in on the MASI distance for implementation in this pattern. This distance measure is deemed appropriate for our current use cases where there are overlaps in the job titles.

Code snippets

To illustrate the working of this pattern, we have exported contact names and their job titles from Outlook into a CSV file and de-identified the names. This file is stored on the HDFS.

Note

All the external Python modules that are not in the default Python path should be added to the PYTHONPATH environment variable before execution of the script.

The following code snippet is the Pig script illustrating the implementation of this pattern:

 /*

Assign alias cluster_contacts to the streaming command

Use SHIP to send the streaming binary files (Python script) from the client node to the compute node

*/

DEFINE cluster_contacts 'cluster_contacts.py' SHIP ('cluster_contacts.py');

/*

Register the piggybank jar file

*/

REGISTER '/home/cloudera/pig-0.11.0/contrib/piggybank/java/piggybank.jar';

/*

Load the outlook_contacts.csv dataset into the relation outlook_contacts

*/

outlook_contacts = LOAD '/user/cloudera/pdp/datasets/advanced_patterns/outlook_contacts.csv' USING PigStorage(',') AS (name: chararray, job_title: chararray);

/*

Transform the job titles by replacing few abbreviations with their full forms

*/

transformed_job_titles = FOREACH outlook_contacts {

job_title_sr = REPLACE(job_title,'Sr', 'Senior');

.

.

job_title_vp = REPLACE(job_title_cfo,'VP', 'Vice President');

GENERATE name AS name,job_title_vp AS job_title;

}

/*

Trim spaces for the field job_title

*/

jt_trimmed = FOREACH transformed_job_titles GENERATE TRIM(job_title) AS job_title,name;

/*

Group outlook_contacts by job_title

Extract unique job titles and store into the relation jt_flattened

STREAM is used to send the data to the external script

The Python script executes as a reduce job as STREAM is called after GROUP BY

The result is stored in the relation clustered_jt

*/

jt_trimmed_grpd = GROUP jt_trimmed BY job_title;

jt_flattened = FOREACH jt_trimmed_grpd GENERATE flatten(group);

clustered_jt = STREAM jt_flattened THROUGH cluster_contacts;

/*

Clustered job titles from relation clustered_jt are typecasted to chararray and are assigned to relation clustered_jt_cast.

clustered_jt_cast relation contains job title clusters. 

*/

clustered_jt_cast = FOREACH clustered_jt GENERATE (chararray)$0 AS cluster;

/*

The job titles are tokenized by using comma and are assigned to the relation clustered_jt_tokens along with the cluster name.

*/

clustered_jt_tokens  = FOREACH clustered_jt_cast GENERATE TOKENIZE(cluster,','), cluster;

/*

Each job title in job cluster is converted into a new tuple and is assigned to relation clustered_jt_flattened along with the cluster name.

*/

clustered_jt_flattened = FOREACH clustered_jt_tokens  GENERATE FLATTEN($0) AS cluster_job, cluster;

/*

Trim spaces in the job titles.

*/

clustered_jt_trimmed  = FOREACH clustered_jt_flattened GENERATE TRIM(cluster_job) AS cluster_job, cluster;

/*

Join jt_trimmed relation by job_title with the relation clustered_jt_trimmed by cluster_job. Project the contact name and cluster name.

*/

jt_clustered_joind = JOIN jt_trimmed BY job_title,clustered_jt_trimmed  BY cluster_job;

name_clustered_jt = FOREACH jt_clustered_joind GENERATE jt_trimmed::name AS name, clustered_jt_trimmed::cluster AS cluster;

/*

Remove duplicate tuples from relation name_clustered_jt.

*/

uniq_name_clustered_jt  = DISTINCT name_clustered_jt;

/*

Group the relation uniq_name_clustered_jt by field cluster and project the cluster name(consisting of a set of job titles) and the contact name

*/

name_clustered_jt_grpd =  GROUP uniq_name_clustered_jt  BY cluster;

similar_jt_clusters= FOREACH name_clustered_jt_grpd GENERATE group AS clustername, uniq_name_clustered_jt.name AS name;

/*

The results are stored on the HDFS in the directory clustering

*/

STORE similar_jt_clusters into '/user/cloudera/pdp/output/advanced_patterns/clustering';

The following is the Python code snippet illustrating the implementation of this pattern:

#! /usr/bin/env python

# Import required modules

import sys

import csv

from nltk.metrics.distance import masi_distance

# Set the distance function to use and the distance threshold value

DISTANCE_THRESHOLD = 0.5

DISTANCE = masi_distance

def cluster_contacts_by_title():

# Read data from stdin and store in a list called contacts

    contacts = [line.strip() for line in sys.stdin]

    for c in contacts[:]:

        if len(c)==0 :

            contacts.remove(c)

# create list of titles to be clustered (from contacts list)

    all_titles = []

    for i in range(len(contacts)):

        title = [contacts[i]]

        all_titles.extend(title)

    all_titles = list(set(all_titles))

    # calculate masi_distance between two titles and cluster them based on the distance threshold, store them in dictionary variable called clusters

    clusters = {}

    for title1 in all_titles:

        clusters[title1] = []

        for title2 in all_titles:

            if title2 in clusters[title1] or clusters.has_key(title2) and title1 \

                in clusters[title2]:

                continue

            distance = DISTANCE(set(title1.split()), set(title2.split()))

            if distance < DISTANCE_THRESHOLD:

                clusters[title1].append(title2)

    # Flatten out clusters

    clusters = [clusters[title] for title in clusters if len(clusters[title]) > 1]

    # Write the cluster names to stdout

    for i in range(len(clusters)):

        print ", ".join(clusters[i])

Results

The following is a snippet of the results after executing the code in this pattern on the dataset. We have shown only a few of the clusters to improve readability. The comma separated list on the left shows the clustered job titles, while the names associated with the job titles are displayed on the right.

IT Analyst, IT Financial Analyst    {(Name268),(Name869)}

Delivery Head, Delivery Unit Head    {(Name631),(Name662)}

Data Scientist, Lead Data Scientist    {(Name50),(Name823),(Name960),(Name314),(Name124),(Name163),(Name777),(Name58),(Name695)}

Lead Analyst, Lead Business Analyst    {(Name667),(Name495),(Name536),(Name952)}

Pega Practice Head, M2M Practice Head    {(Name618),(Name322)}

Technical Lead, Lead Technical Writer    {(Name52),(Name101),(Name120),(Name969),(Name683)}

Vice President, Vice President Sales    {(Name894),(Name673),(Name72)}

Business Analyst, Lead Business Analyst    {(Name536),(Name847)}

Director - Presales, Director - Staffing    {(Name104),(Name793)}

Product Manager, Senior. Product Manager    {(Name161),(Name956)}

Technology Lead, Technology Lead Service    {(Name791),(Name257)}

In the following diagram, we have graphically represented a few of the clusters to improve readability:

Results

Clustering output

As shown in the preceding diagram, the first cluster consists of two contacts with the job titles Big Data scientist – Architect and Data Architect – Big Data. These two titles are similar and hence the contacts are grouped into a single cluster.

Additional information

The complete code and datasets for this section can be found in the following GitHub directories:

·        Chapter7/code/

·        Chapter7/datasets/

The topic discovery pattern

The topic discovery design pattern explores one way of classifying a corpus of text by the technique called Latent Dirichlet Allocation (LDA) using Pig and Mahout.

Background

The discovery of the hidden topic in a corpus of text is one of the latest developments in the field of natural language processing. The data posted on social media sites generally covers a wide array of subjects. However, in order to extract relevant information from these sites, we have to classify the text corpus based on the relevance of the topics hidden in the text. This will enable automated summarization of a large amount of text and find what it is really about. Prior knowledge of the topics that are thus discovered is used to classify new documents.

Motivation

The key difficulty topic models solve is that of classifying a text corpus and identifying its topic in the absence of any prior knowledge of its contents. Prior knowledge implies that the document has not been labeled before as belonging to a particular topic. Topic models use statistical methods to discover topics hidden in the text corpus.

Latent Dirichlet Allocation (LDA) is an implementation of the topic models that works by initially identifying topics from a set of words contained in a document and then grouping the documents into combinations of topics.

LDA uses a TF-vector space model to identify the meaning of the word based on its context rather than frequency. Using LDA, the word's intent is resolved by removing ambiguities. LDA uses contextual clues to connect words with the same meaning and differentiate between usages of words with multiple meanings.

We can form an intuitive understanding of topic models by considering a case where it is easy for humans to comprehend that the words "penicillin" and "antibiotics" will appear more often in documents about medicine and the words "code" and "debugger" will appear more often in documents about software. Topic models try to glean the topic from the corpus based on the word distributions and document distributions of the topics.

Let us consider the following statements:

·        I ate oats and carrots for breakfast

·        I love eating oranges and carrots

·        Puppies and kittens are cute

·        My brother brought a puppy home

·        The cute rabbit is chewing a piece of carrot

LDA automatically discovers the topics these sentences contain. As an example, if we perform LDA on these sentences and perform a query for the discovered topics, the output might be as follows:

Topic A: 30% oats, 15% carrots, 10% breakfast, 10% chewing, … (this topic could be interpreted to be about food)

Topic B: 20% Puppies, 20% kittens, 20% cute, 15% rabbit, ... (this topic could be interpreted to be about cute animals)

Sentences 1 and 2: 100% Topic A

Sentences 3 and 4: 100% Topic B

Sentence 5: 60% Topic A, 40% Topic B

Pig is the glue that connects the raw data and LDA algorithm by pre-processing the data and converting it into a format amenable to the application of the LDA algorithm. It comes in handy to quickly ingest the right data from various sources, cleanse it, and transform it into the necessary format. Pig manufactures the dataset from the raw data and sends it to the LDA implementation script.

Use cases

You can consider using this design pattern on an unstructured text corpus to explore the latent intent and summarization. This pattern can also be considered in cases where we are not aware of the contents of the text corpus and cannot classify it based on a supervised classification algorithm so that we can understand even the latent topics.

Pattern implementation

To implement this pattern, we have considered a set of articles on Big Data and medicine, and we intend to find the topics inherent in the documents. This design pattern is implemented in Pig and Mahout. It illustrates one way of implementing the integration of Pig with Mahout to ease the problem of vectorizing the data and converting it into a Mahout-readable format, allowing quick prototyping. We have deliberately omitted the steps for pre-processing and vector conversion as we have already seen an example illustrating these steps in Chapter 6Understanding Data Reduction Patterns.

The sh command in Pig is used to invoke Mahout commands that perform the pre-processing, create sparse vectors, and apply Collapsed Variational Bayes (CVB), which is Mahout's implementation of LDA for topic modeling. The resultant list of words, along with their probabilities, is returned for each topic.

Code snippets

To illustrate the working of this pattern, we have considered a dataset with a couple of articles on Big Data and medicine. The files are stored on HDFS. For this pattern, we will be applying topic modeling on the text corpus to identify the topics.

The following code snippet is the Pig code illustrating the implementation of this pattern:

/*

Register piggybank jar file

*/

REGISTER '/home/cloudera/pig-0.11.0/contrib/piggybank/java/piggybank.jar';

/*

*Ideally the following data pre-processing steps have to be generally performed on the actual data, we have deliberately omitted the implementation as these steps were covered in the respective chapters

*Data Ingestion to ingest data from the required sources

*Data Profiling by applying statistical techniques to profile data and find data quality issues

*Data Validation to validate the correctness of the data and cleanse it accordingly

*Data Transformation to apply transformations on the data.

*Data Reduction to obtain a reduced representation of the data.

*/

/*

We have deliberately omitted the steps for vector conversion as we have an example illustrating these in the chapter Understanding Data Reduction Patterns.

*/

/*

Use sh command to execute shell commands.

Convert the files in a directory to sequence files

-i specifies the input directory on HDFS

-o specifies the output directory on HDFS

*/

sh /home/cloudera/mahout-distribution-0.8/bin/mahout seqdirectory -i /user/cloudera/pdp/datasets/advanced_patterns/lda -o /user/cloudera/pdp/output/advanced_patterns/lda/sequence_files

/*

Create sparse vectors

-i specifies the input directory on HDFS

-o specifies the output directory on HDFS

-nv to get the named vectors

*/

sh /home/cloudera/mahout-distribution-0.8/bin/mahout seq2sparse -i /user/cloudera/pdp/output/advanced_patterns/lda/sequence_files -o /user/cloudera/pdp/output/advanced_patterns/lda/sparse_vectors -nv -wt tf

/*

Use rowid to convert the sparse vectors by changing the text key to integer

-i specifies the input directory on HDFS

-o specifies the output directory on HDFS

*/

sh /home/cloudera/mahout-distribution-0.8/bin/mahout rowid -i /user/cloudera/pdp/output/advanced_patterns/lda/sparse_vectors/tf-vectors/ -o /user/cloudera/pdp/output/advanced_patterns/lda/matrix

/*

Use Collapsed Variational Bayes for topic modelling

-i specifies the input directory on HDFS

-o specifies the output directory on HDFS

-k specifies the number of topics

-x specifies the maximum number of iterations

-dict specifies the path to term dictionary

-dt specifies the path to document topic distribution

-mt specifies temporary directory of the model, this is useful when restarting the jobs

*/

sh /home/cloudera/mahout-distribution-0.8/bin/mahout cvb -i /user/cloudera/pdp/output/advanced_patterns/lda/matrix/matrix -o /user/cloudera/pdp/output/advanced_patterns/lda/lda-out -k 2 -x 5 -dict /user/cloudera/pdp/output/advanced_patterns/lda/sparse_vectors/dictionary.file-* -dt /user/cloudera/pdp/output/advanced_patterns  /lda/lda-topics -mt /user/cloudera/pdp/output/advanced_patterns/  lda/lda-model

/*

Display top ten words along with their probabilities for each topic

-i specifies the input directory on HDFS

-d specifies the path to the dictionary file

-dt specifies the type of the dictionary (sequence / text)

-sort sorts the Key/Value pairs in descending order

*/

sh /home/cloudera/mahout-distribution-0.8/bin/mahout vectordump -i /user/cloudera/pdp/output/advanced_patterns/lda/lda-out -d /user/cloudera/pdp/output/advanced_patterns/lda/sparse_vectors/dictionary.file-* -dt sequencefile -vs 10 -sort /user/cloudera/pdp/output/advanced_patterns/lda/lda-out

Results

The following is a snippet of the results after executing the code in this pattern on the dataset:

Topic 1: {examination:0.11428571430112491,medical:0.09999999999299336,follow:0.057142857068596745,may:0.057142857068595974,patient:0.05714285706859565,order:0.05714285706858435,tests:0.042857142760463936,physical:0.04285714276045852,signs:0.04285714276044089,other:0.028571428452333902}

Topic 2:

  {data:0.14754098319799064,parallel:0.0983606554177082,processing:0.08196721282428095,mapreduce:0.08196721282428092,big:0.06557377023085392,framework:0.06557377023085392,architecture:0.06557377023085391,use:0.032786885044002005,end:0.032786885044002005,type:0.032786885044002005}

The preceding result indicates discovery of two topics (Topic 1 and Topic 2) in the document and the list of the top ten words along with their probabilities for each topic. These topics are related to Big Data and medicine.

Additional information

The complete code and dataset for this section can be found in the following GitHub directories:

·        Chapter7/code/

·        Chapter7/datasets/

More information on Mahout's implementation of LDA can be found at https://mahout.apache.org/users/clustering/latent-dirichlet-allocation.html.

The natural language processing pattern

This design pattern explores the implementation of natural language processing on unstructured text data using Pig.

Background

Information retrieval from unstructured data, such as blogs and articles, revolves around extracting meaningful information from huge chunks of un-annotated text. The core goal of information retrieval is to extract structured information from unstructured text. This structured information is indexed to optimize the search. For example, consider the following sentence:

"Graham Bell invented the telephone in 1876"

The preceding sentence is used to extract the following structured information:

Inventorof (Telephone, Graham Bell)

InventedIn(Telephone, 1876)

There are a number of ways in which information retrieval can be performed on a corpus of text. We have studied in The unstructured text profiling pattern section of Chapter 3Data Profiling Patterns, how a bag of words model based on TF-IDF helps to decompose a document into word frequencies and makes information retrieval possible by accessing the document in which a word is frequent. One of the glaring shortcomings of this model, based on TF-IDF, is that it does not require deep semantic understanding of data. Instead, these models are concerned with the syntax of the words that were separated by whitespace to break the document into a bag of words and use frequency and simple similarity metrics to determine which words were likely to be important in the data. Even though these techniques are used for a wide variety of applications, they fail in cases where we have to retrieve information dealing with the context of the data.

As an illustration, biomedical researchers often examine a large number of medical publications to glean discoveries related to genes, proteins, or other biomedical entities. To enable this effort, a simple search using keyword matching (such as TF-IDF) may not be adequate, because many biomedical entities have synonyms and ambiguous names; this makes it hard to accurately retrieve relevant documents. It is a critical task in biomedical literature mining to identify biomedical entities from text based on semantics or context and to link them to their corresponding entries in existing knowledge bases. In this design pattern, we will explore extraction of named entities from an unstructured corpus using Pig and Python.

Motivation

The two fundamental tasks of context-sensitive decomposition of data using natural language processing are named entity recognition and relation extraction.

Named entity recognition is a technique for identifying names of entities, such as "Obama", "president", and "Washington", from unstructured text and classifying them into predefined types, such as people, job, and locations. Named entity recognition generally cannot be performed using string matching since the entities of a given type can be unlimited and also since the type of the named entity can be context dependent. In the previous example, the entity "Washington" can belong to the entity types, Location or Person; to correctly determine the correct entity type, its context has to be considered. Named entity recognition is the foundational task for information extraction. The extraction of other information structures, such as relationships and events, depends on accuracy of named entity recognition as a pre-processing step.

Typically, named entity recognition is implemented using statistical sequence labeling algorithms, such as maximum entropy models, hidden Markov models, and conditional random fields.

The following are the high-level steps involved in performing named entity recognition:

Motivation

Named entity recognition

The following is a brief description of the steps involved in an NLP pipeline:

·        End of sentence detection: This is the first step toward processing the corpus. It is performed on the entire corpus of text to split it into a collection of meaningful sentences. This step overcomes the ambiguities involved in the end-of-sentence detection where a period or other punctuation mark denotes the end of sentences and other abbreviations.

·        Tokenization: This operates on single sentences and converts them into tokens.

·        Parts-of-speech tagging: This assigns information about parts of speech (such as nouns, verbs, and adjectives) to each token. The parts of speech listed as a result in this step will be grouped together (for example, all the nouns may be grouped). This grouping will eventually help reasoning about the types of entities they belong to (for example, people, places, and organizations).

·        Chunking: This performs a series of tasks such as finding noun groups and verb groups, and completes partitioning of sentences into groups of different types.

·        Extraction: This analyzes each chunk and tags it as an entity type, such as people, places, and organizations.

The previously mentioned steps to extract entities enable us to use these entities as the basis of analysis as opposed to document-centric analysis involving keyword searches and frequency analysis. One simple way to do this analysis would be to extract all the nouns and noun phrases from the document, and index them as entities appearing in the documents.

In this design pattern, Pig is used to ingest the source data and preprocess it before the NLP algorithm is applied and the parts of speech or entities are identified.

Use cases

This design pattern can be used to address the needs of the following problem areas:

·        Extracting the financial or biomedical information from news or other text corpus

·        Extracting entities to automatically summarize text and creating new text by combining information from multiple documents

·        Detection of certain sequences in text, which are needed prior to text clustering or indexing

Pattern implementation

To implement this pattern, we have considered a text data set containing some text on the invention of the telephone. The objective of the code is to extract named entities from the document.

This design pattern is implemented by integrating Pig and Python. Python has extensive support for processing natural language through its NLTK toolkit. The Pig script loads a text file and passes this relation to a Python script via streaming. Python's NLTK library has built-in functions to tokenize sentences and words. Its pos_tag function tags parts of speech for each token; the chunking operation finds the noun and verb groups and tags them with entity types such as people, organizations, and places. The Python script uses these functions of the NLTK library and returns the named entities to the Pig script.

Code snippets

To illustrate the working of this pattern, we have considered a text dataset extracted from the Wikipedia article on the invention of the telephone. The file is stored on HDFS. For this pattern, we will be using Pig and Python to extract named entities.

Note

All the external Python modules not in the default Python path should be added to the PYTHONPATH environment variable before the execution of the script.

The following code snippet is the Pig code illustrating the implementation of this pattern:

 /*

Assign alias ner to the streaming command

Use SHIP to send the streaming binary files (Python script) from the client node to the compute node

*/

DEFINE ner 'named_entities.py' SHIP ('named_entities.py');

/*

Load the dataset into the relation data

*/

data = LOAD '/user/cloudera/pdp/datasets/advanced_patterns/input.txt';

/*

STREAM is used to send the data to the external script

The result is stored in the relation extracted_named_entities

*/

extracted_named_entities = STREAM data THROUGH ner;

/*

The results are stored on the HDFS in the directory nlp

*/

STORE extracted_named_entities INTO '/user/cloudera/pdp/output/advanced_patterns/nlp';

The following code snippet is the Python code illustrating the implementation of this pattern:

 #! /usr/bin/env python

# Import required modules

import sys

import string

import nltk

# Read data from stdin and store it as sentences

for line in sys.stdin:

    if len(line) == 0: continue

    sentences = nltk.tokenize.sent_tokenize(line)

    # Extract words from sentences

    words = [nltk.tokenize.word_tokenize(s) for s in sentences]

    # Extract Part of Speech from words

    pos_words = [nltk.pos_tag(t) for t in words]

    # Chunk the extracted Part of Speech tags

    named_entities = nltk.batch_ne_chunk(pos_words)

    # Write the chunks to stdout

    print named_entities[0]

Results

The following is a snippet of the results after executing the code in this pattern on the dataset. The tag NNP indicates a noun that is part of a noun phrase, VBD indicates a verb that's in simple past tense, and JJ indicates an adjective. For more information on the tags, refer to the Penn Treebank Project, which provides a full summary. The following is a snippet of part of speech tags that are returned:

(S

  (PERSON Alexander/NNP)

  (PERSON Graham/NNP Bell/NNP)

  was/VBD

  awarded/VBN

  a/DT

  patent/NN

  for/IN

  the/DT

  electric/JJ

  telephone/NN

  by/IN

  (ORGANIZATION USPTO/NNP)

  in/IN

  March/NNP

  1876/CD

  ./.)

We have redrawn the results for better readability, as shown in the following diagram:

Results

The named entity recognition output

The parts of speech tagging is done for each word. Alexander Graham Bell is identified as a person and part of speech tagging is done as an NNP (noun that is part of a noun phrase), which indicates a proper noun. USPTO is identified as an organization and is tagged as a proper noun.

Additional information

The complete code and dataset for this section can be found in the following GitHub directories:

·        Chapter7/code/

·        Chapter7/datasets/

Additional parts of speech tagging information is at http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.

The classification pattern

This design pattern explores the implementation of classification using Pig and Mahout.

Background

Classification is one of the core concepts of predictive analytics; it is a technique in which data is labeled into categories or groups according to its characteristics. Classification is a simplified way to make decisions based on the data and its attributes. For example, in a survey questionnaire, we choose an appropriate answer or select a particular check box for a given question. Here, we are making a decision from a finite group of choices (check boxes or answers) provided to us. Sometimes, the number of choices can be as small as two (yes/no). In these cases, classification uses specific information on the input data to choose a single output from a group of predetermined responses.

Consider the case of a human being making a decision to buy a pizza. The input data for this decision making includes the price, toppings, type of crust, and so on, for multiple pizzas, and the group of predetermined choices includes to buy and don't buy. Classification helps the person involved to efficiently make the decision by looking at the input information and choosing to buy a pizza if it suits his taste and is within his specific price limit.

Using machine learning for classification, we can train the machine-learning algorithm to mimic human thought and perform automated decisions based on the input data characteristics. These algorithms work best when they have to decide a single output from a short list of categorical values based on the specific input characteristics.

The well-known example of predictive analysis using classification is spam detection where the machine learning algorithm uses the details of past e-mails that were labeled as spam and combines this with the attributes of e-mail messages to decide whether new messages are spam or not spam. Similarly, in the case of credit card fraud detection, the past history of fraudulent transactions and the attributes of the current transaction are used to decide whether the transaction is fraudulent or not.

All the classification algorithms learn how to decide based on the examples (past data). The accuracy of the decision making depends on the accuracy of the examples fed into the classification algorithm and also the quality of the input data.

Motivation

Classification is a three-step process: trainingtesting, and production. The training and testing are preproduction steps, which specifically use historical data to build and refine the model. This data has already been labeled with the decision (say spam or not spam). The historical data is divided into two buckets, one for building the training model and the other for testing. The training data is approximately 80 to 90 percent of the historical data and the rest is testing data. The decisions in the testing bucket are deliberately removed.

·        Training: The input for the training step consists of example data labeled with known decisions. Based on the known decisions and the input data characteristics, the trained model performs classification in the testing step. The training model is the most important artifact in the classification engine, and it is tuned to predict as accurately as possible by supplying it with appropriately labeled example data.

·        Testing: The input for the testing step is the trained model from the previous step plus the new examples that were withheld from the training step that have the decisions deliberately removed. As a result of the testing step, the model chooses a decision and these decisions are evaluated for accuracy. This evaluation is done by comparing known results with the results from the model. This step has a bearing on the performance on the model, which is revised accordingly. Once the model performs as expected, it is deployed into production, where more unlabelled examples are given to it.

·        Production: The input to the production step is a set of new example data whose decision is unknown. The model deployed in production uses the inference formed out of the training and testing phase to perform the actual classification. The output of this phase is generally in line with the precision of the results obtained in the testing phase unless there is a drastic change in the input values or poor data quality. Occasionally, the samples of the production data are taken to be used as new training data so that the classification model is updated and deployed back into production.

Motivation

The classification process

The performance of the classification exercise can be understood by the confusion matrix. The confusion matrix contains the values of the decisions made by the model (predicted class) and the actual decisions (actual class). It generally has two rows and two columns that report the number of true positives, false negatives, false positives, and true negatives. The columns of the confusion matrix represent the predicted class, and the rows represent the actual class.

For example, if the model needs to classify e-mails as SPAM and NOT_A_SPAM and the document is actually SPAM but the model classified it as NOT_A_SPAM, then the confusion matrix is as follows:

Motivation

The confusion matrix

In the preceding confusion matrix illustration, the diagonal contains the counts of e-mails that the model has correctly classified and the off-diagonal contains the wrongly classified instances. The perfect classifier will have all entries classified correctly and hence will have all the counts along the diagonal.

Classification can be performed using a variety of algorithms. Each of these algorithms differs broadly based on the type of input data it can handle (such as skewed data and uniform data), the amount of data, explainability of the results, the number of attributes (with high dimensional space), the number of classifiers (such as binary yes/no or multiclassifier), speed of training and classification, parallelizablility, and so on.

The following diagram shows a snapshot of the most important classification algorithms. These algorithms have different trade-offs in terms of effectiveness, efficiency, and applicability for a given problem set.

Motivation

A few classification algorithms

Pig is an extremely useful language to implement the classification pipeline in production. It comes in handy to quickly explore the data, assign the right schema, ingest the right data from various sources, cleanse it, integrate the data, and transform it into the necessary format. Pig manufactures the dataset from the raw data so that classification is performed on this ready-made set.

Use cases

This design pattern can be used to address the needs of the following problem areas, but is not limited to them:

·        Spam filtering

·        Fraud detection

·        Sentiment analysis

Pattern implementation

This design pattern is implemented in Pig and Mahout. It illustrates one way of implementing integration of Pig with Mahout to ease the problem of vectorizing the data and converting it into a Mahout-readable format, allowing quick prototyping. We have deliberately omitted the steps for pre-processing and vector conversion as we have already seen an example illustrating these steps in Chapter 6Understanding Data Reduction Patterns.

Typically, the data profiling, validation and cleansing, and transformation and reduction steps can be applied using the Pig script before sending it to Mahout. In our use case, we made the assumption that the data has already been profiled, cleansed, and transformed.

The data is divided into training and test data in the ratio of 80:20. The training data is used to train the model, and the test data is used to test the model's accuracy of prediction. The decision tree model is built on the training data and is applied to the test data. The resultant matrix shows the comparison between the predicted and actual results.

Code snippets

To illustrate the working of this pattern, we have considered the German credit dataset in the UCI format. There are 20 attributes (7 numerical and 13 categorical) with 1000 instances. The file is stored on the HDFS. For this pattern, we will be using Pig and Mahout to train the model to classify people as good or bad customers based on a set of attributes; the prediction will then be tested on test data.

The following is the Pig script illustrating the implementation of this pattern:

 /*

Register piggybank jar file

*/

REGISTER '/home/cloudera/pig-0.11.0/contrib/piggybank/java/piggybank.jar';

/*

*The following data pre-processing steps have to be performed here, we have deliberately omitted the implementation as these steps were covered in the respective chapters

*Data Ingestion to ingest data from the required sources

*Data Profiling by applying statistical techniques to profile data and find data quality issues

*Data Validation to validate the correctness of the data and cleanse it accordingly

*Data Transformation to apply transformations on the data.

*Data Reduction to obtain a reduced representation of the data.

*/

/*

We have deliberately omitted the steps for vector conversion as we have an example illustrating these in the chapter Understanding Data Reduction Patterns.

*/

/*

Use sh command to execute shell commands.

Generate file descriptor for the training dataset

The string C N 2 C N 2 C N 2 C N C N 2 C N C N 2 C L provides the description of the data.

C specifies that the first attribute is Categorical, it is followed by N specifying the next attribute to be Numeric. This is followed by 2 C which means that the next two attributes are Categorical.

L represents the Label

*/

sh hadoop jar /home/cloudera/mahout-distribution-0.8/core/target/mahout-core-0.8-job.jar org.apache.mahout.classifier.df.tools.Describe -p /user/cloudera/pdp/datasets/advanced_patterns/german-train.data -f /user/cloudera/pdp/datasets/advanced_patterns/german-train.info -d C N 2 C N 2 C N 2 C N C N 2 C N C N 2 C L

/*

Build Random Forests

-t specifies the number of decision trees to build

-p specifies usage of partial implementation

-sl specifies the number of random attributes to select for each node

-o specifies the output directory

-d specifies the path to the training dataset

-ds specifies the data descriptor

-Dmapred.max.split.size indicates the maximum size of each partition

*/

sh hadoop jar /home/cloudera/mahout-distribution-0.8/examples/target/mahout-examples-0.8-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest -Dmapred.max.split.size=1874231 -d /user/cloudera/pdp/datasets/advanced_patterns/german-train.data -ds /user/cloudera/pdp/datasets/advanced_patterns/german-train.info -sl 5 -p -t 100 -o /user/cloudera/pdp/output/advanced_patterns/classification

/*

Predict the label in the test dataset

-i specifies the file path of the test dataset

-ds specifies the dataset descriptor, we use the one generated for training data as the data description is the same for both training and test data

-m specifies the file path of the decision tree built on the training data

-a specifies that confusion matrix has to be calculated

-mr specifies usage of Hadoop to distribute the classification

-o specifies the output directory

*/

sh hadoop jar /home/cloudera/mahout-distribution-0.8/examples/target/mahout-examples-0.8-job.jar org.apache.mahout.classifier.df.mapreduce.TestForest -i /user/cloudera/pdp/datasets/advanced_patterns/german-test.data -ds /user/cloudera/pdp/datasets/advanced_patterns/german-train.info -m /user/cloudera/pdp/output/advanced_patterns/classification -a -mr -o /user/cloudera/pdp/output/advanced_patterns/classification_pred

Results

The following snapshot shows the results after executing the code in this pattern on the dataset:

Results

The decision tree output

The preceding matrix shows the comparison between the predicted and actual results. We can see that the model predicted 154 instances correctly, while it classified 46 instances incorrectly. The confusion matrix shows that out of 60 instances, 22 were correctly classified as bad customers and 38 were wrongly classified as good. Similarly, out of 140 instances, 132 were correctly classified as good customers and 8 were wrongly classified as bad.

Additional information

The complete code and dataset for this section can be found in the following GitHub directories:

·        Chapter7/code/

·        Chapter7/datasets/

Information on using Mahout for classification is present at https://mahout.apache.org/users/stuff/partial-implementation.html.

Future trends

When I began writing this book, the usage of Pig was moving quickly. Knowledge about new usage patterns, new features, and new systems that are integrated with Pig is being pushed into the public domain by a variety of industries and by academia at regular intervals. These developments will have a direct effect on the Pig design patterns explored in this book. The adoption of newer techniques will also drive the user community's documentation of Pig design patterns by sharing new patterns and by maturing the already existing patterns.

Emergence of data-driven patterns

In this book, we have extensively dealt with using Pig design patterns in the traditional enterprise settings. The future holds great promise owing to the growth of the Internet of Things phenomenon. In the future, the Internet of Things will enable every human artifact, every physical object of the world, and even every person to be plausibly networked. All of these things will be capable of being connected, read, and monitored, and data-driven intelligence will be delivered continuously.

In the traditional setting, the data journeys along familiar routes. Exclusive data and information is lodged in regular databases and analyzed in reports; it then rises up the management chain.

These familiar routes of data and information flow will change according to the newer paradigm of the Internet of Things in which data from the external world (from sensors and actuators of devices) is poised to be an important information source to drive real analytics from the truly connected world.

Emerging Pig design patterns might potentially address the impending data deluge emanating from the Internet of Things. These patterns might deal with integrating high-velocity streaming data at regular intervals and perform streaming analysis using Pig. The proposed work related to implementing Pig on Storm and Pig on Tez could be a good starting point.

The emergence of solution-driven patterns

As design patterns continue to get wider acceptance for many business problems, users tend to see the merits of grouping these patterns into manageable modular chunks of reusable pattern libraries. In this book, the emphasis was to group patterns based on the familiar route the data takes from ingestion to egression; there might be a novel grouping mechanism in which the patterns are grouped based on the functional usage. From this perspective, newer design patterns could potentially emerge to fill the gaps, which this book has not addressed.

Patterns addressing programmability constraints

Pig is designed to be a data-driven procedural language, which can perform small-scale analysis and is not suitable for implementing complex mathematical algorithms. Its mainstay is to be in front of the data pipeline trying to understand, ingest, and integrate data so that data can be analyzed by the likes of R, Weka, Python, OpenNLP, and Mahout libraries.

There is an immediate and compelling need to make the integration of these external libraries seamless with Pig, owing to the inherent complexities involved. Typically, while integrating Pig with R or any other analytics library, we encounter difficulties. These include not finding all the commonly used algorithms implemented in the library, problems registering the library functions, issues with data type incompatibilities, lack of built-in functions, and many others.

Newer design patterns could potentially emerge, resulting in a framework with closer integration between these external libraries and Pig. The extensibility features of Pig, such as streaming and UDFs, could come in handy to implement these frameworks. These design patterns take advantage of both the statistical analysis capability of the libraries and the parallel data processing capability of Pig.

Summary

In this chapter, we explored advanced patterns that specifically deal with using Pig to analyze unstructured text data using various patterns.

We started by understanding the context and the motivation behind clustering text data; we then examined in brief several techniques followed by a use case that elaborates through Pig code. Similarly, we understood the relevance of topic models to understanding the latent context of textual documents using an example of text containing Big Data and medicine. We have explored how Pig integrates with the Python's NLTK library to perform natural language processing in order to decompose a text corpus into sentences and recognize named entities; these entities are eventually used in indexing and information retrieval. In the last pattern, we considered a credit dataset to illustrate the process of classification or predictive analytics using Mahout integrated with Pig.

The Future trends section scratched the surface to identify future design patterns in conjunction with the evolution of Pig as a mainstream programming language to process Big Data. This section also brings into perspective the progressive nature of design patterns that enables you to identify and develop new design patterns you haven't seen before and share it with the world.

I hope that this book has provided you a springboard for readily using the design patterns mentioned in this book. This bridges the gap between theoretical understanding and practical implementation of creating complex data pipelines, and apply it in various stages of data management life cycle and analytics. Through this book, we covered the journey of Big Data from the time it enters the enterprise to its eventual use in analytics; and throughout this journey, Pig design patterns performed the role of a catalyst to guide us through the successive steps of data management life cycle and analytics.