Pig Design Patterns (2014)

Chapter 3. Data Profiling Patterns

 

Time looking at your data is always well spent.

 
 

--Witten, et al

In the previous chapter, you studied the various patterns for ingesting different types of data into, and egressing data from, the Hadoop ecosystem so that the next logical steps in the analytics process can begin. In this chapter, we will examine the most widely used design patterns related to data profiling. This chapter takes a step-by-step approach to diagnosing whether your dataset has problems and, ultimately, to turning the dataset into usable information.

Data profiling is a necessary first step in getting any meaningful insight into the data ingested by Hadoop, by understanding the content, context, structure, and condition of data.

The data profiling design patterns described in this chapter collect important information on the attributes of data in the Hadoop cluster before the process of cleaning the data into a more useful form begins. In this chapter, we will look at the following sections:

·        Understanding the data profiling concepts and their implications in the context of Big Data

·        Rationale for using Pig in data profiling

·        Understanding the data type inference design pattern

·        Understanding the basic statistical profiling design pattern

·        Understanding the pattern-matching design pattern

·        Understanding the string profiling design pattern

·        Understanding the unstructured text profiling design pattern

Data profiling for Big Data

Bad data lurks in all of the data that is ingested by Hadoop, but its impact is magnified by the phenomenal volume and variety that constitute Big Data. Working with missing records, malformed values, and wrong file formats amplifies the amount of wasted time. What drives us to frustration is seeing the amount of data that we can't use even though we have it, data that we had at hand and then lost, and data that is not the same as it was yesterday. In a Big Data analytics project, it is common to be handed an extremely large dataset without much information as to where it came from, how it was collected, what the fields mean, and so on. In many cases, the data has gone through many hands and multiple transformations since it was gathered, and nobody really knows what it all means anymore.

Data profiling is a measure of how good the data is and of its fitness for processing in the subsequent steps. It simply indicates what is wrong with the data. Data profiling is the first short-burst analysis performed on the data to determine its suitability, understand challenges, and give a go/no-go decision early in a data-intensive endeavor. Data profiling activities provide key insights into what the data ingested by Hadoop looks like from a qualitative perspective, and assess the risk involved in integrating data with other sources before deriving any analytical insights. Sometimes, the profiling process is conducted at various stages in the analysis process to weed out bad data and to refine the analysis itself.

Data profiling plays a major role in improving the overall data quality, readability, and processability, by helping us understand the data from a business standpoint versus an analytical-insight standpoint. Building a data profiling framework within a Big Data information management platform such as Hadoop, ensures that the quality of data does not affect the results of reporting, analysis, forecasting, and other key business requirements that deal with decision making.

Traditionally, data profiling is done in relation to the data's intended use, where the purpose of using the data is defined well in advance. In Big Data projects, data may have to be used in ways not originally intended, and it has to be profiled accordingly to address how it can be repurposed; most Big Data projects deal with exploratory analytics on ill-defined data, trying to find out how the data can be used or repurposed. In order to do so, various quality measures, such as completeness, consistency, coherence, correctness, timeliness, and reasonableness, must be clearly articulated, measured, and made available to users.

In a Big Data project, metadata is collected to ascertain the data quality. This metadata includes the following:

·        Data quality attributes

·        Business rules

·        Mappings

·        Cleansing routines

·        Data element profiles and measures

The measurement of data quality in a Big Data environment takes into account the following:

·        Source of data

·        Type of data

·        The intended and unintended usage of the data

·        The user group that will consume the data and the resultant analytical artifacts

Among the preceding points, the type of data plays a vital role in the data quality requirements, as outlined in the following points:

·        Profiling structured Big Data: In a Big Data project that deals with a huge amount of structured data, an enterprise can reuse existing data quality processes built for relational databases, provided they can scale to meet the requirements of massive data volumes.

·        Profiling unstructured Big Data: Social media related Big Data projects concern themselves with quality issues that deal with entity extraction from sentences expressed in non-standard language comprising slang and abbreviations. Analytical value from social media can be extracted by correlating it with structured transactional data, so that what is happening on the social web can be mapped to an organization's internal data, such as supply chain data or customer demographics data. To perform this kind of mapping, unstructured text has to be profiled to understand the following points:

·        How to extract entities that are important for analytics

·        How much of the data is misspelled

·        What are the general abbreviations specific to a domain

·        What criteria to use for the removal of stop words

·        How to perform stemming of the words

·        How to understand the contextual meaning of words based on preceding words

Big Data profiling dimensions

Profiling of Big Data is performed across multiple dimensions, and the choice of a particular dimension typically depends on the analytical problem and on balancing the time/quality trade-off. Some of the dimensions overlap, and some are not applicable to the problem at all. The following are a few of the most important dimensions that measure the quality of Big Data:

·        Completeness: This dimension is a measure to know if you have all the data required to answer your queries. To evaluate if your dataset is complete, start by understanding the answers that you wish to seek from the data, and determine the fields needed and the percentage of complete records required to comfortably answer these questions. The following are a few of the ways completeness can be determined:

·        In cases where we have prior knowledge of the master data statistics (number of records, fields, and so on), completeness can be determined as a proportion of the ingested records to the number of master data records

·        In cases where master data statistics are not accessible, completeness is measured by the presence of NULLS in the following ways:

·        Attribute completeness: It deals with the presence of NULLS in a specific attribute

·        Tuple completeness: It deals with the number of unknown values of the attributes in a tuple

·        Value completeness: It deals with missing complete elements or attributes in semi-structured XML data

As one aspect of Big Data analysis is to ask questions that have not been asked before, checking the completeness dimension takes on a new meaning. In this scenario, you may consider performing an iterative look ahead for a generic range of questions you expect to be answered, and reason back to figure out what sort of data is needed to answer them. A straightforward record-counting mechanism can be applied to check if the total expected records exist in the Hadoop cluster; but, for data sizes spanning petabytes, this activity can be onerous and has to be performed by applying statistical sampling. If the data is found to be incomplete, the missing records can be fixed, deleted, flagged, or ignored depending on the analytical use case.
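To make the NULL-based completeness measures concrete, here is a small plain-Python sketch (illustrative only; the chapter's patterns implement such checks in Pig, and the record layout and field names here are hypothetical):

```python
def attribute_completeness(records, field):
    """Fraction of records in which 'field' is populated (non-None)."""
    non_null = sum(1 for r in records if r.get(field) is not None)
    return non_null / len(records)

def tuple_completeness(record, fields):
    """Fraction of expected attributes that are populated in one tuple."""
    known = sum(1 for f in fields if record.get(f) is not None)
    return known / len(fields)

records = [
    {"cust_id": 1, "age": 34},
    {"cust_id": 2, "age": None},
    {"cust_id": None, "age": 29},
    {"cust_id": 4, "age": 41},
]

print(attribute_completeness(records, "cust_id"))            # 0.75
print(tuple_completeness(records[1], ["cust_id", "age"]))    # 0.5
```

The same counting logic scales in Pig by grouping a relation and comparing COUNT (non-null) with COUNT_STAR (all rows).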

·        Correctness: This dimension measures the accuracy of the data. To find out if the data is accurate, you have to know what comprises inaccurate data, and this purely depends on the business context. In cases where data should be unique, duplicate data is considered inaccurate. Calculating the number of duplicate elements in data that is spread across multiple systems is a nontrivial job. The following techniques can be used to find out the measure of potential inaccuracy of data:

·        In a dataset containing discrete values, a frequency distribution can give valuable insights into the potential inaccuracy of data; a value with relatively very low frequency could probably be incorrect.

·        For strings, you can create string length distribution patterns and flag patterns with low frequency as potential suspects. Similarly, a string with atypical length could be flagged as incorrect.

·        In the case of continuous attributes, you can use descriptive statistics, such as the maximum and minimum, to flag data as inaccurate.

For Big Data projects, it is recommended to determine which subset of attributes is mandatory for accuracy, understand how much of the data should be accurate, and work on sampling the data to determine the accuracy.
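The frequency-distribution heuristic described above can be sketched in a few lines of Python (illustrative only; the threshold and sample values are assumptions):

```python
from collections import Counter

def flag_suspects(values, threshold=0.01):
    """Flag discrete values whose relative frequency falls below threshold."""
    freq = Counter(values)
    total = len(values)
    return {v for v, c in freq.items() if c / total < threshold}

# A rare, oddly formed country code surfaces as a potential inaccuracy
country_codes = ["US"] * 600 + ["UK"] * 395 + ["U$"] * 5
print(flag_suspects(country_codes))  # {'U$'}
```

Whether a flagged value is an error or a legitimate outlier is a business decision, as the note below explains.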

Note

It is important to note that in the classic "finding a needle in a haystack" kind of Big Data problem, there is a lot of analytical value hidden in inaccurate data in the form of outliers. These outliers are not to be considered inaccurate; they can be flagged and considered for further analysis in use cases such as fraud detection.

·        Coherence: This dimension measures if the data makes sense relative to itself and determines whether records relate to each other in ways that are consistent, and follow the internal logic of the dataset. The measure of coherence of a dataset can be understood by the following methods:

·        Referential integrity: This ensures that the relationships between tables remain consistent. In the Big Data context, referential integrity cannot be applied to data stored in NoSQL databases such as HBase, as there is no relational representation of the data.

·        Value integrity: This checks whether the values in a table are consistent relative to themselves. Inconsistent data can be found by comparing values with a predefined set of possible values (from the master data).
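A minimal sketch of the value-integrity check, in Python for illustration (the master set of areas and the sample values are hypothetical):

```python
def value_integrity(values, master_values):
    """Fraction of values found in the predefined master set."""
    valid = sum(1 for v in values if v in master_values)
    return valid / len(values)

master_areas = {"North", "South", "East", "West"}
areas = ["North", "South", "Sooth", "East", "West"]  # one typo: "Sooth"
print(value_integrity(areas, master_areas))  # 0.8
```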

Sampling considerations for profiling Big Data

Sampling of Big Data is done to understand the quality of the data, by analyzing only a subset of the population rather than reaching out to the entire population. One of the most important criteria to select a sample is its representativeness, which determines how closely the sampled subset resembles the population. Representativeness should be higher for an accurate result. Sampling size also has a considerable impact on the representative accuracy of the subset.

Sampling enormous volumes of data for profiling balances the cost-quality trade-offs, since it is very costly and complex to profile all of the population. Most of the time, the profiling activity is not devised as a mechanism to perform a full-fledged analysis on the entire data, but it is a first-pass analysis/discovery phase, to get the overall quality of data, from the Correctness, Coherence, and Completeness dimensions. The profiling activity is conducted iteratively as the data moves through the pipeline and helps in data refinement. For this reason, sampling of the data for profiling purposes has a very important role to play in Big Data.

While sampling is deemed necessary for profiling, it is recommended to tread with caution when applying sampling techniques to Big Data. Due to implementation complexity and inapplicability, not all data types and collection mechanisms lend themselves to sampling; this is the case for data ingested from sensors in near real time. Similarly, not all use cases may need sampling; this is the case where data is ingested for search, recommendation systems, and clickstream analytics. In these cases, the data has to be looked at in its entirety, without recourse to sampling, since sampling can introduce certain biases and reduce the accuracy of the results.

The choice of appropriate sampling techniques has an impact on the overall accuracy of profiling. These techniques include non-probabilistic and probabilistic sampling methods. Generally, we do not consider non-probabilistic methods for data profiling activities; we limit ourselves to probabilistic sampling methods, owing to their improved accuracy and lower representational bias. For a better overview of the sampling techniques, please refer to the Numerosity Reduction – Sampling Design Pattern section in Chapter 6, Understanding Data Reduction Patterns.

Sampling support in Pig

Pig has native support for sampling through the SAMPLE operator. We have used the SAMPLE operator to illustrate how it works in the profiling context, in the basic statistical profiling design pattern. The SAMPLE operator helps you pick a random sample from the population using a probabilistic algorithm; internally, it uses the simple random sampling technique. The internal algorithm is quite rudimentary and is sometimes not representative of the entire dataset being sampled. The SAMPLE operator is evolving to accommodate more sophisticated sampling algorithms. More information on the path ahead can be found at https://issues.apache.org/jira/browse/PIG-1713.
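Conceptually, SAMPLE keeps each record independently with the given probability, so the returned sample size is approximate rather than exact. The following Python sketch mimics this simple random sampling behavior (illustrative only; the population and seed are assumptions):

```python
import random

def bernoulli_sample(records, p, seed=None):
    """Keep each record independently with probability p, as Pig's SAMPLE does."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < p]

population = list(range(10000))
sample = bernoulli_sample(population, 0.2, seed=42)
print(len(sample) / len(population))  # roughly 0.2, not exactly
```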

Other ways to implement a robust sampling method within Pig are to extend it by using the UDF feature and by using Pig streaming.

Using Pig's extensibility feature, sampling can be implemented as a UDF, but it is complex and taxing to work with, since the biggest limitation of a UDF is that it takes only one input value and generates one output value.

You might also consider implementing sampling using streaming, which doesn't have the limitations of UDFs: streaming can take any number of inputs and emit any number of outputs. The language R has the functions needed to perform sampling, and you can use these functions in a Pig script through Pig streaming. The limitations of this method are that it performs the sampling computation by holding most of the data in main memory, and that R has to be installed on every data node of the Hadoop cluster for streaming to work.

The Datafu library of Pig utilities, published by LinkedIn, includes a few sampling implementations of its own. This library is now a part of Cloudera's Hadoop distribution. The following are the sampling techniques implemented by Datafu:

·        ReservoirSampling: It generates random samples of a given size by using an in-memory reservoir

·        SampleByKey: It generates a random sample from tuples based on a certain key. This internally uses the Stratified Random Sampling technique

·        WeightedSample: It generates a random sample by assigning weights

Additional information on the Datafu sampling implementation can be found at http://linkedin.github.io/datafu/docs/current/datafu/pig/sampling/package-summary.html.
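To illustrate the idea behind ReservoirSampling, here is a Python sketch of the classic Algorithm R, which keeps a uniform random sample of fixed size from a stream in a single pass (a sketch of the general technique, not Datafu's actual implementation):

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Algorithm R: uniform random sample of size k from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)        # fill the reservoir first
        else:
            j = rng.randint(0, i)         # pick a slot in [0, i]
            if j < k:
                reservoir[j] = item       # replace with probability k/(i+1)
    return reservoir

sample = reservoir_sample(range(100000), 10, seed=7)
print(len(sample))  # 10
```

Because the reservoir never exceeds k elements, memory use is constant regardless of how large the stream grows.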

Rationale for using Pig in data profiling

Implementing the profiling code within the Hadoop environment reduces the dependency on external systems for quality checks. The high-level overview of implementation is depicted in the following diagram:


Implementing profiling in Pig

The following are the advantages of performing data profiling within the Hadoop environment using Pig:

·        Implementing the design patterns in Pig reduces data movement by moving the profiling code directly to the data, resulting in performance gains and speeding up the analytics development process.

·        By implementing the pattern in Pig, the data quality effort is performed alongside the data transformation in the same environment. This alleviates manual, redundant effort in performing repetitive data quality checks whenever data is ingested into Hadoop.

·        Pig excels in situations where the ingested data's schema is unknown until runtime; its language features give data scientists flexibility to decipher the correct schemas at runtime and build prototype models.

·        Pig's inherent support for schema discovery and sampling gives it an edge in implementing profiling code within the Hadoop environment.

·        Pig has readily available functions that make writing the custom profiling code easier.

·        Pig enables automating the profiling process by chaining complex profiling workflows, which comes in handy for datasets that are periodically updated.

Now that we have understood the data profiling concepts and the rationale for using Pig for profiling, we will explore a few concrete design patterns in the following sections.

The data type inference pattern

This section describes the data type inference design pattern in which we use Pig scripts to capture important information about data types.

Background

Most of the data ingested into Hadoop has some associated metadata, which is a description of its characteristics. This metadata includes important information on the types of fields, their length, constraints, and uniqueness; it also tells us whether a field is mandatory. Metadata is also used in interpreting values by examining the scale, units of measurement, meaning of labels, and so on. Understanding the intended structure of a dataset helps in explaining its meaning, description, semantics, and data quality. This analysis of data types helps us grasp whether they are syntactically consistent (different datasets having the same consistent format specification) and semantically consistent (different datasets having the same value set).

Motivation

The intent of this design pattern is to infer data type metadata from the data ingested into Hadoop. The pattern helps you uncover type metadata, which is then compared with the actual data to see whether they disagree and whether any disagreement has far-reaching consequences for the analytics effort. Data types and attribute values are scanned and compared with the documented metadata, and based on this scanning, appropriate data types and data lengths are proposed.

This design pattern is used to review the structure of a dataset for which there is little or no existing metadata, or for which there are reasons to suspect the completeness or quality of existing metadata. The results of the pattern help to discover, document, and organize the "ground truth" regarding the dataset's metadata. Here, the results of data profiling are used to incrementally capture a knowledge base associated with data element structure, semantics, and its use.

Use cases

You can use this design pattern when an enormous volume of structured data has to be ingested and there is no documented knowledge about the dataset. It also applies when the undocumented data must be used for further analysis, or when deeper knowledge is needed of the domain's business terms, related data elements, their definitions, the reference datasets used, and the structure of the attributes in the dataset.

Pattern implementation

This pattern is implemented as a standalone Pig script that internally uses a Java UDF. The core concept in the implementation of this pattern is the discovery of the dominant data type in a column. First, the column values are examined to determine whether they are of the int, long, double, string, or boolean type. After the values are evaluated, each data type is grouped together to find its frequency. From this analysis, we can find out which is the dominant (most frequent) data type.
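The dominant-type discovery can be sketched in plain Python as follows (illustrative only; the chapter's actual logic lives in the Java UDF, and the int/long boundary used here is an assumption based on 32-bit integers):

```python
from collections import Counter

def infer_type(value):
    """Infer the narrowest type a string value fits: int, long, double, boolean, or string."""
    if value.lower() in ("true", "false"):
        return "boolean"
    try:
        n = int(value)
        return "int" if -2**31 <= n < 2**31 else "long"
    except ValueError:
        pass
    try:
        float(value)
        return "double"
    except ValueError:
        return "string"

def dominant_type(values):
    """Most frequent inferred type across a column's values."""
    counts = Counter(infer_type(v) for v in values)
    return counts.most_common(1)[0][0]

print(infer_type("42"))                        # int
print(infer_type("3.14"))                      # double
print(dominant_type(["1", "2", "abc", "3"]))   # int
```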

Code snippets

To illustrate the working of this pattern, we have considered the retail transactions dataset stored on the Hadoop Distributed File System (HDFS). It contains attributes such as Transaction ID, Transaction date, Customer ID, Phone Number, Product, Product subclass, Product ID, Sales Price, and Country Code. For this pattern, we are interested in the values of the attribute Customer ID.

Pig script

The following is the Pig script illustrating the implementation of this pattern:

/*

Register the datatypeinferer and custom storage jar files

*/

REGISTER '/home/cloudera/pdp/jars/datatypeinfererudf.jar';

REGISTER '/home/cloudera/pdp/jars/customdatatypeinfererstorage.jar';

/*

Load the transactions dataset into the relation transactions

*/

transactions = LOAD '/user/cloudera/pdp/datasets/data_profiling/transactions.csv' USING PigStorage(',') AS (transaction_id:long, transaction_date:chararray, cust_id:chararray, age:chararray, area:chararray, prod_subclass:int, prod_id:long, amt:int, asset:int, sales_price:int, phone_no:chararray, country_code:chararray);

/*

Infer the data type of the field cust_id by invoking the DataTypeInfererUDF.

It returns a tuple with the inferred data type.

*/

data_types = FOREACH transactions GENERATE com.profiler.DataTypeInfererUDF(cust_id) AS inferred_data_type;

/*

Compute the count of each data type, total count, percentage.

The data type with the highest count is considered as dominant data type

*/

grpd = GROUP data_types BY inferred_data_type;

inferred_type_count = FOREACH grpd GENERATE group AS inferred_type, COUNT(data_types) AS count;

grpd_inf_type_count_all = GROUP inferred_type_count ALL;

total_count = FOREACH grpd_inf_type_count_all GENERATE SUM(inferred_type_count.count) AS tot_sum, MAX(inferred_type_count.count) AS max_val;

percentage = FOREACH inferred_type_count GENERATE inferred_type AS type, count AS total_cnt, CONCAT((chararray)ROUND(count*100.0/total_count.tot_sum), '%') AS percent, (count == total_count.max_val ? 'Dominant' : 'Other') AS inferred_dominant_other_datatype;

percentage_ord = ORDER percentage BY inferred_dominant_other_datatype ASC;

/*

CustomDatatypeInfererStorage UDF extends the StoreFunc. All the abstract methods have been overridden to implement logic that writes the contents of the relation into a file in a custom report like format.

The results are stored on the HDFS in the directory datatype_inferer

*/

STORE percentage_ord INTO '/user/cloudera/pdp/output/data_profiling/datatype_inferer' USING com.profiler.CustomDatatypeInfererStorage('cust_id', 'chararray');

Java UDF

The following is the Java UDF code snippet:

@Override

  public String exec(Tuple tuples) throws IOException {

    String value = (String) tuples.get(0);

    String inferredType = null;

    try {

/*if tuples.get(0) is null it returns null else invokes getDataType() method to infer the datatype

      */

      inferredType = value != null ? getDataType(value) : null;

    } catch (Exception e) {

      e.printStackTrace();

    }

    // returns the inferred datatype of the input value

    return inferredType;

  }

Results

The following is the result of applying the design pattern on the transactions data:

Column Name :  cust_id

Defined Datatype :  chararray

Inferred Dominant Datatype(s):  int, Count: 817740 Percentage: 100%

In the previous result, the input data column, cust_id, is evaluated to check if the values accurately reflect the defined data type. At the ingestion stage, the data type is defined as chararray. By using the data type inference design pattern, the data type of the values in the cust_id column is inferred as an integer.

Additional information

The complete code and datasets for this section are in the following GitHub directories:

·        Chapter3/code/

·        Chapter3/datasets/

The basic statistical profiling pattern

This section describes the basic statistical profiling design pattern in which we use Pig scripts to apply statistical functions to capture important information about data quality.

Background

The previous design pattern depicts one way of inferring the data type. The next logical step in the data profiling process is to evaluate the quality metrics of the values. This is done by collecting and analyzing the data by applying statistical methods. These statistics provide a high-level overview of the suitability of the data for a particular analytical problem, and uncover potential problems early in the data lifecycle management.

Motivation

The basic statistical profiling design pattern helps to create data quality metadata that includes basic statistics, such as mean, median, mode, maximum, minimum, and standard deviation. These statistics give you a complete snapshot of a data field, and tracking them over time gives insights into the characteristics of new data being ingested by the Hadoop cluster. Basic statistics of new data can be checked before ingesting it into Hadoop, to be forewarned about inconsistent data and to help prevent adding low-quality data.

This design pattern tries to address the following profiling requirements:

·        Range analysis methods scan values and determine if the data is subject to a total ordering, and also determine whether the values are constrained within a well-defined range

·        The sparseness of the data can be evaluated to find the percentage of the elements that are not populated

·        The cardinality of the dataset can be analyzed by finding the number of distinct values that appear within the data

·        The uniqueness can be evaluated to figure out if each value assigned to the attribute is indeed exclusive

·        Overloading of the data can be evaluated to check if the attribute is being used for multiple purposes

·        Format evaluation can be done by resolving unrecognized data into defined formats
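Several of the requirements above (sparseness, cardinality, and uniqueness) reduce to simple counting. A plain-Python sketch, for illustration only (the chapter's implementation performs these computations in Pig):

```python
def profile_column(values):
    """Basic profile: sparseness, cardinality, and uniqueness of one column."""
    non_null = [v for v in values if v is not None]
    distinct = set(non_null)
    return {
        "sparseness": (len(values) - len(non_null)) / len(values),  # fraction unpopulated
        "cardinality": len(distinct),                               # number of distinct values
        "unique": len(distinct) == len(non_null),                   # every value exclusive?
    }

print(profile_column([10, 20, None, 20, 30]))
# {'sparseness': 0.2, 'cardinality': 3, 'unique': False}
```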

Use cases

The following are the use cases where the basic statistical profiling design pattern can be applied:

·        This design pattern can be used to detect anomalies in the dataset, by empirically analyzing the values in a dataset to look for unexpected behaviors. This pattern examines the dataset's frequency distributions, the variance, percentage of data logged, and its relationships, to reveal potential flawed data values.

·        One common use case where this design pattern can be applied is when data is ingested into the Hadoop cluster from legacy data sources that are still actively used. In legacy systems such as mainframes, programmers often designed shortcuts and encodings during the data creation process, and overloaded a particular field for purposes that are no longer used or understood. When such data is ingested into Hadoop, the basic statistical profiling design pattern can help uncover these issues.

Pattern implementation

This design pattern is implemented in Pig as a standalone script, which internally uses a macro to pass parameters and retrieve the answers. Pig Latin has a set of Math functions that can be directly applied to a column of data. Data is first loaded into the Pig relation and then the relation is passed as a parameter to the getProfile macro. This macro iterates over the relation and applies the Math function to each of the columns. The getProfile macro is designed to be modular and can be applied across various datasets to get a good understanding of the data profile.

Code snippets

To illustrate the working of this pattern, we have considered the retail transactions dataset stored on the HDFS. It contains attributes, such as Transaction ID, Transaction date, Customer ID, Phone Number, Product, Product subclass, Product ID, Sales Price, and Country Code. For this pattern, we will be profiling the values of the attribute Sales Price.

Pig script

The following is the Pig script illustrating the implementation of this pattern:

/*

Register the datafu and custom storage jar files

*/

REGISTER '/home/cloudera/pdp/jars/datafu.jar';

REGISTER '/home/cloudera/pdp/jars/customprofilestorage.jar';

/*

Import macro defined in the file numerical_profiler_macro.pig

*/

IMPORT '/home/cloudera/pdp/data_profiling/numerical_profiler_macro.pig';

/*

Load the transactions dataset into the relation transactions

*/

transactions = LOAD '/user/cloudera/pdp/datasets/data_profiling/transactions.csv' USING PigStorage(',') AS (transaction_id:long, transaction_date:datetime, cust_id:long, age:chararray, area:chararray, prod_subclass:int, prod_id:long, amt:int, asset:int, sales_price:int, phone_no:chararray, country_code:chararray);

/*

Use the SAMPLE operator to pick a subset of the data; approximately 20% of the records are returned as a sample

*/

sample_transactions = SAMPLE transactions 0.2;

/*

Invoke the macro getProfile with the parameters sample_transactions which contains a sample of the dataset and the column name on which the numerical profiling has to be done.

The macro performs numerical profiling on the sales_price column and returns various statistics like variance, standard deviation, row count, null count, distinct count and mode

*/

result =  getProfile(sample_transactions,'sales_price');

/*

CustomProfileStorage UDF extends the StoreFunc. All the abstract methods have been overridden to implement logic that writes the contents of the relation into a file in a custom report like format.

The results are stored on the HDFS in the directory numeric

*/

STORE result INTO '/user/cloudera/pdp/output/data_profiling/numeric' USING com.profiler.CustomProfileStorage();

Macro

The following is the Pig script showing the implementation of the getProfile macro:

/*

Define alias VAR for the function datafu.pig.stats.VAR

*/

DEFINE VAR datafu.pig.stats.VAR();

/*

Define the macro, specify the input parameters and the return value

*/

DEFINE getProfile(data,columnName) returns numerical_profile{

/*

Calculate the variance, standard deviation, row count, null count and distinct count for the column sales_price

*/

data_grpd = GROUP $data ALL;

numerical_stats = FOREACH data_grpd 

{

  variance = VAR($data.$columnName);

  stdDeviation = SQRT(variance);

  rowCount = COUNT_STAR($data.$columnName);

  nullCount = COUNT($data.$columnName);

  uniq = DISTINCT $data.$columnName;

  GENERATE 'Column Name', '$columnName' AS colName,
    'Row Count', rowCount,
    'Null Count', (rowCount - nullCount),
    'Distinct Count', COUNT(uniq),
    'Highest Value', MAX($data.$columnName) AS max_numerical_count,
    'Lowest Value', MIN($data.$columnName) AS min_numerical_count,
    'Total Value', SUM($data.$columnName) AS total_numerical_count,
    'Mean Value', AVG($data.$columnName) AS avg_numerical_count,
    'Variance', variance AS variance,
    'Standard Deviation', stdDeviation AS stdDeviation,
    'Mode' AS modeName, 'NONE' AS modevalue;

}

/*

Compute the mode of the column sales_price

*/

groupd = GROUP $data BY $columnName;

groupd_count = FOREACH groupd GENERATE 'Mode' AS modeName, group AS mode_values, (long) COUNT($data) AS total;

groupd_count_all = GROUP groupd_count ALL;

frequency = FOREACH groupd_count_all GENERATE MAX(groupd_count.total) AS fq;

filterd = FILTER groupd_count BY (total == frequency.fq AND total > 1 AND mode_values IS NOT NULL);

mode  = GROUP filterd BY modeName;

/*

Join relations numerical stats and mode. Return these values

*/

$numerical_profile = JOIN numerical_stats BY modeName FULL, mode BY group;

};

Results

By using the basic statistical profiling pattern, the following results are obtained:

Column Name: sales_price

Row Count: 163794

Null Count: 0

Distinct Count: 1446

Highest Value: 70589

Lowest Value: 1

Total Value: 21781793

Mean Value: 132.98285040966093

Variance: 183789.18332067598

Standard Deviation: 428.7064069041609

Mode: 99

The previous results summarize the properties of the data: its row count, null count, and number of distinct values. We also learn about the key characteristics of the data with respect to central tendency and dispersion. Mean and mode are two of the measures of central tendency, while variance is one measure of data dispersion.
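As a cross-check of the macro's arithmetic, the same statistics can be sketched in a few lines of plain Python. This is illustrative code, not part of the book's toolchain; the function name numerical_profile is made up, and the variance here is the population variance (divide by n):

```python
import math
from collections import Counter

def numerical_profile(values):
    """Compute the basic statistics the Pig macro produces.
    `values` may contain None entries standing in for nulls."""
    non_null = [v for v in values if v is not None]
    n = len(non_null)
    mean = sum(non_null) / n
    # Population variance (divide by n); use n-1 for sample variance
    variance = sum((v - mean) ** 2 for v in non_null) / n
    counts = Counter(non_null)
    top = max(counts.values())
    # Mode: the most frequent value(s), only meaningful when a value repeats
    modes = sorted(v for v, c in counts.items() if c == top and c > 1)
    return {
        "row_count": len(values),
        "null_count": len(values) - n,
        "distinct_count": len(counts),
        "max": max(non_null),
        "min": min(non_null),
        "sum": sum(non_null),
        "mean": mean,
        "variance": variance,
        "std_dev": math.sqrt(variance),
        "mode": modes,
    }

profile = numerical_profile([99, 99, 120, 250, None, 1])
```

Running this over a column extracted from the transactions file should agree with the macro's output for the corresponding fields.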

Additional information

The complete code and datasets for this section are in the following GitHub directories:

·        Chapter3/code/

·        Chapter3/datasets/

The pattern-matching pattern

This section describes the pattern-matching design pattern, in which we use Pig scripts to match numeric and text patterns to ascertain whether the data is coherent relative to itself, and thus get a measure of data quality.

Background

In the enterprise context, examining the data for coherence comes after the data has been ingested and its completeness and correctness have been ascertained. The values of a given attribute can come in different shapes and sizes. This is especially true for fields requiring human input, where the values are entered according to the whims of the user. Assuming a column representing the phone number field is coherent, it can be said that all the values represent valid phone numbers since they match the expected format, length, and data type (numeric), thus meeting the expectation of the system. Representing data in an incorrect format leads to inaccurate analytics, and in the Big Data context, the sheer volume of data can amplify this inaccuracy.

Motivation

Profiling the data from the pattern-matching perspective measures the consistency of the data and the amount of data matching an expected pattern. This profiling process finds out whether the values are consistent relative to themselves by comparing them with a predefined set of possible values. It captures the essence of the data and tells you whether a field is completely numeric or has a consistent length. It also gives other format-specific information about the data. Pattern evaluation is done by resolving unrecognized data into defined formats. Abstract type recognition is done on the data to perform a semantic data-type association based on pattern analysis and usage. Identifying the percentage of mismatched patterns at an earlier stage of the analytics cycle ensures better cleaning of data and reduced effort.

Use cases

This design pattern can be used to profile numerical or string data that is supposed to match a particular pattern.

Pattern implementation

This design pattern is implemented in Pig as a standalone script. This script attempts to discover patterns in the data and the common types of records by analyzing the string of data stored in the attribute. It generates several patterns that match the values in the attribute, and reports the percentages of the data that follows each candidate pattern. The script primarily performs the following tasks:

·        Discovers the patterns in the tuples and calculates the count and percentage of each of them

·        Examines the discovered patterns and classifies them as valid or invalid

Code snippets

To illustrate the working of this pattern, we have considered the retail transactions dataset stored in HDFS. It contains attributes such as Transaction ID, Transaction date, Customer ID, Phone Number, Product, Product subclass, Product ID, Sales Price, and Country Code. For this pattern, we are interested in the values of the attribute Phone Number.

Pig script

The following is the Pig script illustrating the implementation of this pattern:

/*

Import macro defined in the file pattern_matching_macro.pig

*/

IMPORT '/home/cloudera/pdp/data_profiling/pattern_matching_macro.pig';

/*

Load the dataset transactions.csv into the relation transactions

*/

transactions = LOAD '/user/cloudera/pdp/datasets/data_profiling/transactions.csv' USING PigStorage(',') AS (transaction_id:long, transaction_date:datetime, cust_id:long, age:chararray, area:chararray, prod_subclass:int, prod_id:long, amt:int, asset:int, sales_price:int, phone_no:chararray, country_code:chararray);

/*

Invoke the macro and pass the relation transactions and the column phone_no as parameters to it.

The pattern matching is performed on the column that is passed.

This macro returns the phone number pattern, its count and the percentage

*/

result = getPatterns(transactions, 'phone_no');

/*

Split the relation result into the relation valid_patterns if the phone number pattern matches either of the two regular expressions. The patterns that do not match any of the regexes are stored in the relation invalid_patterns

*/

SPLIT result INTO valid_patterns IF (phone_number MATCHES '([0-9]{3}-[0-9]{3}-[0-9]{4})' OR phone_number MATCHES '([0-9]{10})'), invalid_patterns OTHERWISE;

/*

The results are stored on the HDFS in the directories valid_patterns and invalid_patterns

*/

STORE valid_patterns INTO '/user/cloudera/pdp/output/data_profiling/pattern_matching/valid_patterns';

STORE invalid_patterns INTO '/user/cloudera/pdp/output/data_profiling/pattern_matching/invalid_patterns';

Macro

The following is the Pig script showing the implementation of the getPatterns macro:

/*

Define the macro, specify the input parameters and the return value

*/

DEFINE getPatterns(data,phone_no) returns percentage{

/*

Iterate over each row of the phone_no column and transform each value by replacing every digit with 9 and every letter with a, to form uniform patterns

*/

transactions_replaced = FOREACH $data 

{

  replace_digits = REPLACE($phone_no,'\\d','9');

  replace_alphabets = REPLACE(replace_digits,'[a-zA-Z]','a');

  replace_spaces = REPLACE(replace_alphabets,'\\s','');

  GENERATE replace_spaces AS phone_number_pattern;

}

/*

Group by phone_number_pattern and calculate count of each pattern

*/

grpd_ph_no_pattern = GROUP transactions_replaced BY phone_number_pattern;

phone_num_count = FOREACH grpd_ph_no_pattern GENERATE group AS phone_num, COUNT(transactions_replaced.phone_number_pattern) AS phone_count;

/*

Compute the total count and percentage.

Return the relation percentage with the fields phone number pattern, count and the rounded percentage

*/

grpd_ph_no_cnt_all = GROUP phone_num_count ALL;

total_count = FOREACH grpd_ph_no_cnt_all GENERATE SUM(phone_num_count.phone_count) AS tot_sum;

$percentage = FOREACH phone_num_count GENERATE phone_num AS phone_number, phone_count AS phone_number_count, CONCAT((chararray)ROUND(phone_count*100.0/total_count.tot_sum),'%') AS percent;

};
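Outside Pig, the macro's masking-and-counting logic can be sketched in Python. This is an illustrative stand-in, not the book's code; get_patterns is a made-up name, and the regexes in the validity check are the same two used by the SPLIT statement above:

```python
import re
from collections import Counter

def get_patterns(values):
    """Mask each value the way the getPatterns macro does:
    every digit becomes 9, every letter becomes a, whitespace is removed."""
    masked = []
    for value in values:
        pattern = re.sub(r"\d", "9", value)
        pattern = re.sub(r"[a-zA-Z]", "a", pattern)
        pattern = re.sub(r"\s", "", pattern)
        masked.append(pattern)
    total = len(masked)
    # Report each candidate pattern with its count and rounded percentage
    return {p: (c, round(c * 100.0 / total)) for p, c in Counter(masked).items()}

stats = get_patterns(["555-123-4567", "5551234567", "CALL ME", "555-123-456x"])
# Pig's MATCHES is a whole-string match, so re.fullmatch mirrors the SPLIT
valid = {p for p in stats
         if re.fullmatch(r"[0-9]{3}-[0-9]{3}-[0-9]{4}|[0-9]{10}", p)}
```

Masking collapses every concrete value into a small set of shape patterns, which is what makes counting and validating them cheap even at scale.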

Results

The following is the result of applying the design pattern on the transactions data. The results are stored in the folders valid_patterns and invalid_patterns.

Output in the folder valid_patterns is as follows:

9999999999  490644    60%

999-999-9999  196257    24%

Output in the folder invalid_patterns is as follows:

99999         8177  1%

aaaaaaaaaa  40887  5%

999-999-999a  40888  5%

aaa-aaa-aaaa  40887  5%

The previous results give us a snapshot of all the phone number patterns that exist in the dataset, along with their counts and percentages. Using this data, we can determine the percentage of inaccurate data in the dataset and take the necessary measures in the data-cleansing stage. Since 999-999-9999 is a frequently occurring valid pattern, you can derive a rule that requires all values in this attribute to conform to it; this rule can then be applied in the data-cleansing phase.

Additional information

The complete code and datasets for this section are in the following GitHub directories:

·        Chapter3/code/

·        Chapter3/datasets/

The string profiling pattern

This section describes the string profiling design pattern, in which we use Pig scripts on textual data to gather important statistics.

Background

A majority of Big Data implementations deal with text data embedded in columns. To gain insight from these columns, they have to be integrated with other enterprise-structured data. This design pattern elaborates a few of the ways that help understand the quality of textual data.

Motivation

The quality of textual data can be ascertained by applying basic statistical techniques to the values of the attributes. String length is one of the most important dimensions in selecting the appropriate data types and sizes for the target system. You can use the maximum and minimum string lengths to determine, at a glance, whether the data ingested into Hadoop meets a given constraint. When dealing with data sizes in the petabyte range, limiting the character count to be just large enough optimizes storage and computation by cutting down on unnecessary storage space.

Using the string lengths, you can also determine distinct lengths of individual strings in a column and the percentage of rows in the table that each length represents.

For example, a column representing US state codes is expected to contain two-character values; if the gathered profile shows distinct lengths other than two characters, the values in the column are not coherent.
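The length-distribution idea can be sketched in a few lines of Python (illustrative only; length_profile is a made-up name):

```python
from collections import Counter

def length_profile(values):
    """Distinct string lengths in a column and the share of rows per length."""
    lengths = Counter(len(v) for v in values)
    total = sum(lengths.values())
    return {n: round(c * 100.0 / total) for n, c in sorted(lengths.items())}

# A state-code column expected to hold only two-character values
profile = length_profile(["CA", "NY", "TX", "Calif", "N.Y."])
```

Any key other than 2 in the resulting dictionary flags rows that violate the two-character constraint.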

Use cases

This pattern can be applied on data columns that predominantly contain text, to find out whether the text is within the defined constraints.

Pattern implementation

This design pattern is implemented in Pig as a standalone script, which internally uses a macro to retrieve the profile. Pig Latin has a set of math functions that can be directly applied to a column of data. Data is first loaded into the Pig relation transactions, and then the relation is passed as a parameter to the getStringProfile macro. This macro iterates over the relation and applies the math functions to each of the values. The getStringProfile macro is designed to be modular and can be applied across various text columns to get a good understanding of the string data profile.

Code snippets

To illustrate the working of this pattern, we have considered the retail transactions dataset stored in HDFS. It contains attributes such as Transaction ID, Transaction date, Customer ID, Phone Number, Product, Product subclass, Product ID, Sales Price, and Country Code. For this pattern, we are interested in the values of the attribute Country Code.

Pig script

The following is the Pig script illustrating the implementation of this pattern:

/*

Register the datafu and custom storage jar files

*/

REGISTER '/home/cloudera/pdp/jars/datafu.jar';

REGISTER '/home/cloudera/pdp/jars/customprofilestorage.jar';

/*

Import macro defined in the file string_profiler_macro.pig

*/

IMPORT '/home/cloudera/pdp/data_profiling/string_profiler_macro.pig';

/*

Load the transactions dataset into the relation transactions

*/

transactions = LOAD '/user/cloudera/pdp/datasets/data_profiling/transactions.csv' USING PigStorage(',') AS (transaction_id:long, transaction_date:datetime, cust_id:long, age:chararray, area:chararray, prod_subclass:int, prod_id:long, amt:int, asset:int, sales_price:int, phone_no:chararray, country_code:chararray);

/*

Invoke the macro getStringProfile with the parameters transactions and the column name on which the string profiling has to be done.

The macro performs string profiling on the country_code column and returns various statistics like row count, null count, total character count, word count, identifies distinct country codes in the dataset and calculates their count and percentage.

*/

result = getStringProfile(transactions, 'country_code');

/*

CustomProfileStorage UDF extends the StoreFunc. All the abstract methods have been overridden to implement logic that writes the contents of the relation into a file in a custom report like format.

The results are stored on the HDFS in the directory string

*/

STORE result INTO '/user/cloudera/pdp/output/data_profiling/string' USING com.profiler.CustomProfileStorage();

Macro

The following is the Pig script showing the implementation of the getStringProfile macro:

/*

Define the macro, specify the input parameters and the return value

*/

DEFINE getStringProfile(data,columnName) returns string_profile{

/*

Calculate row count and null count on the column country_code

*/

data_grpd = GROUP $data ALL;

string_stats = FOREACH data_grpd 

{

  rowCount = COUNT_STAR($data.$columnName);

  nullCount = COUNT($data.$columnName);

  GENERATE 'Column Name','$columnName' AS colName,'Row Count',rowCount,'Null Count',(rowCount - nullCount),'Distinct Values' AS dist,'NONE' AS distvalue;

}

/*

Calculate total char count, max chars, min chars, avg chars on the column country_code

*/

size = FOREACH $data GENERATE SIZE($columnName) AS chars_count;

size_grpd_all = GROUP size ALL;

char_stats = FOREACH size_grpd_all GENERATE 'Total Char Count',SUM(size.chars_count) AS total_char_count,'Max Chars',MAX(size.chars_count) AS max_chars_count,'Min Chars',MIN(size.chars_count) AS min_chars_count,'Avg Chars',AVG(size.chars_count) AS avg_chars_count,'Distinct Values' AS dist,'NONE' AS distvalue;

/*

Calculate total word count, max words and min words on the column country_code

*/

words = FOREACH $data GENERATE FLATTEN(TOKENIZE($columnName)) AS word;

whitespace_filtrd_words = FILTER words BY word MATCHES '\\w+';

grouped_words = GROUP whitespace_filtrd_words BY word;

word_count = FOREACH grouped_words GENERATE COUNT(whitespace_filtrd_words) AS count, group AS word;

word_count_grpd_all = GROUP word_count ALL;

words_stats = FOREACH word_count_grpd_all GENERATE 'Word Count',SUM(word_count.count) AS total_word_count,'Max Words',MAX(word_count.count) AS max_count,'Min Words',MIN(word_count.count) AS min_count,'Distinct Values' AS dist,'NONE' AS distvalue;

/*

Identify distinct country codes and their count

*/

grpd_data = GROUP $data BY $columnName;

grpd_data_count = FOREACH grpd_data GENERATE group AS country_code, COUNT($data.$columnName) AS country_count;

/*

Calculate the total sum of all the counts

*/

grpd_data_cnt_all = GROUP grpd_data_count ALL;

total_count = FOREACH grpd_data_cnt_all GENERATE SUM(grpd_data_count.country_count) AS tot_sum;

/*

Calculate the percentage of the distinct country codes

*/

percentage = FOREACH grpd_data_count GENERATE country_code AS country_code, country_count AS country_code_cnt, ROUND(country_count*100.0/total_count.tot_sum) AS percent, 'Distinct Values' AS dist;

/*

Join string stats, char_stats, word_stats and the relation with distinct country codes, their count and the rounded percentage. Return these values

*/

percentage_grpd = GROUP percentage BY dist;

$string_profile = JOIN string_stats BY dist, char_stats BY dist, words_stats BY dist, percentage_grpd BY group;

};
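The distinct-value summary at the tail of the macro (grpd_data_count through percentage) can be sketched in Python as well. This is illustrative code with a made-up function name, not the book's implementation:

```python
from collections import Counter

def distinct_value_shares(values):
    """Distinct values in a column with their counts and rounded percentages,
    mirroring the tail of the getStringProfile macro."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: (c, round(c * 100.0 / total)) for v, c in counts.items()}

codes = ["US", "US", "USA", "U.S", "United States"]
shares = distinct_value_shares(codes)
```

Applied to the country_code column, the output is the same kind of value/count/percentage listing shown in the results that follow.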

Results

By using the string profiling pattern, the following results are obtained:

Column Name: country_code

Row Count: 817740

Null Count: 0

Total Char Count: 5632733

Max Chars: 24

Min Chars: 2

Avg Chars: 6.888171056815125

Word Count: 999583

Max Words: 181817

Min Words: 90723

Distinct Values

country_code                Count     Percentage

US                          181792    22%

U.S                         90687     11%

USA                         181782    22%

U.S.A                       90733     11%

America                     90929     11%

United States               91094     11%

United States of America    90723     11%

The previous results summarize the properties of the data, such as its row count, number of occurrences of null, and total number of characters; the Max chars and Min chars count can be used to validate the data quality by checking if the length of the values is within a range. As per the metadata, the valid value for a country code should be two characters, but the results show that the maximum character count is 24, which implies that the data is inaccurate. The results in the Distinct values section show the distinct country codes in the dataset with their count and percentage. Using these results we can determine the percentage of inaccurate data in the dataset and take necessary measures in the data-cleansing stage.

Additional information

The complete code and datasets for this section are in the following GitHub directories:

·        Chapter3/code/

·        Chapter3/datasets/

The unstructured text profiling pattern

This section describes the unstructured text profiling design pattern, in which we use Pig scripts on free-form text data to gather important statistics.

Background

Text mining is done on unstructured data ingested by Hadoop to extract interesting and non-trivial meaningful patterns, from blocks of meaningless data. Text mining is an interdisciplinary field, which draws on information retrieval, data mining, machine learning, statistics, and computational linguistics, to accomplish the extraction of meaningful patterns from text. Typically, the parallel-processing power of Hadoop is used to process massive amounts of textual data, to classify documents, cluster tweets, build ontologies, extract entities, perform sentiment analysis, and so on.

This pattern discusses a way of ascertaining the quality of text data using text pre-processing techniques, such as stopword removal, stemming, and TF-IDF.

Motivation

Unstructured text is inherently inconsistent and the inconsistencies can result in inaccurate analytics. Inconsistencies in textual data arise due to the fact that there are many ways of representing an idea.

Text pre-processing enhances the quality of data to improve the accuracy of analytics and reduces the difficulty of the text mining process. The following are the steps to accomplish text pre-processing:

·        One of the first steps in text pre-processing is to convert the blocks of text into tokens to remove punctuations, hyphens, brackets, and so on, and keep only the meaningful keywords, abbreviations, and acronyms for further processing. Tokenization involves a measure of a document's consistency, as there is a linear proportionality between the number of meaningless tokens eliminated and the inconsistency of data relative to itself.

·        Stopword removal is the next logical step performed in text pre-processing. This step involves removal of words that do not provide any meaning or context to the document. These words are known as stop words and they generally contain pronouns, articles, prepositions, and so on. A stop word list is defined before actually removing them from the original text. This list can include other words that are specific to a particular domain.

·        The stemming step reduces multiple forms of a word to its root form by removing or converting a word into its base word (stem). For example, agreed, agreeing, disagree, agreement, and disagreement are stemmed (depending on the specific stemming algorithm) to the word agree. This is done in order to make all the tokens in the corpus consistent.

·        After the stemming process is complete, the tokens are assigned weights relative to the frequency of their occurrence by calculating the term frequency. This statistic denotes the number of times a word occurs in a document. Inverse document frequency is calculated to know the frequency of the word across all the documents. This statistic determines if a word is common or rare across all the documents. Finding the term frequency and inverse document frequency has a bearing on the quality of the text, since these statistics tell you if you can discard or use a word based on its importance relative to the document or the corpus.
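The first two pre-processing steps above (tokenization and stopword removal) can be sketched in Python. This is an illustrative fragment, not the book's pipeline; the stop word list here is a tiny made-up sample, and stemming is deliberately omitted since the chapter handles it with a Porter-stemmer UDF:

```python
# A tiny illustrative stop word list; real lists are much longer
STOPWORDS = {"the", "a", "an", "of", "is", "and", "in", "to", "by"}

def preprocess(text):
    """Tokenize, strip surrounding punctuation, lowercase, drop stop words.
    Stemming is left to a stemmer such as the Porter algorithm used later."""
    tokens = [t.strip(".,;:!?()[]\"'-").lower() for t in text.split()]
    return [t for t in tokens if t and t not in STOPWORDS]

tokens = preprocess("The Analytical Engine, designed by Babbage, is a computer.")
```

Only the content-bearing words survive, which is exactly what makes the downstream term-frequency statistics meaningful.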

Use cases

This design pattern can be used in cases that require understanding the quality of unstructured text corpora through text pre-processing techniques. This design pattern is not exhaustive and covers a few important aspects of text pre-processing and its applicability to data profiling.

Pattern implementation

This design pattern is implemented in Pig as a standalone script, which internally uses the unstructuredtextprofiling Java UDF to perform stemming and to generate the term frequency and inverse document frequency of the words. The script performs a right outer join to remove the stop words. The list of stop words is first loaded into a relation from an external text file and then used in the outer join.

Stemming is done through the usage of the Porter Stemmer algorithm implemented in the unstructuredtextprofiling JAR file.

Code snippets

To demonstrate the working of this pattern, we have considered the text corpus of Wikipedia, stored in a folder accessible to the HDFS. This sample corpus consists of the wiki pages related to Computer Science and Information Technology.

Pig script

The following is the Pig script illustrating the implementation of this pattern:

/*

Register custom text profiler jar

*/

REGISTER '/home/cloudera/pdp/jars/unstructuredtextprofiler.jar';

/*

Load stop words into the relation stop_words_list

*/

stop_words_list = LOAD '/user/cloudera/pdp/datasets/data_profiling/text/stopwords.txt' USING PigStorage();

/*

Tokenize the stopwords to extract the words

*/

stopwords = FOREACH stop_words_list GENERATE FLATTEN(TOKENIZE($0));

/*

Load the dataset into the relations doc1 and doc2.

Tokenize to extract the words for each of these documents

*/

doc1 = LOAD '/user/cloudera/pdp/datasets/data_profiling/text/computer_science.txt' AS (words:chararray);

docWords1 = FOREACH doc1 GENERATE 'computer_science.txt' AS documentId, FLATTEN(TOKENIZE(words)) AS word;

doc2 = LOAD '/user/cloudera/pdp/datasets/data_profiling/text/information_technology.txt' AS (words:chararray);

docWords2 = FOREACH doc2 GENERATE 'information_technology.txt' AS documentId, FLATTEN(TOKENIZE(words)) AS word;

/*

Combine the relations using the UNION operator

*/

combined_docs = UNION docWords1, docWords2;

/*

Perform pre-processing by doing the following

Convert the data into lowercase

Remove stopwords

Perform stemming by calling a custom UDF; it uses the Porter stemmer algorithm to perform stemming

*/

lowercase_data = FOREACH combined_docs GENERATE documentId AS documentId, FLATTEN(TOKENIZE(LOWER($1))) AS word;

joind = JOIN stopwords BY $0 RIGHT OUTER, lowercase_data BY $1;

stop_words_removed = FILTER joind BY $0 IS NULL;

processed_data = FOREACH stop_words_removed GENERATE documentId AS documentId, com.profiler.unstructuredtextprofiling.Stemmer($2) AS word;

/*

Calculate word count per word/doc combination using the Group and FOREACH statement and the result is stored in word_count

*/

grpd_processed_data = GROUP processed_data BY (word, documentId);

word_count = FOREACH grpd_processed_data GENERATE group AS wordDoc, COUNT(processed_data) AS wordCount;

/*

Calculate Total word count per document using the Group and FOREACH statement and the result is stored in total_docs_wc

*/

grpd_wc = GROUP word_count BY wordDoc.documentId;

grpd_wc_all = GROUP grpd_wc ALL;

total_docs = FOREACH grpd_wc_all GENERATE FLATTEN(grpd_wc), COUNT(grpd_wc) AS totalDocs;

total_docs_wc = FOREACH total_docs GENERATE FLATTEN(word_count), SUM(word_count.wordCount) AS wordCountPerDoc, totalDocs;

/*

Calculate the total document count per word using the GROUP and FOREACH statements; the result is stored in doc_count_per_word

*/

grpd_total_docs_wc = GROUP total_docs_wc BY wordDoc.word;

doc_count_per_word = FOREACH grpd_total_docs_wc GENERATE FLATTEN(total_docs_wc), COUNT(total_docs_wc) AS docCountPerWord;

/*

Calculate tfidf by invoking custom Java UDF.

The overall relevancy of a document with respect to a term is computed and the resultant data is stored in gen_tfidf

*/

gen_tfidf = FOREACH doc_count_per_word GENERATE $0.word AS word, $0.documentId AS documentId, com.profiler.unstructuredtextprofiling.GenerateTFIDF(wordCount, wordCountPerDoc, totalDocs, docCountPerWord) AS tfidf;

/*

Order by relevancy

*/

orderd_tfidf = ORDER gen_tfidf BY word ASC, tfidf DESC;

/*

The results are stored on the HDFS in the directory tfidf

*/

STORE orderd_tfidf INTO '/user/cloudera/pdp/output/data_profiling/unstructured_text_profiling/tfidf';

Java UDF for stemming

The following is the Java UDF code snippet:

public String exec(Tuple input) throws IOException {

    //Few declarations go here

    Stemmer s = new Stemmer();

    //Code for exception handling goes here

    //Extract values from the input tuple

    String str = (String)input.get(0);

    /*

    Invoke the stem(str) method of the class Stemmer.

    It returns the stemmed form of the word

    */

    return s.stem(str);

}        

Java UDF for generating TF-IDF

The following is the Java UDF code snippet for computing TF-IDF:

public class GenerateTFIDF extends EvalFunc<Double>{

  @Override

  /**

  *The pre-calculated wordCount, wordCountPerDoc, totalDocs and docCountPerWord are passed as parameters to this UDF.

  */

  public Double exec(Tuple input) throws IOException {

    /*

    Retrieve the values from the input tuple

    */

    long countOfWords = (Long) input.get(0);

    long countOfWordsPerDoc = (Long) input.get(1);

    long noOfDocs = (Long) input.get(2);

    long docCountPerWord = (Long) input.get(3);

    /*

    Compute the overall relevancy of a document with respect to a term.

    */

    double tf = (countOfWords * 1.0) / countOfWordsPerDoc;

    double idf = Math.log((noOfDocs * 1.0) / docCountPerWord);

    return tf * idf;

  }

}
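The UDF's arithmetic, tf * log(N / df), is easy to sanity-check with a few lines of Python that mirror its parameters (generate_tfidf is an illustrative stand-in, not the Java class):

```python
import math

def generate_tfidf(count_of_words, count_of_words_per_doc,
                   no_of_docs, doc_count_per_word):
    """Mirror of the Java UDF's computation: tf * idf."""
    tf = count_of_words / count_of_words_per_doc
    idf = math.log(no_of_docs / doc_count_per_word)
    return tf * idf

# A word occurring 3 times in a 100-word document, present in 1 of 2 documents
score = generate_tfidf(3, 100, 2, 1)
```

Note that a word appearing in every document gets idf = log(1) = 0, so its TF-IDF is zero regardless of how often it occurs; that is why common words like "comput" score 0.0 in the results below.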

Results

The following are the results after applying the design pattern on the computer_science and information_technology wiki text corpus:

associat  information_technology.txt  0.0015489322470613302

author  information_technology.txt  7.744661235306651E-4

automat  computer_science.txt  8.943834587870262E-4

avail  computer_science.txt  0.0

avail  information_technology.txt  0.0

babbag  computer_science.txt  8.943834587870262E-4

babbage'  computer_science.txt  0.0026831503763610786

base  information_technology.txt  0.0

base  computer_science.txt  0.0

base.  computer_science.txt  8.943834587870262E-4

basic  information_technology.txt  7.744661235306651E-4

complex.  computer_science.txt  8.943834587870262E-4

compon  information_technology.txt  0.0015489322470613302

compsci  computer_science.txt  8.943834587870262E-4

comput  computer_science.txt  0.0

comput  information_technology.txt  0.0

computation  computer_science.txt  8.943834587870262E-4

computation.  computer_science.txt  8.943834587870262E-4

distinguish  information_technology.txt  7.744661235306651E-4

distribut  computer_science.txt  0.0

distribut  information_technology.txt  0.0

divid  computer_science.txt  8.943834587870262E-4

division.  computer_science.txt  8.943834587870262E-4

document  information_technology.txt  7.744661235306651E-4

encompass  information_technology.txt  7.744661235306651E-4

engin  computer_science.txt  0.0035775338351481047

engine.[5]  computer_science.txt  8.943834587870262E-4

enigma  computer_science.txt  8.943834587870262E-4

enough  computer_science.txt  0.0017887669175740523

enterprise.[2]  information_technology.txt  7.744661235306651E-4

entertain  computer_science.txt  8.943834587870262E-4

The original text is passed through the stopword removal and stemming phases, and term frequency-inverse document frequency is then calculated. The results show the word, the document it belongs to, and the TF-IDF. Words with high TF-IDF imply a strong relationship with the document they appear in, and words with a relatively low TF-IDF are considered low quality and can be ignored.

Additional information

The complete code and datasets for this section are in the following GitHub directories:

·        Chapter3/code/

·        Chapter3/datasets/

Summary

In this chapter, we build upon what we learned from Chapter 2, Data Ingest and Egress Patterns, where we integrated data from multiple source systems and ingested it into Hadoop. The next step is to find clues about the data type by looking at the constituent values. The values are examined to see if they are misrepresented, if their units are misinterpreted, or if the context of units is derived incorrectly. This sleuthing mechanism is discussed in more detail in the data type inference pattern.

In the basic statistical profiling pattern, we examine whether the values meet the quality expectations of the use case by collecting statistical information on the numeric values to answer questions such as the following: For a numeric field, are all the values numeric? Do all the values of enumerable fields fall into the proper set? Do the numeric fields meet the range constraint? Are they complete?

The pattern-matching design pattern explores a few techniques to measure the consistency of both numeric and text columns through their data type, data length, and regex patterns. The next pattern uncovers the quality metrics of columns representing string values by using various statistical methods; this is explained in detail in the string profiling design pattern. The unstructured text profiling design pattern is an attempt to formalize text pre-processing techniques, such as stopword removal, stemming, and TF-IDF calculation, to understand the quality of unstructured text.

In the next chapter, we will focus on data validation and cleansing patterns that can be applied to a variety of data formats. After reading this chapter, the audience will be able to choose the right pattern in order to validate the accuracy and completeness of data by using techniques, such as constraint checks and regex matching. We will also discuss data cleansing techniques, such as filters and statistical cleansing, in the next chapter.