Foundations of Coding: Compression, Encryption, Error Correction (2015)

Chapter 1. Foundations of Coding

This first chapter is an introduction to the notion of code. It contains the mathematical background, necessary to manipulate the codes, and the coding operations. Through this chapter, the reader will understand the constraints of the transmission of information, starting from historical examples and eventually learning advanced techniques. The knowledge of some historical aspects is pedagogically useful to work first on simple structures. In cryptology, history is also essential for efficient protection against the whole range of known attacks.

The principle is always to start from examples of codes, showing step by step why some mathematical notions are useful so that the codes are short, safe, and efficient. While the following chapters describe recent, elaborate, and currently used protocols for coding, this chapter provides the foundations of this activity.

Simple objects from probability theory, algebra, or algorithmic are introduced along the lines of this chapter, when they become necessary. For example, block coding calls for the definition of structures in which elements can be added, multiplied, which justifies the introduction of groups and fields. The emphasis is always put on the effective construction of introduced structures. This calls for a section on algorithms, as well as polynomials and primitive roots.

This chapter is organized in four sections. The first three allow to focus on the three important mathematical notions related to coding: algorithms and their complexity, which is at the core of coding theory; probabilities, related to stream cipher; and algebra, related to block coding. Then, the last section of this chapter is devoted to the many facets of decoding, from ambiguity and information loss to secret code breaking.

We will denote by code a transformation method that converts the representation of a piece of information into another one. This definition is wide enough to include several mathematical objects (functions, algorithms, and transforms), which will be used throughout this book.

The word code will also apply to the result of these transformations1, namely the encoded information and its structure. But for now, in order to find our way in these definitions and to understand what is actually a code, here is a simple example mixing technologies and eras.

1.1 From Julius Caesar to Telecopy

In this section, we introduce all the fundamental properties of codes, using one of them built from real examples – currently or formerly used. It is the occasion to define and use the algorithms and their properties, which are constantly used in the whole book.

Suppose that we wish to send a piece of information by fax while guaranteeing the secrecy of the transmission. We could do it with the following code.

1.1.1 The Source: from an Image to a Sequence of Pixels

Transmission by fax enables us to pass images through a phone channel. We want this code to perform the requested transformations in order to obtain – quickly and secretly – an image similar to the original one at the end, in a format that could pass through the channel.

The first transformation process will be performed by the scanner of the device. It consists in reading the image and transforming it into a sequence of pixels that we can visualize as small squares either black or white. This is a code, according to the definition we have given, and we will write it as an algorithm, in a format we will adopt throughout this book.

ngr001

The input of the algorithm is also called the source, and the output is called the code.

The chosen method will be the following: the width of the source image is split into 1728 equal parts; the length is then split into lines so that the image is divided into squares of the same size, 1728 per line. These squares will be considered as pixels. Each pixel is given either the color black if the zone of the image is dark enough or the color white if the image is bright.

This is the first part of our code. The following sections describe the other transformations; one can use to encode the information: compression, encryption, and error correction.

1.1.2 Message Compression

In order to formalize source messages and codes, we define the language in which they are formulated. We call alphabet a finite set c1-math-0001 of elements (called characters). The cardinal number of a finite set V is the number of elements it contains and is denoted by c1-math-0002.

A sequence of characters belonging to V is called a string . We denote by c1-math-0003 the set of all the strings over V, and c1-math-0004 the set of all the strings whose length is not equal to 0. As the alphabet of the code and the alphabet of the source may differ, we will distinguish thesource alphabet and the code alphabet.

For the example of the fax, the source alphabet is the result of the scanner, that is, c1-math-0005 and the code alphabet will be the set of bits c1-math-0006. We could send the sequence of pixels directly as bits, as we have an immediate conversion to 0 (for white pixels) and 1 (for black pixels).

However, we can apply some compression principles to this sequence of 0s and 1s. We can imagine that the datasheet we want to send has only a few lines of text and is mainly white – which is very common with faxes. But is there a better way of sending a sequence of, let us say a 10 000 zeros, than simply using a 10 000 zeros? Surely such a method exists, and we have just used one by evoking that long sequence of bits without writing it explicitly. We have indicated that the code is composed of 10 000 zeros rather than writing them down.

The fax code performs that principle of compression line by line (i.e., over 1728-pixels blocks). For each line, Algorithm 1.1 describes precisely the encoding algorithm.

ngr002

For example, with this method, a completely white line will not be encoded with a sequence of 1728 zeros, but with the code “ 11011000000 0” which represents “ 1728 0” in the binary representation that will be sent through the channel. We have replaced a 1728-bit message by a 12-bit message. This is noticeably shorter.

We will give details of this principle of message compression – called Run-Length Encoding (RLE) – in Section 2.3.1.

For a better visibility, we used a space character in our representation (between 1728 and 0) but that character does not belong to the code alphabet. In practice, we will see later in this chapter what constraints that statement implies.

Exercise 1.1

A pixelized image meant to be sent by fax contains a lot of pairs “ 01.” What do you think of the fax code presented before? Give a more effective code.

Solution on page 281.

1.1.3 Error Detection

All the readers who have already used a phone channel will not be surprised to learn that its reliability is not infinite. Each message is likely to be distorted, and it is not impossible, assuming that “ 11011000000 0” was sent on the channel, to receive “ 11010000000 0” (modification of one bit) or “ 1101000000 0” (loss of one bit).

Phone channels usually have an error rate between c1-math-0007 and c1-math-0008, depending on their nature, which means that they can commit an error every 10,000 bits. This is far from being negligible when you consider long messages, and it can also modify their meaning. For an image, if the value 1728 becomes 1664 because of the loss of one single bit, a shift will occur in the final image and the result will be unusable.

The fax code enables the detection of such transmission errors. If an error is detected on a line, one can ask for a second transmission of the same line to have a confirmation and – as it is unlikely to have the same error twice at the same place – the message will be corrected.

The principle of error detection in the fax code is explained as follows: predetermined sequences are appended at the beginning and at the end of each line. Such flag sequences are used to establish bit and line synchronization (and in particular detect loss of bit). In order to illustrate this principle, let us suppose that “ 0” (respectively “ 1729”) is added at the beginning (respectively the end) of each line even if these are not the exact values that are used in practice. The receiver can then check that each line is in the format “ c1-math-0009,” with c1-math-0010 an integer thatprovides the number of consecutive bits and c1-math-0011 the color of these bits. In particular, the condition c1-math-0012 must be respected and the colors c1-math-0013 must be alternated. Thus, a modification or the loss of one bit is easily detected as soon as this format is not respected.

All error detection and correction principles, which will be closely studied in Chapter 4, are based on this principle: the addition of information in order to check the consistency of the received message.

1.1.4 Encryption

Now, let us suppose that, after having compressed the message and set up a format that enables error detection, we wish to keep the message secret for everybody but its recipient. The phone channel, like most channels, does not provide secrecy in itself. Every message that is transmitted through it can be easily read by a third party. The setting up of the secret consists in transforming the original message, the plaintext, putting it into a nonunderstandable form, the ciphertext, and putting it back into its original form at the end.

This is a technique that has been used by men as they started to communicate. In secret codes of Antiquity, the secret resided in the method that was used to produce the ciphertext, which is what we call the encryption algorithm nowadays. For example, historians have discovered messages encrypted by Julius Caesar's secret services. The messages were texts, and the algorithm substituted each letter of the original message M with the letter located three positions later in the alphabet. For the three last letters of the alphabet, the three first letters were used. For example, the word TROY became WURB. Hence, the text did not have any immediate signification. That is what is called mono-alphabetic substitution, as each letter is replaced by another one (always the same) in the message.

If Caesar had wanted to send a fax, he would have adapted his code to the numerical format, which would have given the function c1-math-0014 for every number sent on the channel where the number n is the size of the alphabet. Here it would have been 1730 as no number greater than 1729 would theoretically be used.

These encryption and decryption functions were then extended with a simple key K, an integer chosen secretly between the interlocutors. This is equivalent to the construction of a function c1-math-0015. As for the Spartans, they used a completely different encryption algorithm, called transposition encryption. Their system is based on the Scytale , a stick on which a strip was rolled. The message was written on the strip along the stick rather than along the scroll. This means that the consecutive letters of the message appeared on a circumlocution different from the one of the strip.

Figure 1.1 illustrates the encryption principle. In order to decrypt themessage, the recipient would have a stick with the same diameter as the stick used for the encryption.

Other cryptographic systems, more evolved, were created afterward (affine encryption c1-math-0016 – studied in Exercises 1.2 and 3.1; substitution encryption where each letter is replaced by a symbol of another alphabet – such as the Morse code, etc…).

nfg001

Figure 1.1 Spartan Scytale principle to transmit the text “ HELLOWORLD”

Exercise 1.2

Mark Antony intercepts a message sent by Caesar, encrypted with an affine encryption. Without knowing the key c1-math-0017, how can he decrypt the message?

Solution on page 282.

1.1.5 Decryption

In Section 1.1.4, we have seen some methods for encrypting a message and insuring the secrecy and the efficiency of its transmission. This description would not be complete without a short presentation of the principles of the complementary operation – decryption.

1.1.5.1 Attack and Decryption

The encrypted message arrives. It is a matter of decrypting it. If one is not given the key or the encryption method, we talk about an attack on the code, or a codebreaking. The discipline which handles the development of methods of attacking existing codes or of building more resistant codes is called cryptanalysis. For the recipient of the message – who knows the key – we talk about decryption.

We will soon talk about attacks. For now, the decryption method is very simple and consists in applying the inverse function of the encryption function, possibly customized with a key, namely c1-math-0018. In our case, the recipient of Caesar's fax would have to apply c1-math-0019 for every number received on the channel The message is now decrypted.

Now, we still have to perform the inverse of all the remaining transformations. First of all, the format of each line is checked in order to detect possible errors (in this case, we ask for a second transmission), then the decoding algorithm is performed which –given a sequence of numbers – will return the initial pixel sequence. Algorithm 1.2 formalizes this operation for each line.

ngr003

1.1.5.2 Decompression and Data Loss

When we apply this algorithm to all the lines, we obtain a sequence of pixels, identical to the one we built from the original image. Now, we can recover the original image, or at least a similar image as the only information we have is the value of the pixels. Thus, the method consists in printing the sequence of corresponding black and white squares on a sheet of paper. The image resulting from this operation will be a damaged version of the initial one, as the original squares were not completely black or white. That is what gives the nonaesthetic aspect of all faxes, but the essential information is preserved.

In this case, initial information is not entirely recovered – and arrives in an approximate form – thus we talk about encoding with (data-)loss. One often uses codes which admit a determined level of loss, providing that important information is not distorted. It is often the case with multimedia information (see Section 2.5).

1.1.6 Drawbacks of the Fax Code

We have completely described the encoding – and decoding – processes, while respecting the major constraints we will encounter in coding (compression efficiency, error detection, and secrecy). But for several reasons, the principles used in the fax code are not very usable in practice in common numerical communications.

Description 1.3

Compression: the way the RLE principle is applied here has several drawbacks. First of all, if the message is only composed of alternating black and white pixels, the size of the “ compressed” message will be greater than that of the original. Thus, there is absolutely no guarantee of the compression of the message. Moreover, including the bits c1-math-0020 is almost useless as they are alternated and the value of the first one would be enough to determine the value of the others. The fax code removes them. It then becomes a sequence of numbers c1-math-0021. But that sequence – encoded as bits (to each number its binary representation) – is more difficult to decode. Indeed, when receiving a bit sequence “ 1001011010010,” how do we know whether it represents a number c1-math-0022or several numbers appended ? In order to solve this issue, for example, we can encode every number with the same number of bits but then the compression algorithm is no longer optimal. We are now forced to represent “ 2” with the sequence “ 00000000010,” which increases the size of the message. We will see how to insure that a message can be decoded later in this chapter. In Chapter 2, we will see how to deduce good compression methods while insuring this decoding ability.

Error detection: this principle implies that one has to ask for a second transmission of information each time an error is detected, whereas we could develop codes that automatically correct the channel errors (Chapter 4). Moreover, there is no theoretical guarantee that this code adapts to phone channel error rates. We do not know the error rate detected by the code, nor if it can accomplish the same performance adding less information in the messages. Efficiency of the transmission may depend on the quality of error detection and correction principles.

Secrecy: Caesar's code is breakable by any beginner in cryptology. It is easy to decode any mono-alphabetic substitution principle, even without knowing the key. A simple cryptanalysis consists of studying the frequency of appearance of the letters in the ciphertext and of deducing the correspondence with the letters of the plaintext.

Table 1.1 presents the statistical distribution of the letters in this document written in LaTeX (LaTeX special tags are also considered). As this book is quite long, we can assume that these frequencies are representative of scientific texts in English written in LaTeX. It is obviously possible to obtain such frequency tables for literary English texts, scientific French texts, and so on.

Table 1.1 Letter Distribution in this LaTeX Script

E

13.80%

T

8.25%

I

8.00%

N

7.19%

A

6.63%

O

6.57%

R

6.37%

S

6.28%

C

4.54%

D

4.35%

L

4.34%

U

3.34%

M

3.15%

P

2.92%

H

2.81%

F

2.36%

G

1.86%

B

1.83%

X

1.35%

Y

1.07%

V

0.87%

W

0.69%

K

0.57%

Q

0.55%

Z

0.16%

J

0.14%

               

Exercise 1.3

Scipio discovers a parchment manuscript with the following ciphertext: HFJXFW BTZQI MFAJ GJJS UWTZI TK DTZ! Help Scipio decrypt this message.

Solution on page 282.

1.1.7 Orders of Magnitude and Complexity Bounds for Algorithms

We have described a code and we have presented its advantages and disadvantages. However, there is another critical point we have not yet mentioned: what about encoding and decoding speed? It depends on computer speed, but mostly on the complexity of the algorithms.

Size of numbers We consider numbers and their size either in decimal digits or in bits. Thus, a number m will be c1-math-0023 digits long and c1-math-0024 bits long. For instance, 128 bits can be represented with 39 decimal digits, 512 bits with 155 decimal digits, and 1024 bits with 309 decimal digits.

Computer speed Nowadays, the frequency of any PC is at least 1 GHz. That is to say it can execute 1 billion (c1-math-0025) operations per second. By way of comparison, the greatest speed in the universe is the speed of light: c1-math-0026 km/s c1-math-0027 m/s. It takes only a ten thousand millionth of a second for light to cross a room of 3 m long. During this period, your computer has performed 10 operations!!! Hence, we can say that current computers calculate at the speed of light.

Comparison with the size and age of the universe This computing speed is truly astronomical; yet the size of the numbers we manipulate remains huge. Indeed, one has to enumerate c1-math-0028 numbers just in order to count up to a 39 digit long number. In order to see how huge it is, let us calculate the age of the universe in seconds:

equation

Thus, a 1 GHz computer would take more than two million times the age of the universe just to count up to a number of “ only” 39 digits! As for a 155 digit number (commonly used in cryptography), it is simply the square of the number of electrons in the universe. Indeed, our universe is thought to contain about c1-math-0030 galaxies, each galaxy enclosing roughly 200 billion stars.Knowing that the weight of our sun is approximately c1-math-0031 kg and that the weight of an electron is c1-math-0032 kg, we have

equation

Yet such numbers will have to be manipulated. Their size will represent one of the algorithmic challenges we will have to take up, as well as a guarantee of the secrecy of our messages if it takes millions of years of computing – on the billion machines available on Earth, including the fastest ones – to decrypt them without the key. We will see how to build algorithms which can handle such numbers later.

Complexity bounds of the algorithms There may be many different algorithms for solving the same computational problem. However despite the power of computers, an ill-conceived algorithm could turn out to be unusable. It is therefore of interest to find a way to compare the efficiency of two algorithms. This is done on the basis of a computational model and the size of the problem instance, which is a measure of the quantity of input data.

There exists many different computational models. For instance, one can use Turing machines, random-access machines, or Boolean circuits. Yet we will not detail them in this book (see, e.g., [1] for more details). As for the size and in order to be able to compute it in every instance, we assume that the input is a sequence of bits and the size of the input is the length of the sequence. This statement implies that we have to build a code that transforms the input of an algorithm into a sequence of bits. This is often a quite simple operation. For instance, it takes about c1-math-0034 bits to code an integer a (we simply write it in base 2). Thus, the size of an integer is equal to its logarithm in base 2. The size of a sequence of black and white pixels is the length of that sequence.

The number of basic computer steps needed by an algorithm expressed as a function of the size of the problem is called the time complexity2 of the algorithm. This helps to define a machine-independent characterization of an algorithm's efficiency as the naive evaluation of an execution time crucially depends on the processor speed and various architecture-specific criteria.

Indeed, the notion of basic computer step is vague as the set of processor's instruction has a variety of basic primitives (branching, comparing, storing, performing arithmetic operations, etc.). However, all of them can be evaluated in terms of bit operations; Therefore, the true complexity of an algorithm is computed using this elementary unit. For the sake of simplicity, one sometimes counts only the number of arithmetical operations requested by the algorithm, typically limited to the four classical operations (addition, subtraction, multiplication, and division) and some binary operations such as bit shift. In that case, it should be reminded that the conversion to bit operation depends on the base field.

As it is often impossible to count exactly all the basic computer steps performed during the execution of a program, the complexity of an algorithm is generally given with asymptotic upper bounds and lower bounds using the Big-O notation (also called Landau notation). More formally, if c1-math-0035 and c1-math-0036 are functions from positive integers to positive reals, then for any n large enough, it is said that

·     c1-math-0037 if there exists a constant c1-math-0038 such that c1-math-0039. This defines g as an asymptotic upper bound for f. Intuitively, it means that f grows no faster asymptotically than c1-math-0040 to within a constant multiple;

·     c1-math-0041 if there exists a constant c1-math-0042 such that c1-math-0043; This defines g as an asymptotic lower bound for f;

·     c1-math-0044 if c1-math-0045 and c1-math-0046; and This defines g as an asymptotic tight bound for f;

Therefore, if an algorithm takes for instance, c1-math-0047 steps on an input of size n, we simply say it runs with a time complexity bound of c1-math-0048Table 1.2 summarizes the typically met complexity functions and their associated names. AS Landau notation enables us to give the complexity to within a constant multiple, we can write c1-math-0057 without giving the base of the logarithm, knowing that the base multiplies the function by a constant (namely the logarithm of the base).

Exercise 1.4

What is the complexity of the fax code presented in the previous section?

Solution on page 282.

Table 1.2 Typical Complexity Bounds Obtained from an Algorithm Analysis

Function

Complexity

Example

c1-math-0049

Constant

Table lookup

c1-math-0050

Logarithmic

Binary search on a sorted vector

c1-math-0051

Linear

Traversing of an unsorted vector

c1-math-0052

Quasilinear

Quicksort, FFT (Section 1.4.2.4)

c1-math-0053

Polynomial

Naive square matrix multiplication – c1-math-0054

c1-math-0055

Exponential

Naive Fibonacci – c1-math-0056

In practice, every algorithm should have a linear complexity c1-math-0058 – where n is the size of the source message – to be “ real-time” efficient, that is, usable in a transmission when a long latency is not acceptable (telephony, multimedia, etc.).

One of the fields of cryptography relies on the hypothesis that there exists some problems where no algorithm with such complexity is known. For such problems, every known algorithm requests astronomical computing time – which makes their application impossible.

In this book, we consider that a problem is impossible to solve (one often uses the euphemism “ difficult”) when there is no known algorithm which solves it in humanly reasonable time, for instance, if its resolution would request more than c1-math-0059 operations.