Foundations of Coding: Compression, Encryption, Error Correction (2015)

Chapter 2. Information Theory and Compression

The previous chapter already contains several coding principles for compression: fax coding, instantaneous coding, the principle of entropy, and so on. We now carry out a deeper and more exhaustive survey of these principles, from the understanding of why compression is theoretically possible to examples of compression codes which are commonly used in computer science and in everyday life for storing and exchanging music and videos.

The objective is to reduce both the transmission time of each message and the storage space it requires. For this purpose, one has to build codes that minimize the size of messages.

Here we assume that the channel is not subject to perturbations (one says that encoding is noiseless); error handling will be studied in the last chapter. We are going to present encoding techniques that enable one to choose efficient codes, as well as an important theory for quantifying the information contained in a message, computing the minimum size of an encoding scheme, and thus determining the “value” of a given code.

We first focus on lossless data compression, that is, compression such that compression followed by decompression does not modify the original file. The first section describes the theoretical limits of lossless compression, and the second presents algorithms that approach those limits. Then the third section shows how changes of representation can expand these presupposed boundaries, with applications to common codes.

Then, at the end of this chapter, we introduce several techniques for compressing images and sounds that allow some loss of visual or audio quality, by modeling human perception.

Exercise 2.1 (It is Impossible to Compress ALL Files Without Loss)

1.  How many distinct files of size exactly N bits are there?

2.  How many distinct files of size strictly lower than N bits are there?

3.  Conclude on the existence of a method that will reduce the size of any file.

Solution on page 295.

Exercise 2.2 (On The Scarcity of Files Compressible Without Loss)

Show that less than one $N$-bit file over a million, with $N > 20$, is compressible by more than 20 bits, without loss of information.

Solution on page 295.
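The counting argument behind both exercises can be checked numerically. The following sketch (with an arbitrary file size $N = 32$) is only an illustration of the pigeonhole reasoning, not the book's solutions on page 295:

```python
N = 32  # arbitrary file size, in bits

# Exercise 2.1: there are 2**N distinct files of exactly N bits,
# and sum(2**i for i in [0, N-1]) = 2**N - 1 files of size < N bits.
exactly_n = 2 ** N
shorter = sum(2 ** i for i in range(N))
assert shorter == exactly_n - 1

# Since there are fewer files of size < N than files of size N, no
# injective (lossless) method can shorten every file.

# Exercise 2.2: a file compressible by more than 20 bits is mapped to
# one of the 2**(N - 20) - 1 files of size < N - 20, so the fraction
# of such N-bit files is below 2**-20 < 1/1_000_000.
frac = (2 ** (N - 20) - 1) / 2 ** N
assert frac < 2 ** -20 < 1e-6
```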

2.1 Information Theory

Information theory gives the mathematical framework for compression codes. Recall that an alphabet is a finite set on which messages are composed, containing the information to encode or the information already encoded. The set of letters of the source message (data compression is often called source coding) is the source alphabet, and the set of code letters is the code alphabet. For example, the Latin alphabet is the set of letters we are using to write this text, and $V = \{0, 1\}$ is the alphabet used to write the messages that are transmitted through most digital channels.

The set of finite strings over some alphabet $V$ is denoted by $V^+$, and the image of the source alphabet through the encoding function is a subset of $V^+$ called the set of codewords, or sometimes also simply called the code, especially in information theory.

Therefore, a code $C$ over some alphabet $V$ is a finite subset of $V^+$. The code is composed of the basic elements from which messages are built. An element $m$ in $C$ is called a codeword. Its length is denoted by $|m|$. The arity of the code is the cardinality of $V$. A code of arity 2 is said to be binary.

For example, $C = \{0, 10, 110\}$ is a binary code (arity 2) over the alphabet $V = \{0, 1\}$.

2.1.1 Average Length of a Code

Throughout this section, for the sake of simplicity and because of their practical importance in telecommunications, we mainly focus on binary codes. Nevertheless, most of the following results apply to any code.

As all codewords do not necessarily have the same length, one uses a measure depending on the frequencies of appearance in order to evaluate the length of the messages that will encode the source. Recall that a source of information is composed of an alphabet $S$ and a probability distribution $\mathcal{P}$ over $S$. For a symbol $s_i$ in a source $\mathcal{S} = (S, \mathcal{P})$, $P(s_i) = p_i$ is the probability of occurrence of $s_i$.

Let $\mathcal{S} = (S, \mathcal{P})$ with $S = \{s_1, \dots, s_n\}$, and let $C$ be a code in $V^+$ whose encoding function is $f$ ($C$ is the image of $S$ through $f$). The average length of the code $C$ is

$$l(C) = \sum_{i=1}^{n} P(s_i)\,|f(s_i)|.$$
Example 2.3

Let $\mathcal{S} = (S, \mathcal{P})$ with $S = \{a, b, c, d\}$ and $\mathcal{P} = \left(\frac12, \frac14, \frac18, \frac18\right)$.

If $C = \{00, 01, 10, 11\}$, the average length of the scheme is 2.

If $C = \{0, 10, 110, 111\}$, then the average length of the scheme is $1 \cdot \frac12 + 2 \cdot \frac14 + 3 \cdot \frac18 + 3 \cdot \frac18 = \frac74$.

One uses the average length of an encoding scheme in order to measure its efficiency.
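The average length is straightforward to compute. As a sketch, the probabilities and codewords below are illustrative choices (a four-symbol source with probabilities $(\frac12, \frac14, \frac18, \frac18)$), assumed here for the sake of the example:

```python
def average_length(probabilities, codewords):
    """Average codeword length: sum of p_i * |f(s_i)|."""
    return sum(p * len(c) for p, c in zip(probabilities, codewords))

# Illustrative source on four symbols (probabilities assumed here):
P = [1 / 2, 1 / 4, 1 / 8, 1 / 8]
print(average_length(P, ["00", "01", "10", "11"]))   # fixed length: 2.0
print(average_length(P, ["0", "10", "110", "111"]))  # variable length: 1.75
```

The second code is shorter on average precisely because it assigns the shortest codeword to the most probable symbol.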

2.1.2 Entropy as a Measure of the Amount of Information

We now reach the fundamental notions of information theory. Let us consider a source $\mathcal{S} = (S, \mathcal{P})$. One only knows the probability distribution of this source, and one wishes to quantify one's ignorance concerning the behavior of $\mathcal{S}$. For instance, this uncertainty is higher if the number of symbols in $S$ is large. It is low if the probability of occurrence of a symbol is close to 1, and it reaches its highest value if the distribution is uniform.

One uses the entropy of a source in order to measure the average amount of information issued by this source.

For example, let us imagine a fair die whose value is only given by comparison with a number we are able to choose: how many questions are required in order to determine the value of the die? If one proceeds by dichotomy, it only takes $\lceil \log_2 6 \rceil = 3$ questions. Now let us suppose that the die is unfair: the value 1 has probability $\frac12$, and each of the five other values has probability $\frac{1}{10}$. If the first question is “is it 1?,” then in half of the cases this question is enough to determine the value of the die. For the other half, three additional questions will be required. Hence, the average number of questions required is $\frac12 \cdot 1 + \frac12 \cdot (1 + 3) = 2.5$.

Actually, it is still possible to refine this result by noticing that three questions are not always required in order to determine the right value among $\{2, 3, 4, 5, 6\}$: if dichotomy splits these five possibilities into the two groups $\{2, 3\}$ and $\{4, 5, 6\}$, then only two additional questions will be required in order to find 2 or 3. Only 5 and 6 will require three questions to be separated. For a large number of draws, it is still possible to improve this method if the questions do not always split the set the same way, for example into $\{2, 3, 4\}$ and $\{5, 6\}$, so that two questions are alternately required, and so on. By extending this reasoning, one shows that the average number of questions required for the five possibilities is equal to $\log_2 5 \approx 2.32$.
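The dice computation above can be checked numerically; the following sketch reproduces the naive and refined question counts for the unfair die:

```python
import math

# Naive strategy for the unfair die (P(1) = 1/2, P(k) = 1/10 otherwise):
# ask "is it 1?" first, then three more yes/no questions if needed.
naive = 0.5 * 1 + 0.5 * (1 + 3)  # 2.5 questions on average

# Refined strategy: over many draws, separating the five remaining
# equiprobable values costs log2(5) questions on average.
refined = 0.5 * 1 + 0.5 * (1 + math.log2(5))  # about 2.16

# This refined average matches the entropy of the unfair die exactly.
H = 0.5 * math.log2(2) + 5 * 0.1 * math.log2(10)
assert abs(refined - H) < 1e-12
```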

Hence, the amount of information contained in this throw of the die (a definition that can easily be generalized to any source) is defined intuitively as this average number of questions.

Formally, entropy is defined, for a source $\mathcal{S} = (S, \mathcal{P})$ with $\mathcal{P} = (p_1, \dots, p_n)$, as

$$H(\mathcal{S}) = -\sum_{i=1}^{n} p_i \log_2 p_i = \sum_{i=1}^{n} p_i \log_2\left(\frac{1}{p_i}\right).$$
This is a measure of the uncertainty attached to a probability law, which is traditionally illustrated with the example of the die: one considers the random variable (source) generated by the throw of an $n$-face die. We have seen that there is more uncertainty in the result of this experiment if the die is fair than if the die is unfair. This can be written in the following way: for all $(p_1, \dots, p_n)$,

$$H(p_1, \dots, p_n) \leq H\left(\frac{1}{n}, \dots, \frac{1}{n}\right) = \log_2 n,$$

according to Property 1 in Chapter 1.
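The entropy formula is easy to evaluate; this sketch compares the fair and unfair dice of the discussion above:

```python
import math

def entropy(probabilities):
    """H = -sum(p_i * log2(p_i)), with 0 * log(0) taken as 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

fair = [1 / 6] * 6
unfair = [1 / 2] + [1 / 10] * 5
print(entropy(fair))    # log2(6), about 2.585: the maximum for 6 symbols
print(entropy(unfair))  # about 2.161: strictly lower uncertainty
assert entropy(unfair) < entropy(fair) <= math.log2(6) + 1e-12
```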

2.1.3 Shannon's Theorem

This fundamental theorem of information theory is known as Shannon's theorem, or the noiseless coding theorem.

First of all, we formulate the theorem when considering a source without memory.


Theorem 20

Let $\mathcal{S}$ be a source without memory of entropy $H(\mathcal{S})$. Any uniquely decodable code of $\mathcal{S}$ over an alphabet $V$ of size $q$ (i.e., $|V| = q$), with average length $l$, satisfies

$$l \geq \frac{H(\mathcal{S})}{\log_2 q}.$$

Moreover, there exists a uniquely decodable code of $\mathcal{S}$ over an alphabet of size $q$, with average length $l$, that satisfies

$$l < \frac{H(\mathcal{S})}{\log_2 q} + 1.$$

First part: Let $C = \{c_1, \dots, c_n\}$ be a uniquely decodable code of $\mathcal{S}$ over some alphabet of size $q$, and let $l_1, \dots, l_n$ be the lengths of the words in $C$. If $K = \sum_{i=1}^{n} q^{-l_i}$, then $K \leq 1$ from Kraft's theorem (see page 69). Let $(q_1, \dots, q_n)$ be such that $q_i = \frac{q^{-l_i}}{K}$ for all $i \in \{1, \dots, n\}$. One has $q_i > 0$ for all $i$, and $\sum_{i=1}^{n} q_i = 1$, thus $(q_1, \dots, q_n)$ is a probability distribution. Gibbs' lemma (see page 17) can be applied, and one obtains

$$\sum_{i=1}^{n} p_i \log_2 \frac{q_i}{p_i} \leq 0;$$

in other words,

$$H(\mathcal{S}) = -\sum_{i=1}^{n} p_i \log_2 p_i \leq -\sum_{i=1}^{n} p_i \log_2 \frac{q^{-l_i}}{K} = \log_2 q \sum_{i=1}^{n} p_i l_i + \log_2 K = l \log_2 q + \log_2 K.$$

Yet, because $K \leq 1$, one has $\log_2 K \leq 0$; hence the result.

Second part: Let $l_i = \left\lceil \log_q \frac{1}{p_i} \right\rceil$. As $\sum_{i=1}^{n} q^{-l_i} \leq \sum_{i=1}^{n} p_i = 1$ (indeed, $q^{-l_i} \leq p_i$), there exists a code of $\mathcal{S}$ over an alphabet of size $q$, uniquely decodable, with lengths of codewords equal to $l_1, \dots, l_n$. Its average length is $l = \sum_{i=1}^{n} p_i l_i$. Then, the property of the ceiling function gives us $l_i < \log_q \frac{1}{p_i} + 1$ and, as a consequence,

$$l < \sum_{i=1}^{n} p_i \left(\log_q \frac{1}{p_i} + 1\right) = \frac{-\sum_{i=1}^{n} p_i \log_2 p_i}{\log_2 q} + 1.$$

This can be written as $l < \frac{H(\mathcal{S})}{\log_2 q} + 1$. This proves the theorem.
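The construction in the second part of the proof can be illustrated numerically: choosing lengths $l_i = \lceil \log_q(1/p_i) \rceil$ satisfies Kraft's inequality and lands the average length within the bounds of Theorem 20. A sketch, with an arbitrary example distribution:

```python
import math

def shannon_lengths(probabilities, q=2):
    """Codeword lengths l_i = ceil(log_q(1/p_i)) from the proof."""
    return [math.ceil(math.log(1 / p, q)) for p in probabilities]

P = [0.4, 0.3, 0.2, 0.1]  # arbitrary example distribution
q = 2
lengths = shannon_lengths(P, q)  # [2, 2, 3, 4]

# These lengths satisfy Kraft's inequality, so a uniquely decodable
# code with these codeword lengths exists.
assert sum(q ** -l for l in lengths) <= 1

# The average length lands between H(S)/log2(q) and H(S)/log2(q) + 1.
H = -sum(p * math.log2(p) for p in P)
avg = sum(p * l for p, l in zip(P, lengths))
assert H / math.log2(q) <= avg < H / math.log2(q) + 1
```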

One deduces the theorem for the $k$th extension $\mathcal{S}^k$ of $\mathcal{S}$:


Theorem 21

Let $\mathcal{S}$ be a source without memory of entropy $H(\mathcal{S})$. Any uniquely decodable code of $\mathcal{S}^k$ over some alphabet of size $q$, with average length $l_k$, satisfies

$$\frac{l_k}{k} \geq \frac{H(\mathcal{S})}{\log_2 q}.$$

Moreover, there exists a uniquely decodable code of $\mathcal{S}^k$ over some alphabet of size $q$, with average length $l_k$, that satisfies

$$\frac{l_k}{k} < \frac{H(\mathcal{S})}{\log_2 q} + \frac{1}{k}.$$

The proof of the theorem is immediate: according to Property 2 on page 21, one has $H(\mathcal{S}^k) = k \cdot H(\mathcal{S})$, and it then suffices to apply Theorem 20 to $\mathcal{S}^k$.
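The identity $H(\mathcal{S}^k) = k \cdot H(\mathcal{S})$ can be verified numerically for a memoryless source, since the $k$th extension then emits $k$-tuples of symbols with product probabilities; an illustrative check with an arbitrary example distribution:

```python
import math
from itertools import product

def entropy(probabilities):
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# kth extension of a memoryless source: k-tuples of symbols with
# product probabilities.
P = [0.5, 0.3, 0.2]  # arbitrary example distribution
k = 3
Pk = [math.prod(t) for t in product(P, repeat=k)]

assert abs(sum(Pk) - 1) < 1e-12                  # still a distribution
assert abs(entropy(Pk) - k * entropy(P)) < 1e-9  # H(S^k) = k * H(S)
```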

For any stationary source, the theorem can be formulated as follows:


Theorem 22

For any stationary source $\mathcal{S}$ of entropy $H(\mathcal{S})$, there exists some uniquely decodable encoding process over an alphabet of size $q$, with average length $l$, as close as one wants to its lower bound:

$$\frac{H(\mathcal{S})}{\log_2 q}.$$
In theory, it is then possible to find a code endlessly approaching this bound (which depends on the entropy of the source). However, in practice, if the process consists of encoding the words of an extension of the source, one is obviously limited by the number of these words ($|S|^k$ for the $k$th extension, which might represent a very large number of words). In the sequel, we present several encoding processes as well as their relation to this theoretical bound.