Computer Vision: Models, Learning, and Inference (2012)

Part I

Probability

Chapter 3

Common probability distributions

In Chapter 2 we introduced abstract rules for manipulating probabilities. To use these rules we will need to define some probability distributions. The choice of distribution Pr(x) will depend on the domain of the data x that we are modeling (Table 3.1).

| Data type | Domain | Distribution |
|---|---|---|
| univariate, discrete, binary | x ∈ {0, 1} | Bernoulli |
| univariate, discrete, multivalued | x ∈ {1, 2, …, K} | categorical |
| univariate, continuous, unbounded | x ∈ ℝ | univariate normal |
| univariate, continuous, bounded | x ∈ [0, 1] | beta |
| multivariate, continuous, unbounded | x ∈ ℝ^K | multivariate normal |
| multivariate, continuous, bounded, sums to one | x = [x1, x2, …, xK]ᵀ; xk ∈ [0, 1], ∑k xk = 1 | Dirichlet |
| bivariate, continuous, x1 unbounded, x2 bounded below | x = [x1, x2]; x1 ∈ ℝ, x2 ∈ ℝ+ | normal-scaled inverse gamma |
| vector x and matrix X, x unbounded, X square, positive definite | x ∈ ℝ^K; X ∈ ℝ^(K×K), zᵀXz > 0 ∀ z ∈ ℝ^K | normal inverse Wishart |

Table 3.1 Common probability distributions: the choice of distribution depends on the type/domain of data to be modeled.

Probability distributions such as the categorical and normal distributions are obviously useful for modeling visual data. However, the need for some of the other distributions is not so obvious; for example, the Dirichlet distribution models K positive numbers that sum to one. Visual data do not normally take this form.

| Distribution | Domain | Parameters modeled by |
|---|---|---|
| Bernoulli | x ∈ {0, 1} | beta |
| categorical | x ∈ {1, 2, …, K} | Dirichlet |
| univariate normal | x ∈ ℝ | normal-scaled inverse gamma |
| multivariate normal | x ∈ ℝ^K | normal inverse Wishart |

Table 3.2 Common distributions used for modeling (left) and their associated domains (center). For each of these distributions, there is a second associated distribution over the parameters (right).

The explanation is as follows: when we fit probability models to data, we need to know how uncertain we are about the fit. This uncertainty is represented as a probability distribution over the parameters of the fitted model. So for each distribution used for modeling, there is a second distribution over the associated parameters (Table 3.2). For example, the Dirichlet is used to model the parameters of the categorical distribution. In this context, the parameters of the Dirichlet would be known as hyperparameters. More generally, the hyperparameters determine the shape of the distribution over the parameters of the original distribution.

We will now work through the distributions in Table 3.2 before looking more closely at the relationship between these pairs of distributions.

3.1 Bernoulli distribution

The Bernoulli distribution (Figure 3.1) is a discrete distribution that models binary trials: it describes the situation where there are only two possible outcomes x ∈ {0, 1} which are referred to as “failure” and “success.” In machine vision, the Bernoulli distribution could be used to model the data. For example, it might describe the probability of a pixel taking an intensity value of greater or less than 128. Alternatively, it could be used to model the state of the world. For example, it might describe the probability that a face is present or absent in the image.

The Bernoulli has a single parameter λ ∈ [0, 1] which defines the probability of observing a success x = 1. The distribution is hence

$$\begin{aligned} Pr(x = 0) &= 1 - \lambda \\ Pr(x = 1) &= \lambda. \end{aligned} \tag{3.1}$$

We can alternatively express this as

$$Pr(x) = \lambda^{x} (1 - \lambda)^{1 - x}, \tag{3.2}$$

and we will sometimes use the equivalent notation

$$Pr(x) = \text{Bern}_{x}[\lambda]. \tag{3.3}$$

Figure 3.1 Bernoulli distribution. The Bernoulli distribution is a discrete distribution with two possible outcomes, x ∈ {0, 1} which are referred to as failure and success, respectively. It is governed by a single parameter λ that determines the probability of success such that Pr(x = 0) = 1 − λ and Pr(x = 1) = λ.
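
As a minimal sketch of the definitions above, the following Python evaluates the Bernoulli probabilities and draws binary trials. The function names `bernoulli_pmf` and `sample_bernoulli` are our own choices for illustration and are not part of the text.

```python
import random


def bernoulli_pmf(x: int, lam: float) -> float:
    """Pr(x) = lam**x * (1 - lam)**(1 - x) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("x must be 0 (failure) or 1 (success)")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam ** x * (1.0 - lam) ** (1 - x)


def sample_bernoulli(lam: float, rng: random.Random) -> int:
    """Draw one binary trial: 1 with probability lam, otherwise 0."""
    return 1 if rng.random() < lam else 0


if __name__ == "__main__":
    lam = 0.3
    print(bernoulli_pmf(0, lam), bernoulli_pmf(1, lam))
    rng = random.Random(0)
    draws = [sample_bernoulli(lam, rng) for _ in range(10000)]
    # The fraction of successes should be close to lam.
    print(sum(draws) / len(draws))
```

The single expression `lam**x * (1 - lam)**(1 - x)` reproduces both cases Pr(x = 0) = 1 − λ and Pr(x = 1) = λ, which is why the compact form is convenient in later derivations.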

3.2 Beta distribution

The beta distribution (Figure 3.2) is a continuous distribution defined on a single variable λ where λ ∈ [0, 1]. As such it is suitable for representing uncertainty in the parameter λ of the Bernoulli distribution.

Figure 3.2 Beta distribution. The beta distribution is defined on [0, 1] and has parameters (α, β) whose relative values determine the expected value so that E[λ] = α/(α + β) (numbers in parentheses show the α, β for each curve). As the absolute values of (α, β) increase, the concentration around E[λ] increases. a) E[λ] = 0.5 for each curve, concentration varies. b) E[λ] = 0.25. c) E[λ] = 0.75.

The beta distribution has two parameters α, β ∈ (0, ∞); both take positive values and affect the shape of the curve as indicated in Figure 3.2. Mathematically, the beta distribution has the form

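$$Pr(\lambda) = \frac{\Gamma[\alpha + \beta]}{\Gamma[\alpha]\,\Gamma[\beta]}\, \lambda^{\alpha - 1} (1 - \lambda)^{\beta - 1}, \tag{3.4}$$

where Γ[•] is the gamma function.¹ For short, we abbreviate this to

$$Pr(\lambda) = \text{Beta}_{\lambda}[\alpha, \beta]. \tag{3.5}$$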
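
The beta density can be evaluated directly from the gamma function. The sketch below is illustrative only: `beta_pdf` and `integrate` are hypothetical helper names, and the midpoint rule is just a convenient way to check numerically that the density integrates to one and that E[λ] = α/(α + β).

```python
import math
from typing import Callable


def beta_pdf(lam: float, alpha: float, beta: float) -> float:
    """Gamma[a + b] / (Gamma[a] Gamma[b]) * lam**(a - 1) * (1 - lam)**(b - 1)."""
    if alpha <= 0.0 or beta <= 0.0:
        raise ValueError("alpha and beta must be positive")
    if not 0.0 < lam < 1.0:
        # Sketch only: skip the endpoints, where the density can be infinite.
        return 0.0
    # Work with logs so that large alpha, beta do not overflow the gamma function.
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    log_p = log_norm + (alpha - 1.0) * math.log(lam) + (beta - 1.0) * math.log(1.0 - lam)
    return math.exp(log_p)


def integrate(f: Callable[[float], float], n: int = 20000) -> float:
    """Midpoint-rule approximation of the integral of f over [0, 1]."""
    h = 1.0 / n
    return h * sum(f((i + 0.5) * h) for i in range(n))


if __name__ == "__main__":
    alpha, beta = 2.0, 6.0
    print(integrate(lambda lam: beta_pdf(lam, alpha, beta)))        # close to 1
    print(integrate(lambda lam: lam * beta_pdf(lam, alpha, beta)))  # close to alpha / (alpha + beta)
```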

3.3 Categorical distribution

The categorical distribution (Figure 3.3) is a discrete distribution that determines the probability of observing one of K possible outcomes. Hence, the Bernoulli distribution is a special case of the categorical distribution when there are only two outcomes. In machine vision the intensity data at a pixel is usually quantized into discrete levels and so can be modeled with a categorical distribution. The state of the world may also take one of several discrete values. For example, an image of a vehicle might be classified into {car, motorbike, van, truck}, and our uncertainty over this state could be described by a categorical distribution.

Figure 3.3 The categorical distribution is a discrete distribution with K possible outcomes, x ∈ {1, 2, …, K} and K parameters λ1, λ2, …, λK where λk ≥ 0 and ∑k λk = 1. Each parameter represents the probability of observing one of the outcomes, so that the probability of observing x = k is given by λk. When the number of possible outcomes K is 2, the categorical reduces to the Bernoulli distribution.

The probabilities of observing the K outcomes are held in a K × 1 parameter vector λ = [λ1, λ2, …, λK]ᵀ, where λk ∈ [0, 1] and ∑k λk = 1. The categorical distribution can be visualized as a normalized histogram with K bins and can be written as

$$Pr(x = k) = \lambda_{k}. \tag{3.6}$$

For short, we use the notation

$$Pr(x) = \text{Cat}_{x}[\boldsymbol{\lambda}]. \tag{3.7}$$

Alternatively, we can think of the data as taking values x ∈ {e1, e2, …, eK} where ek is the kth unit vector; all elements of ek are zero except the kth, which is one. Here we can write

$$Pr(\mathbf{x} = \mathbf{e}_{k}) = \prod_{j=1}^{K} \lambda_{j}^{x_{j}} = \lambda_{k}, \tag{3.8}$$

where xj is the jth element of x.
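
The two ways of writing the categorical distribution give the same probabilities. The following sketch compares them; the names `categorical_pmf`, `one_hot`, `categorical_pmf_one_hot`, and `sample_categorical` are hypothetical, and outcomes are indexed from 1 to match the text.

```python
import random
from typing import List, Sequence


def categorical_pmf(k: int, lam: Sequence[float]) -> float:
    """Pr(x = k) = lam_k for k in {1, ..., K}."""
    return lam[k - 1]


def one_hot(k: int, K: int) -> List[int]:
    """Unit vector e_k: zeros everywhere except a one in position k."""
    return [1 if j == k else 0 for j in range(1, K + 1)]


def categorical_pmf_one_hot(x: Sequence[int], lam: Sequence[float]) -> float:
    """Pr(x = e_k) = prod_j lam_j ** x_j, which picks out lam_k."""
    p = 1.0
    for lam_j, x_j in zip(lam, x):
        p *= lam_j ** x_j
    return p


def sample_categorical(lam: Sequence[float], rng: random.Random) -> int:
    """Draw k in {1, ..., K} by inverting the cumulative sum of lam."""
    u, cumulative = rng.random(), 0.0
    for k, lam_k in enumerate(lam, start=1):
        cumulative += lam_k
        if u < cumulative:
            return k
    return len(lam)  # guard against round-off in the cumulative sum


if __name__ == "__main__":
    lam = [0.2, 0.5, 0.3]
    for k in (1, 2, 3):
        print(k, categorical_pmf(k, lam), categorical_pmf_one_hot(one_hot(k, 3), lam))
```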

3.4 Dirichlet distribution

The Dirichlet distribution (Figure 3.4) is defined over K continuous values λ1…λK where λk ∈ [0, 1] and ∑k λk = 1. Hence it is suitable for defining a distribution over the parameters of the categorical distribution.

In K dimensions the Dirichlet distribution has K parameters α1…αK, each of which can take any positive value. The relative values of the parameters determine the expected values E[λ1]…E[λK]. The absolute values determine the concentration around the expected value. We write

$$Pr(\lambda_{1 \ldots K}) = \frac{\Gamma\left[\sum_{k=1}^{K} \alpha_{k}\right]}{\prod_{k=1}^{K} \Gamma[\alpha_{k}]} \prod_{k=1}^{K} \lambda_{k}^{\alpha_{k} - 1}, \tag{3.9}$$

Figure 3.4 The Dirichlet distribution in K dimensions is defined on values λ1, λ2, …, λK, such that ∑k λk = 1 and λk ∈ [0, 1] ∀ k ∈ {1…K}. a) For K = 3, this corresponds to a triangular section of the plane ∑k λk = 1. In K dimensions, the Dirichlet is defined by K positive parameters α1…K. The ratio of the parameters determines the expected value for the distribution. The absolute values determine the concentration: the distribution is highly peaked around the expected value at high parameter values but pushed away from the expected value at low parameter values. b–e) Ratio of parameters is equal, absolute values increase. f–i) Ratio of parameters favors α3 > α2 > α1, absolute values increase.

or for short

$$Pr(\lambda_{1 \ldots K}) = \text{Dir}_{\lambda_{1 \ldots K}}[\alpha_{1 \ldots K}]. \tag{3.10}$$

Just as the Bernoulli distribution was a special case of the categorical distribution with two possible outcomes, so the beta distribution is a special case of the Dirichlet distribution where the dimensionality is two.
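
To make the relationship to the beta distribution concrete, the sketch below evaluates the Dirichlet density and checks that the K = 2 case matches the beta density. The sampler normalizes independent gamma draws, which is a standard construction but is not described in this chapter; the names `dirichlet_pdf` and `sample_dirichlet` are hypothetical.

```python
import math
import random
from typing import List, Sequence


def dirichlet_pdf(lam: Sequence[float], alpha: Sequence[float]) -> float:
    """Gamma[sum_k a_k] / prod_k Gamma[a_k] * prod_k lam_k ** (a_k - 1).

    Assumes every lam_k is strictly positive and that lam sums to one.
    """
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    log_p = log_norm + sum((a - 1.0) * math.log(l) for a, l in zip(alpha, lam))
    return math.exp(log_p)


def sample_dirichlet(alpha: Sequence[float], rng: random.Random) -> List[float]:
    """Normalize independent Gamma(alpha_k, 1) draws so that they sum to one."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [value / total for value in g]


if __name__ == "__main__":
    # K = 2: the Dirichlet over [lam, 1 - lam] equals the beta density at lam.
    lam, a, b = 0.3, 2.0, 5.0
    beta_value = (math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
                  * lam ** (a - 1.0) * (1.0 - lam) ** (b - 1.0))
    print(dirichlet_pdf([lam, 1.0 - lam], [a, b]), beta_value)

    # The sample mean of each component should approach alpha_k / sum(alpha).
    rng = random.Random(0)
    draws = [sample_dirichlet([1.0, 2.0, 3.0], rng) for _ in range(10000)]
    print([sum(d[k] for d in draws) / len(draws) for k in range(3)])
```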

3.5 Univariate normal distribution

The univariate normal or Gaussian distribution (Figure 3.5) is defined on continuous values x ∈ (−∞, ∞). In vision, it is common to ignore the fact that the intensity of a pixel is quantized and model it with the continuous normal distribution. The world state may also be described by the normal distribution. For example, the distance to an object could be represented in this way.

The normal distribution has two parameters, the mean μ and the variance σ². The parameter μ can take any value and determines the position of the peak. The parameter σ² takes only positive values and determines the width of the distribution. The normal distribution is defined as

$$Pr(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\left[-0.5\,\frac{(x - \mu)^{2}}{\sigma^{2}}\right], \tag{3.11}$$

and we will abbreviate this by writing

$$Pr(x) = \text{Norm}_{x}[\mu, \sigma^{2}]. \tag{3.12}$$
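
A short sketch of the normal density follows; `normal_pdf` is a hypothetical name. It illustrates that the peak sits at μ with height 1/√(2πσ²), so a larger variance gives a wider and flatter curve.

```python
import math


def normal_pdf(x: float, mu: float, sigma_sq: float) -> float:
    """1 / sqrt(2 pi sigma^2) * exp[-0.5 (x - mu)^2 / sigma^2]."""
    if sigma_sq <= 0.0:
        raise ValueError("the variance sigma_sq must be positive")
    return math.exp(-0.5 * (x - mu) ** 2 / sigma_sq) / math.sqrt(2.0 * math.pi * sigma_sq)


if __name__ == "__main__":
    # Peak height falls as the variance grows.
    for sigma_sq in (0.5, 1.0, 4.0):
        print(sigma_sq, normal_pdf(0.0, 0.0, sigma_sq))

    # Midpoint-rule check that the density integrates to (approximately) one.
    h = 0.001
    total = sum(normal_pdf(-10.0 + (i + 0.5) * h, 1.0, 2.0) * h for i in range(20000))
    print(total)
```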

3.6 Normal-scaled inverse gamma distribution

The normal-scaled inverse gamma distribution (Figure 3.6) is defined over a pair of continuous values μ, σ², the first of which can take any value and the second of which is constrained to be positive. As such it can define a distribution over the mean and variance parameters of the normal distribution.

Figure 3.5 The univariate normal distribution is defined on x ∈ ℝ and has two parameters {μ, σ²}. The mean parameter μ determines the expected value and the variance σ² determines the concentration about the mean so that as σ² increases, the distribution becomes wider and flatter.

Figure 3.6 The normal-scaled inverse gamma distribution defines a probability distribution over bivariate continuous values μ, σ² where μ ∈ (−∞, ∞) and σ² ∈ (0, ∞). a) Distribution with parameters [α, β, γ, δ] = [1, 1, 1, 0]. b) Varying α. c) Varying β. d) Varying γ. e) Varying δ.

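The normal-scaled inverse gamma has four parameters α, β, γ, δ, where α, β, and γ are positive real numbers but δ can take any value. It has pdf

$$Pr(\mu, \sigma^{2}) = \frac{\sqrt{\gamma}}{\sigma\sqrt{2\pi}}\, \frac{\beta^{\alpha}}{\Gamma[\alpha]} \left(\frac{1}{\sigma^{2}}\right)^{\alpha + 1} \exp\left[-\frac{2\beta + \gamma(\delta - \mu)^{2}}{2\sigma^{2}}\right], \tag{3.13}$$

or for short

$$Pr(\mu, \sigma^{2}) = \text{NormInvGam}_{\mu, \sigma^{2}}[\alpha, \beta, \gamma, \delta]. \tag{3.14}$$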
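
One way to build intuition for this density is to note that it factorizes into an inverse gamma distribution over σ² and a normal distribution over μ with mean δ and variance σ²/γ. The sketch below uses that factorization to draw samples; it is an illustration under that assumption, and the names `norm_inv_gam_pdf` and `sample_norm_inv_gam` are hypothetical.

```python
import math
import random
from typing import Tuple


def norm_inv_gam_pdf(mu: float, sigma_sq: float,
                     alpha: float, beta: float, gamma: float, delta: float) -> float:
    """Normal-scaled inverse gamma density over (mu, sigma_sq), sigma_sq > 0."""
    sigma = math.sqrt(sigma_sq)
    norm = math.sqrt(gamma) / (sigma * math.sqrt(2.0 * math.pi)) * beta ** alpha / math.gamma(alpha)
    return (norm * (1.0 / sigma_sq) ** (alpha + 1.0)
            * math.exp(-(2.0 * beta + gamma * (delta - mu) ** 2) / (2.0 * sigma_sq)))


def sample_norm_inv_gam(alpha: float, beta: float, gamma: float, delta: float,
                        rng: random.Random) -> Tuple[float, float]:
    """Draw (mu, sigma_sq) using the normal times inverse gamma factorization."""
    # sigma_sq is the reciprocal of a gamma draw with shape alpha and rate beta.
    # random.gammavariate takes (shape, scale), so the scale is 1 / beta.
    sigma_sq = 1.0 / rng.gammavariate(alpha, 1.0 / beta)
    # Given sigma_sq, mu is normal with mean delta and variance sigma_sq / gamma.
    mu = rng.gauss(delta, math.sqrt(sigma_sq / gamma))
    return mu, sigma_sq


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        mu, sigma_sq = sample_norm_inv_gam(1.0, 1.0, 1.0, 0.0, rng)
        print(mu, sigma_sq, norm_inv_gam_pdf(mu, sigma_sq, 1.0, 1.0, 1.0, 0.0))
```

Each sample is a (mean, variance) pair, that is, the parameters of one univariate normal distribution.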

3.7 Multivariate normal distribution

The multivariate normal or Gaussian distribution models D-dimensional variables x where each of the D elements x1…xD is continuous and lies in the range (−∞, +∞) (Figure 3.7). As such the univariate normal distribution is a special case of the multivariate normal where the number of elements D is one. In machine vision the multivariate normal might model the joint distribution of the intensities of D pixels within a region of the image. The state of the world might also be described by this distribution. For example, the multivariate normal might describe the joint uncertainty in the 3D position (x, y, z) of an object in the scene.

Figure 3.7 The multivariate normal distribution models D-dimensional variables x = [x1…xD]ᵀ where each dimension xd is continuous and real. It is defined by a D × 1 vector μ defining the mean of the distribution and a D × D covariance matrix Σ which determines the shape. The iso-contours of the distribution are ellipsoids where the center of the ellipsoid is determined by μ and the shape by Σ. This figure depicts a bivariate distribution, where the covariance is illustrated by drawing one of these ellipsoids.

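The multivariate normal distribution has two parameters: the mean μ and covariance Σ. The mean μ is a D × 1 vector that describes the position of the distribution. The covariance Σ is a symmetric D × D positive definite matrix so that zᵀΣz is positive for any nonzero real vector z. The probability density function has the following form

$$Pr(\mathbf{x}) = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left[-0.5\, (\mathbf{x} - \boldsymbol{\mu})^{T} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right], \tag{3.15}$$

or for short

$$Pr(\mathbf{x}) = \text{Norm}_{\mathbf{x}}[\boldsymbol{\mu}, \boldsymbol{\Sigma}]. \tag{3.16}$$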

The multivariate normal distribution will be used extensively throughout this book, and we devote the whole of Chapter 5 to describing its properties.
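
As an illustration of how the density might be evaluated in practice, the sketch below factorizes the covariance with a Cholesky decomposition instead of inverting it explicitly. This is one common numerical choice, not something prescribed by the text, and the names `cholesky` and `mvn_pdf` are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def cholesky(a: Matrix) -> Matrix:
    """Lower-triangular L with L L^T = a; raises if a is not positive definite."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("covariance must be positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L


def mvn_pdf(x: Sequence[float], mu: Sequence[float], cov: Matrix) -> float:
    """Multivariate normal density at x for mean mu and covariance cov."""
    D = len(x)
    L = cholesky(cov)
    # Solve L y = (x - mu) by forward substitution. Then y^T y equals the
    # quadratic form (x - mu)^T cov^{-1} (x - mu).
    y: List[float] = []
    for i in range(D):
        r = x[i] - mu[i] - sum(L[i][k] * y[k] for k in range(i))
        y.append(r / L[i][i])
    quad = sum(v * v for v in y)
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(D))  # log |cov|
    return math.exp(-0.5 * (D * math.log(2.0 * math.pi) + log_det + quad))


if __name__ == "__main__":
    # With D = 1 this reduces to the univariate normal density.
    print(mvn_pdf([0.5], [0.0], [[2.0]]))
    print(mvn_pdf([1.0, 0.0], [0.0, 0.0], [[2.0, 1.0], [1.0, 2.0]]))
```

The Cholesky step also doubles as a validity check: it fails exactly when the supplied matrix is not positive definite.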

3.8 Normal inverse Wishart distribution

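The normal inverse Wishart distribution defines a distribution over a D × 1 vector μ and a D × D positive definite matrix Σ. As such it is suitable for describing uncertainty in the parameters of a multivariate normal distribution. The normal inverse Wishart has four parameters α, Ψ, γ, δ, where α and γ are positive scalars, δ is a D × 1 vector, and Ψ is a positive definite D × D matrix:

$$Pr(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{\gamma^{D/2}\, |\boldsymbol{\Psi}|^{\alpha/2} \exp\left[-0.5\left(\text{Tr}\left[\boldsymbol{\Psi}\boldsymbol{\Sigma}^{-1}\right] + \gamma (\boldsymbol{\mu} - \boldsymbol{\delta})^{T} \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \boldsymbol{\delta})\right)\right]}{2^{\alpha D/2}\, (2\pi)^{D/2}\, |\boldsymbol{\Sigma}|^{(\alpha + D + 2)/2}\, \Gamma_{D}[\alpha/2]}, \tag{3.17}$$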

Figure 3.8 Sampling from 2D normal inverse Wishart distribution. a) Each sample consists of a mean vector and covariance matrix, here visualized with 2D ellipses illustrating the iso-contour of the associated Gaussian at a Mahalanobis distance of 2. b) Changing α modifies the dispersion of covariances observed. c) Changing Ψ modifies the average covariance. d) Changing γ modifies the dispersion of mean vectors observed. e) Changing δ modifies the average value of the mean vectors.

where ΓD[•] is the multivariate gamma function and Tr[Ψ] returns the trace of the matrix Ψ (see Appendix C.2.4). For short we will write

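$$Pr(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \text{NorIWis}_{\boldsymbol{\mu}, \boldsymbol{\Sigma}}[\alpha, \boldsymbol{\Psi}, \gamma, \boldsymbol{\delta}]. \tag{3.18}$$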

The mathematical form of the normal inverse Wishart distribution is rather opaque. However, it is just a function that produces a positive value for any valid mean vector μ and covariance matrix Σ, such that when we integrate over all possible values of μ and Σ, the answer is one. It is hard to visualize the normal inverse Wishart, but easy to draw samples and examine them: each sample is the mean and covariance of a normal distribution (Figure 3.8).
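
The sampling procedure behind Figure 3.8 is not spelled out in this chapter. The sketch below shows one possible way to draw samples for D = 2. It assumes an integer α ≥ 2 so that the inverse Wishart draw can be built from α Gaussian outer products, and it assumes the usual factorization into an inverse Wishart over Σ and a normal over μ with mean δ and covariance Σ/γ. The helper names (`inv2`, `chol2`, `sample_gauss2`, `sample_niw2`) are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Matrix2 = List[List[float]]


def inv2(m: Matrix2) -> Matrix2:
    """Inverse of a 2 x 2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


def chol2(m: Matrix2) -> Matrix2:
    """Lower-triangular Cholesky factor of a 2 x 2 positive definite matrix."""
    l00 = math.sqrt(m[0][0])
    l10 = m[1][0] / l00
    l11 = math.sqrt(m[1][1] - l10 * l10)
    return [[l00, 0.0], [l10, l11]]


def sample_gauss2(mean: Sequence[float], cov: Matrix2, rng: random.Random) -> List[float]:
    """Draw from a 2D normal as mean + L z, with L L^T = cov and z ~ Norm[0, I]."""
    L = chol2(cov)
    z0, z1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return [mean[0] + L[0][0] * z0, mean[1] + L[1][0] * z0 + L[1][1] * z1]


def sample_niw2(alpha: int, psi: Matrix2, gamma: float, delta: Sequence[float],
                rng: random.Random) -> Tuple[List[float], Matrix2]:
    """Draw (mu, Sigma) from a 2D normal inverse Wishart; alpha is an integer >= 2."""
    if alpha < 2:
        raise ValueError("this sketch needs an integer alpha >= 2")
    # Wishart draw: sum of alpha outer products z z^T with z ~ Norm[0, psi^{-1}].
    psi_inv = inv2(psi)
    W = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(alpha):
        z = sample_gauss2([0.0, 0.0], psi_inv, rng)
        for i in range(2):
            for j in range(2):
                W[i][j] += z[i] * z[j]
    # The inverse of a Wishart draw is an inverse Wishart draw.
    sigma = inv2(W)
    # Given Sigma, mu is normal with mean delta and covariance Sigma / gamma.
    scaled = [[sigma[i][j] / gamma for j in range(2)] for i in range(2)]
    mu = sample_gauss2(delta, scaled, rng)
    return mu, sigma


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        mu, sigma = sample_niw2(10, [[7.0, 0.0], [0.0, 7.0]], 2.0, [1.0, -1.0], rng)
        print(mu, sigma)
```

Each returned pair can be read as the mean vector and covariance matrix of one 2D Gaussian, which is how the samples in Figure 3.8 are visualized.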

3.9 Conjugacy

We have argued that the beta distribution can represent probabilities over the parameters of the Bernoulli. Similarly the Dirichlet defines a distribution over the parameters of the categorical, and there are analogous relationships between the normal-scaled inverse gamma and univariate normal and the normal inverse Wishart and the multivariate normal.

These pairs were carefully chosen because they have a special relationship: in each case, the former distribution is conjugate to the latter. The beta is conjugate to the Bernoulli, the Dirichlet is conjugate to the categorical, and so on. When we multiply a distribution by its conjugate, the result is proportional to a new distribution that has the same form as the conjugate. For example,

$$\text{Bern}_{x}[\lambda] \cdot \text{Beta}_{\lambda}[\alpha, \beta] = \kappa(x, \alpha, \beta) \cdot \text{Beta}_{\lambda}[\tilde{\alpha}, \tilde{\beta}], \tag{3.19}$$

where κ is a scaling factor that is constant with respect to the variable of interest, λ. It is important to realize that this was not necessarily the case: if we had picked any distribution other than the beta, then this product would not have retained the same form. For this case, the relationship in Equation 3.19 is easy to prove

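$$\begin{aligned}
\text{Bern}_{x}[\lambda] \cdot \text{Beta}_{\lambda}[\alpha, \beta]
&= \lambda^{x} (1 - \lambda)^{1 - x}\, \frac{\Gamma[\alpha + \beta]}{\Gamma[\alpha]\,\Gamma[\beta]}\, \lambda^{\alpha - 1} (1 - \lambda)^{\beta - 1} \\
&= \frac{\Gamma[\alpha + \beta]}{\Gamma[\alpha]\,\Gamma[\beta]}\, \lambda^{x + \alpha - 1} (1 - \lambda)^{1 - x + \beta - 1} \\
&= \frac{\Gamma[\alpha + \beta]}{\Gamma[\alpha]\,\Gamma[\beta]}\, \frac{\Gamma[x + \alpha]\,\Gamma[1 - x + \beta]}{\Gamma[x + \alpha + 1 - x + \beta]}\, \text{Beta}_{\lambda}[x + \alpha, 1 - x + \beta] \\
&= \kappa(x, \alpha, \beta) \cdot \text{Beta}_{\lambda}[\tilde{\alpha}, \tilde{\beta}],
\end{aligned} \tag{3.20}$$

where in the third line we have both multiplied and divided by the constant associated with Beta_λ[α̃, β̃], with α̃ = x + α and β̃ = 1 − x + β.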

The conjugate relationship is important because we take products of distributions during both learning (fitting distributions) and evaluating the model (assessing the probability of new data under the fitted distribution). The conjugate relationship means that these products can both be computed neatly in closed form.
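
The relationship in Equation 3.19 can also be checked numerically. The sketch below is an illustration with hypothetical function names (`bernoulli_pmf`, `beta_pdf`, `kappa`): it evaluates both sides for a few values of λ, taking κ to be the ratio of the beta normalizing constants before and after the update α̃ = α + x, β̃ = β + 1 − x.

```python
import math


def bernoulli_pmf(x: int, lam: float) -> float:
    """lam**x * (1 - lam)**(1 - x) for x in {0, 1}."""
    return lam ** x * (1.0 - lam) ** (1 - x)


def log_beta_constant(alpha: float, beta: float) -> float:
    """log of Gamma[a + b] / (Gamma[a] Gamma[b])."""
    return math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)


def beta_pdf(lam: float, alpha: float, beta: float) -> float:
    """Beta density for lam strictly inside (0, 1)."""
    return math.exp(log_beta_constant(alpha, beta)
                    + (alpha - 1.0) * math.log(lam)
                    + (beta - 1.0) * math.log(1.0 - lam))


def kappa(x: int, alpha: float, beta: float) -> float:
    """Scaling factor: prior beta constant divided by the updated beta constant."""
    return math.exp(log_beta_constant(alpha, beta)
                    - log_beta_constant(alpha + x, beta + 1 - x))


if __name__ == "__main__":
    alpha, beta = 2.0, 3.0
    for x in (0, 1):
        for lam in (0.1, 0.5, 0.9):
            lhs = bernoulli_pmf(x, lam) * beta_pdf(lam, alpha, beta)
            rhs = kappa(x, alpha, beta) * beta_pdf(lam, alpha + x, beta + 1 - x)
            print(x, lam, lhs, rhs)  # the two columns should agree
```

Note that κ does not depend on λ. Using Γ[z + 1] = zΓ[z], it simplifies to α/(α + β) when x = 1 and β/(α + β) when x = 0.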

Summary

We use probability distributions to describe both the world state and the image data. We have presented four distributions (Bernoulli, categorical, univariate normal, multivariate normal) that are suited to this purpose. We also presented four other distributions (beta, Dirichlet, normal-scaled inverse gamma, normal inverse Wishart) that can be used to describe the uncertainty in parameters of the first; they can hence describe the uncertainty in the fitted model. These four pairs of distributions have a special relationship: each distribution from the second set is conjugate to one from the first set. As we shall see, the conjugate relationship makes it easier to fit these distributions to observed data and evaluate new data under the fitted model.

Notes

Throughout this book, I use rather esoteric terminology for discrete distributions. I distinguish between the binomial distribution (probability of getting M successes in N binary trials) and the Bernoulli distribution (the binary trial itself or probability of getting a success or failure in one trial) and talk exclusively about the latter distribution. I take a similar approach to discrete variables which can take K values. The multinomial distribution assigns a probability to observing the values {1, 2, …, K} with frequency {M1, M2, …, MK} given N trials. The categorical distribution is a special case of this with N = 1. Most other authors do not make this distinction and would term this “multinomial” as well.

A more complete list of common probability distributions and details of their properties are given in Appendix B of Bishop (2006). Further information about conjugacy can be found in Chapter 2 of Bishop (2006) or any textbook on Bayesian methods, such as that of Gelman et al. (2004). Much more information about the normal distribution is provided in Chapter 5 of this book.

Problems

3.1 Consider a variable x which is Bernoulli distributed with parameter λ. Show that the mean E[x] is λ and the variance E[(x − E[x])²] is λ(1 − λ).

3.2 Calculate an expression for the mode (position of the peak) of the beta distribution with α, β > 1 in terms of the parameters α and β.

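3.3 The mean and variance of the beta distribution are given by the expressions

$$\begin{aligned}
E[\lambda] &= \mu = \frac{\alpha}{\alpha + \beta} \\
E[(\lambda - \mu)^{2}] &= \sigma^{2} = \frac{\alpha\beta}{(\alpha + \beta)^{2} (\alpha + \beta + 1)}.
\end{aligned}$$

We may wish to choose the parameters α and β so that the distribution has a particular mean μ and variance σ². Derive suitable expressions for α and β in terms of μ and σ².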

3.4 All of the distributions in this chapter are members of the exponential family and can be written in the form

$$Pr(x \mid \theta) = a[x] \exp\left[b[\theta]\, c[x] - d[\theta]\right],$$

where a[x] and c[x] are functions of the data and b[θ] and d[θ] are functions of the parameters. Find the functions a[x], b[θ], c[x] and d[θ] that allow the beta distribution to be represented in the generalized form of the exponential family.

3.5 Use integration by parts to prove that if

$$\Gamma[z] = \int_{0}^{\infty} t^{z - 1} e^{-t}\, dt,$$

then

$$\Gamma[z + 1] = z\,\Gamma[z].$$

3.6 Consider a restricted family of univariate normal distributions where the variance is always 1, so that

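$$Pr(x \mid \mu) = \frac{1}{\sqrt{2\pi}} \exp\left[-0.5\, (x - \mu)^{2}\right].$$

Show that a normal distribution over the parameter μ

$$Pr(\mu) = \text{Norm}_{\mu}[\mu_{p}, \sigma_{p}^{2}]$$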

has a conjugate relationship to the restricted normal distribution.

3.7 For the normal distribution, find the functions a[x], b[θ], c[x], and d[θ] that allow it to be represented in the generalized form of the exponential family (see Problem 3.4).

3.8 Calculate an expression for the mode (position of the peak in μ, σ² space) of the normal-scaled inverse gamma distribution in terms of the parameters α, β, γ, δ.

3.9 Show that the more general form of the conjugate relation in which we multiply I Bernoulli distributions by the conjugate beta prior is given by

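$$\prod_{i=1}^{I} \text{Bern}_{x_{i}}[\lambda] \cdot \text{Beta}_{\lambda}[\alpha, \beta] = \kappa \cdot \text{Beta}_{\lambda}[\tilde{\alpha}, \tilde{\beta}],$$

where

$$\begin{aligned}
\kappa &= \frac{\Gamma[\alpha + \beta]\, \Gamma\left[\alpha + \sum_{i} x_{i}\right] \Gamma\left[\beta + \sum_{i} (1 - x_{i})\right]}{\Gamma[\alpha + \beta + I]\, \Gamma[\alpha]\, \Gamma[\beta]} \\
\tilde{\alpha} &= \alpha + \sum_{i} x_{i} \\
\tilde{\beta} &= \beta + \sum_{i} (1 - x_{i}).
\end{aligned}$$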

3.10 Prove the conjugate relation

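$$\prod_{i=1}^{I} \text{Cat}_{x_{i}}[\lambda_{1 \ldots K}] \cdot \text{Dir}_{\lambda_{1 \ldots K}}[\alpha_{1 \ldots K}] = \kappa \cdot \text{Dir}_{\lambda_{1 \ldots K}}[\tilde{\alpha}_{1 \ldots K}],$$

where

$$\begin{aligned}
\kappa &= \frac{\Gamma\left[\sum_{j=1}^{K} \alpha_{j}\right]}{\Gamma\left[I + \sum_{j=1}^{K} \alpha_{j}\right]} \cdot \frac{\prod_{j=1}^{K} \Gamma[\alpha_{j} + N_{j}]}{\prod_{j=1}^{K} \Gamma[\alpha_{j}]} \\
\tilde{\alpha}_{1 \ldots K} &= [\alpha_{1} + N_{1}, \alpha_{2} + N_{2}, \ldots, \alpha_{K} + N_{K}],
\end{aligned}$$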

and Nk is the total number of times that the variable took the value k.

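3.11 Show that the conjugate relation between the univariate normal and the normal-scaled inverse gamma is given by

$$\prod_{i=1}^{I} \text{Norm}_{x_{i}}[\mu, \sigma^{2}] \cdot \text{NormInvGam}_{\mu, \sigma^{2}}[\alpha, \beta, \gamma, \delta] = \kappa \cdot \text{NormInvGam}_{\mu, \sigma^{2}}[\tilde{\alpha}, \tilde{\beta}, \tilde{\gamma}, \tilde{\delta}],$$

where

$$\begin{aligned}
\kappa &= \frac{1}{(2\pi)^{I/2}}\, \frac{\sqrt{\gamma}\, \beta^{\alpha}}{\sqrt{\tilde{\gamma}}\, \tilde{\beta}^{\tilde{\alpha}}}\, \frac{\Gamma[\tilde{\alpha}]}{\Gamma[\alpha]} \\
\tilde{\alpha} &= \alpha + I/2 \\
\tilde{\beta} &= \frac{\sum_{i} x_{i}^{2}}{2} + \beta + \frac{\gamma\delta^{2}}{2} - \frac{\left(\gamma\delta + \sum_{i} x_{i}\right)^{2}}{2(\gamma + I)} \\
\tilde{\gamma} &= \gamma + I \\
\tilde{\delta} &= \frac{\gamma\delta + \sum_{i} x_{i}}{\gamma + I}.
\end{aligned}$$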

3.12 Show that the conjugate relationship between the multivariate normal and the normal inverse Wishart is given by

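$$\prod_{i=1}^{I} \text{Norm}_{\mathbf{x}_{i}}[\boldsymbol{\mu}, \boldsymbol{\Sigma}] \cdot \text{NorIWis}_{\boldsymbol{\mu}, \boldsymbol{\Sigma}}[\alpha, \boldsymbol{\Psi}, \gamma, \boldsymbol{\delta}] = \kappa \cdot \text{NorIWis}_{\boldsymbol{\mu}, \boldsymbol{\Sigma}}[\tilde{\alpha}, \tilde{\boldsymbol{\Psi}}, \tilde{\gamma}, \tilde{\boldsymbol{\delta}}],$$

where

$$\begin{aligned}
\kappa &= \frac{1}{\pi^{ID/2}}\, \frac{|\boldsymbol{\Psi}|^{\alpha/2}}{|\tilde{\boldsymbol{\Psi}}|^{\tilde{\alpha}/2}}\, \frac{\Gamma_{D}[\tilde{\alpha}/2]}{\Gamma_{D}[\alpha/2]}\, \frac{\gamma^{D/2}}{\tilde{\gamma}^{D/2}} \\
\tilde{\alpha} &= \alpha + I \\
\tilde{\boldsymbol{\Psi}} &= \boldsymbol{\Psi} + \gamma \boldsymbol{\delta}\boldsymbol{\delta}^{T} + \sum_{i=1}^{I} \mathbf{x}_{i}\mathbf{x}_{i}^{T} - \frac{1}{\gamma + I} \left(\gamma\boldsymbol{\delta} + \sum_{i=1}^{I} \mathbf{x}_{i}\right) \left(\gamma\boldsymbol{\delta} + \sum_{i=1}^{I} \mathbf{x}_{i}\right)^{T} \\
\tilde{\gamma} &= \gamma + I \\
\tilde{\boldsymbol{\delta}} &= \frac{\gamma\boldsymbol{\delta} + \sum_{i=1}^{I} \mathbf{x}_{i}}{\gamma + I}.
\end{aligned}$$

You may need to use the relation Tr[zzᵀA⁻¹] = zᵀA⁻¹z.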

___________________________________

¹ The gamma function is defined as $\Gamma[z] = \int_{0}^{\infty} t^{z - 1} e^{-t}\, dt$ and is closely related to factorials, so that for positive integers Γ[z] = (z − 1)! and Γ[z + 1] = zΓ[z].