Computer Vision: Models, Learning, and Inference (2012)
Part I
Probability
Chapter 3
Common probability distributions
In Chapter 2 we introduced abstract rules for manipulating probabilities. To use these rules we will need to define some probability distributions. The choice of distribution Pr(x) will depend on the domain of the data x that we are modeling (Table 3.1).
Data Type | Domain | Distribution
univariate, discrete, binary | x ∈ {0, 1} | Bernoulli
univariate, discrete, multivalued | x ∈ {1, 2, …, K} | categorical
univariate, continuous, unbounded | x ∈ ℝ | univariate normal
univariate, continuous, bounded | x ∈ [0, 1] | beta
multivariate, continuous, unbounded | x ∈ ℝ^{K} | multivariate normal
multivariate, continuous, bounded, sums to one | x = [x_{1}, x_{2}, …, x_{K}]^{T}, x_{k} ∈ [0, 1], Σ_{k} x_{k} = 1 | Dirichlet
bivariate, continuous, x_{1} unbounded, x_{2} bounded below | x = [x_{1}, x_{2}], x_{1} ∈ ℝ, x_{2} > 0 | normal-scaled inverse gamma
vector x and matrix X, x unbounded, X square and positive definite | x ∈ ℝ^{K}, X ∈ ℝ^{K×K} | normal inverse Wishart
Table 3.1 Common probability distributions: the choice of distribution depends on the type/domain of data to be modeled.
Probability distributions such as the categorical and normal distributions are obviously useful for modeling visual data. However, the need for some of the other distributions is not so obvious; for example, the Dirichlet distribution models K positive numbers that sum to one. Visual data do not normally take this form.
Distribution | Domain | Parameters Modeled by
Bernoulli | x ∈ {0, 1} | beta
categorical | x ∈ {1, 2, …, K} | Dirichlet
univariate normal | x ∈ ℝ | normal inverse gamma
multivariate normal | x ∈ ℝ^{K} | normal inverse Wishart
Table 3.2 Common distributions used for modeling (left) and their associated domains (center). For each of these distributions, there is a second associated distribution over the parameters (right).
The explanation is as follows: when we fit probability models to data, we need to know how uncertain we are about the fit. This uncertainty is represented as a probability distribution over the parameters of the fitted model. So for each distribution used for modeling, there is a second distribution over the associated parameters (Table 3.2). For example, the Dirichlet is used to model the parameters of the categorical distribution. In this context, the parameters of the Dirichlet would be known as hyperparameters. More generally, the hyperparameters determine the shape of the distribution over the parameters of the original distribution.
We will now work through the distributions in Table 3.2 before looking more closely at the relationship between these pairs of distributions.
3.1 Bernoulli distribution
The Bernoulli distribution (Figure 3.1) is a discrete distribution that models binary trials: it describes the situation where there are only two possible outcomes x ∈ {0, 1} which are referred to as “failure” and “success.” In machine vision, the Bernoulli distribution could be used to model the data. For example, it might describe the probability of a pixel taking an intensity value of greater or less than 128. Alternatively, it could be used to model the state of the world. For example, it might describe the probability that a face is present or absent in the image.
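The Bernoulli has a single parameter λ ∈ [0, 1] which defines the probability of observing a success x = 1. The distribution is hence

$$\begin{aligned} Pr(x=0) &= 1-\lambda \\ Pr(x=1) &= \lambda. \end{aligned} \tag{3.1}$$

We can alternatively express this as

$$Pr(x) = \lambda^{x}(1-\lambda)^{1-x}, \tag{3.2}$$

and we will sometimes use the equivalent notation

$$Pr(x) = \text{Bern}_{x}[\lambda]. \tag{3.3}$$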
Figure 3.1 Bernoulli distribution. The Bernoulli distribution is a discrete distribution with two possible outcomes, x ∈ {0, 1} which are referred to as failure and success, respectively. It is governed by a single parameter λ that determines the probability of success such that Pr(x = 0) = 1 − λ and Pr(x = 1) = λ.
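As a small illustration of the Bernoulli definition, the following Python sketch evaluates the probability mass function in its product form λ^x (1 − λ)^(1 − x) and checks it against simulated trials. The function names `bernoulli_pmf` and `bernoulli_sample`, the seed, and the value λ = 0.3 are illustrative choices, not part of the text.

```python
import random

def bernoulli_pmf(x: int, lam: float) -> float:
    """Pr(x) = lam**x * (1 - lam)**(1 - x) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("x must be 0 (failure) or 1 (success)")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam ** x * (1.0 - lam) ** (1 - x)

def bernoulli_sample(lam: float, rng: random.Random) -> int:
    """Draw one binary trial: success (1) with probability lam."""
    return 1 if rng.random() < lam else 0

rng = random.Random(0)   # fixed seed so the run is repeatable
lam = 0.3                # illustrative parameter value
samples = [bernoulli_sample(lam, rng) for _ in range(10_000)]
empirical = sum(samples) / len(samples)  # fraction of successes

print(bernoulli_pmf(1, lam), bernoulli_pmf(0, lam), empirical)
```

The empirical fraction of successes should lie close to λ, which is what the single parameter is meant to encode.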
3.2 Beta distribution
The beta distribution (Figure 3.2) is a continuous distribution defined on a single variable λ where λ ∈ [0, 1]. As such it is suitable for representing uncertainty in the parameter λ of the Bernoulli distribution.
Figure 3.2 Beta distribution. The beta distribution is defined on [0, 1] and has parameters (α, β) whose relative values determine the expected value so that E[λ] = α/(α + β) (numbers in parentheses show the α, β for each curve). As the absolute values of (α, β) increase, the concentration around E[λ] increases. a) E[λ] = 0.5 for each curve, concentration varies. b) E[λ] = 0.25. c) E[λ] = 0.75.
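The beta distribution has two parameters α, β ∈ (0, ∞), which both take positive values and affect the shape of the curve as indicated in Figure 3.2. Mathematically, the beta distribution has the form

$$Pr(\lambda) = \frac{\Gamma[\alpha+\beta]}{\Gamma[\alpha]\,\Gamma[\beta]}\,\lambda^{\alpha-1}(1-\lambda)^{\beta-1}, \tag{3.4}$$

where Γ[•] is the gamma function.^{1} For short, we abbreviate this to

$$Pr(\lambda) = \text{Beta}_{\lambda}[\alpha,\beta]. \tag{3.5}$$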
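The beta density, Γ[α+β]/(Γ[α]Γ[β]) · λ^(α−1) (1−λ)^(β−1), can be checked numerically. The sketch below is one possible implementation using the standard-library log-gamma function; it verifies by a midpoint-rule sum that the density integrates to one and that the mean is α/(α + β). The helper `integrate`, the grid size, and the values α = 2, β = 6 are assumptions made for illustration.

```python
import math

def beta_pdf(lam: float, alpha: float, beta: float) -> float:
    """Beta density at lam in (0, 1); alpha and beta must be positive."""
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must be positive")
    if not 0.0 < lam < 1.0:
        raise ValueError("this sketch only evaluates the open interval (0, 1)")
    # Work in logs so that large alpha, beta do not overflow the gamma function.
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return math.exp(log_norm
                    + (alpha - 1) * math.log(lam)
                    + (beta - 1) * math.log(1 - lam))

def integrate(f, n: int = 10_000) -> float:
    """Midpoint rule on (0, 1); it never evaluates f at the endpoints."""
    h = 1.0 / n
    return h * sum(f((i + 0.5) * h) for i in range(n))

alpha, beta = 2.0, 6.0   # illustrative values with E[lam] = 0.25
total = integrate(lambda lam: beta_pdf(lam, alpha, beta))
mean = integrate(lambda lam: lam * beta_pdf(lam, alpha, beta))

print(total, mean, alpha / (alpha + beta))
```

Scaling both parameters by the same factor (for example α = 20, β = 60) leaves the mean unchanged but concentrates the density around it, as in Figure 3.2.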
3.3 Categorical distribution
The categorical distribution (Figure 3.3) is a discrete distribution that determines the probability of observing one of K possible outcomes. Hence, the Bernoulli distribution is a special case of the categorical distribution when there are only two outcomes. In machine vision the intensity data at a pixel is usually quantized into discrete levels and so can be modeled with a categorical distribution. The state of the world may also take one of several discrete values. For example, an image of a vehicle might be classified into {car, motorbike, van, truck} and our uncertainty over this state could be described by a categorical distribution.
Figure 3.3 The categorical distribution is a discrete distribution with K possible outcomes, x ∈ {1, 2,…K} and K parameters λ_{1}, λ_{2},…,λ_{K} where λ_{k} ≥ 0 and ∑_{k} λ_{k}=1. Each parameter represents the probability of observing one of the outcomes, so that the probability of observing x = k is given by λ_{k}. When the number of possible outcomes K is 2, the categorical reduces to the Bernoulli distribution.
The probabilities of observing the K outcomes are held in a K × 1 parameter vector λ = [λ_{1}, λ_{2},…,λ_{K}]^{T}, where λ_{k} ∈ [0, 1] and Σ^{K}_{k=1} λ_{k} = 1. The categorical distribution can be visualized as a normalized histogram with K bins and can be written as

$$Pr(x=k) = \lambda_{k}. \tag{3.6}$$

For short, we use the notation

$$Pr(x) = \text{Cat}_{x}[\boldsymbol{\lambda}]. \tag{3.7}$$

Alternatively, we can think of the data as taking values x ∈ {e_{1}, e_{2},…,e_{K}} where e_{k} is the k^{th} unit vector; all elements of e_{k} are zero except the k^{th}, which is one. Here we can write

$$Pr(\mathbf{x}=\mathbf{e}_{k}) = \prod_{j=1}^{K}\lambda_{j}^{x_{j}} = \lambda_{k}, \tag{3.8}$$

where x_{j} is the j^{th} element of x.
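The two ways of writing the categorical distribution (indexing the parameter vector by the outcome k, or taking a product over the elements of a unit vector) give the same probabilities. The Python sketch below checks this; the function names and the example parameter vector are illustrative assumptions.

```python
def check_params(lam):
    """Parameters must be non-negative and sum to one."""
    if any(l < 0 for l in lam) or abs(sum(lam) - 1.0) > 1e-9:
        raise ValueError("lam must be non-negative and sum to one")

def categorical_pmf(k: int, lam) -> float:
    """Pr(x = k) = lam_k for outcomes k in {1, ..., K}."""
    check_params(lam)
    if not 1 <= k <= len(lam):
        raise ValueError("k must be in 1..K")
    return lam[k - 1]

def unit_vector(k: int, K: int):
    """e_k: all zeros except a one in position k (1-based)."""
    return [1 if j == k else 0 for j in range(1, K + 1)]

def categorical_pmf_unit(x, lam) -> float:
    """Pr(x = e_k) as the product over j of lam_j ** x_j."""
    check_params(lam)
    prob = 1.0
    for lam_j, x_j in zip(lam, x):
        prob *= lam_j ** x_j
    return prob

lam = [0.1, 0.2, 0.3, 0.4]   # illustrative parameters, K = 4
K = len(lam)
pairs = [(categorical_pmf(k, lam), categorical_pmf_unit(unit_vector(k, K), lam))
         for k in range(1, K + 1)]
print(pairs)
```

Every factor with x_j = 0 contributes λ_j^0 = 1, so only the factor for the observed outcome survives in the product.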
3.4 Dirichlet distribution
The Dirichlet distribution (Figure 3.4) is defined over K continuous values λ_{1}…λ_{K} where λ_{k} ∈ [0, 1] and Σ^{K}_{k}_{=1} λ_{k} = 1. Hence it is suitable for defining a distribution over the parameters of the categorical distribution.
In K dimensions the Dirichlet distribution has K parameters α_{1}…α_{K} each of which can take any positive value. The relative values of the parameters determine the expected values E[λ_{1}]…E[λ_{K}]. The absolute values determine the concentration around the expected value. We write

$$Pr(\lambda_{1\ldots K}) = \frac{\Gamma\left[\sum_{k=1}^{K}\alpha_{k}\right]}{\prod_{k=1}^{K}\Gamma[\alpha_{k}]}\prod_{k=1}^{K}\lambda_{k}^{\alpha_{k}-1}, \tag{3.9}$$

or for short

$$Pr(\lambda_{1\ldots K}) = \text{Dir}_{\lambda_{1\ldots K}}[\alpha_{1\ldots K}]. \tag{3.10}$$

Figure 3.4 The Dirichlet distribution in K dimensions is defined on values λ_{1}, λ_{2},…,λ_{K}, such that ∑_{k} λ_{k} = 1 and λ_{k} ∈ [0, 1] ∀ k ∈ {1…K}. a) For K=3, this corresponds to a triangular section of the plane ∑_{k} λ_{k} = 1. In K dimensions, the Dirichlet is defined by K positive parameters α_{1…K}. The ratio of the parameters determines the expected value for the distribution. The absolute values determine the concentration: the distribution is highly peaked around the expected value at high parameter values but pushed away from the expected value at low parameter values. b–e) Ratio of parameters is equal, absolute values increase. f–i) Ratio of parameters favors α_{3} > α_{2} > α_{1}, absolute values increase.
Just as the Bernoulli distribution was a special case of the categorical distribution with two possible outcomes, so the beta distribution is a special case of the Dirichlet distribution where the dimensionality is two.
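The claim that the beta is the two-dimensional Dirichlet can be verified numerically. The following Python sketch evaluates the Dirichlet density Γ[Σα_k]/ΠΓ[α_k] · Πλ_k^(α_k − 1) and compares the K = 2 case with the beta density. The function names and test values are illustrative, and the expected-value formula E[λ_k] = α_k/Σ_j α_j used in `dirichlet_mean` is the standard result rather than something stated explicitly in the text.

```python
import math

def dirichlet_pdf(lam, alpha) -> float:
    """Dirichlet density at a point lam strictly inside the simplex."""
    if len(lam) != len(alpha):
        raise ValueError("lam and alpha must have the same length")
    if any(a <= 0 for a in alpha):
        raise ValueError("all alpha_k must be positive")
    if any(l <= 0 for l in lam) or abs(sum(lam) - 1.0) > 1e-9:
        raise ValueError("lam must be positive and sum to one")
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    log_kernel = sum((a - 1) * math.log(l) for a, l in zip(alpha, lam))
    return math.exp(log_norm + log_kernel)

def beta_pdf(lam: float, alpha: float, beta: float) -> float:
    """Beta density, written out separately for the comparison."""
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return math.exp(log_norm + (alpha - 1) * math.log(lam)
                    + (beta - 1) * math.log(1 - lam))

def dirichlet_mean(alpha):
    """Expected values depend only on the relative sizes of the alphas."""
    total = sum(alpha)
    return [a / total for a in alpha]

two_dim = dirichlet_pdf([0.3, 0.7], [2.0, 5.0])
as_beta = beta_pdf(0.3, 2.0, 5.0)
print(two_dim, as_beta)
print(dirichlet_mean([1.0, 2.0, 3.0]), dirichlet_mean([10.0, 20.0, 30.0]))
```

The last line shows two parameter settings with the same ratios: the means agree, and only the concentration around them differs.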
3.5 Univariate normal distribution
The univariate normal or Gaussian distribution (Figure 3.5) is defined on continuous values x ∈ (−∞, ∞). In vision, it is common to ignore the fact that the intensity of a pixel is quantized and model it with the continuous normal distribution. The world state may also be described by the normal distribution. For example, the distance to an object could be represented in this way.
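The normal distribution has two parameters, the mean μ and the variance σ^{2}. The parameter μ can take any value and determines the position of the peak. The parameter σ^{2} takes only positive values and determines the width of the distribution. The normal distribution is defined as

$$Pr(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left[-0.5\,\frac{(x-\mu)^{2}}{\sigma^{2}}\right], \tag{3.11}$$

and we will abbreviate this by writing

$$Pr(x) = \text{Norm}_{x}[\mu,\sigma^{2}]. \tag{3.12}$$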
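To make the roles of μ and σ² concrete, the following Python sketch evaluates the normal density (2πσ²)^(−1/2) exp[−0.5 (x − μ)²/σ²], compares the peak height for a small and a large variance, and checks numerically that the density integrates to one. The function names, the integration range of ten standard deviations, and the parameter values are illustrative assumptions.

```python
import math

def normal_pdf(x: float, mu: float, sigma2: float) -> float:
    """Univariate normal density with mean mu and variance sigma2 > 0."""
    if sigma2 <= 0:
        raise ValueError("sigma2 must be positive")
    return math.exp(-0.5 * (x - mu) ** 2 / sigma2) / math.sqrt(2 * math.pi * sigma2)

def total_mass(mu: float, sigma2: float, n: int = 20_000) -> float:
    """Midpoint-rule integral over mu +/- 10 standard deviations."""
    s = math.sqrt(sigma2)
    lo, h = mu - 10 * s, 20 * s / n
    return h * sum(normal_pdf(lo + (i + 0.5) * h, mu, sigma2) for i in range(n))

peak_narrow = normal_pdf(0.0, 0.0, 0.5)   # small variance: tall, narrow peak
peak_wide = normal_pdf(0.0, 0.0, 2.0)     # large variance: wider and flatter
mass = total_mass(1.5, 2.0)

print(peak_narrow, peak_wide, mass)
```

The peak height at x = μ is 1/√(2πσ²), so increasing the variance lowers the peak while the total area stays at one.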
3.6 Normal-scaled inverse gamma distribution
The normal-scaled inverse gamma distribution (Figure 3.6) is defined over a pair of continuous values μ, σ^{2}, the first of which can take any value and the second of which is constrained to be positive. As such it can define a distribution over the mean and variance parameters of the normal distribution.
Figure 3.5 The univariate normal distribution is defined on x ∈ ℝ and has two parameters {μ, σ^{2}}. The mean parameter μ determines the expected value and the variance σ^{2} determines the concentration about the mean so that as σ^{2} increases, the distribution becomes wider and flatter.
Figure 3.6 The normal-scaled inverse gamma distribution defines a probability distribution over bivariate continuous values μ, σ^{2} where μ ∈ (−∞, ∞) and σ^{2} ∈ (0, ∞). a) Distribution with parameters [α, β, γ, δ] = [1, 1, 1, 0]. b) Varying α. c) Varying β. d) Varying γ. e) Varying δ.
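The normal-scaled inverse gamma has four parameters α, β, γ, δ where α, β, and γ are positive real numbers but δ can take any value. It has pdf:

$$Pr(\mu,\sigma^{2}) = \frac{\sqrt{\gamma}}{\sigma\sqrt{2\pi}}\,\frac{\beta^{\alpha}}{\Gamma[\alpha]}\left(\frac{1}{\sigma^{2}}\right)^{\alpha+1}\exp\left[-\frac{2\beta+\gamma(\delta-\mu)^{2}}{2\sigma^{2}}\right], \tag{3.13}$$

or for short

$$Pr(\mu,\sigma^{2}) = \text{NormInvGam}_{\mu,\sigma^{2}}[\alpha,\beta,\gamma,\delta]. \tag{3.14}$$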
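One way to build intuition for the normal-scaled inverse gamma is to draw samples from it. The sketch below relies on a standard factorization that is not spelled out in the text and should be read as an assumption of this example: the density can be written as an inverse gamma distribution over σ² with shape α and scale β, multiplied by a normal distribution over μ with mean δ and variance σ²/γ. The function names and parameter values are illustrative.

```python
import math
import random

def nig_pdf(mu, sigma2, alpha, beta, gamma, delta) -> float:
    """Normal-scaled inverse gamma density over (mu, sigma2)."""
    if sigma2 <= 0:
        return 0.0  # the variance is constrained to be positive
    log_p = (0.5 * math.log(gamma) - 0.5 * math.log(2 * math.pi * sigma2)
             + alpha * math.log(beta) - math.lgamma(alpha)
             - (alpha + 1) * math.log(sigma2)
             - (2 * beta + gamma * (delta - mu) ** 2) / (2 * sigma2))
    return math.exp(log_p)

def nig_sample(alpha, beta, gamma, delta, rng: random.Random):
    """Assumed factorization: sigma2 ~ InvGamma(alpha, beta),
    then mu ~ Normal(delta, sigma2 / gamma)."""
    # gammavariate takes (shape, scale); scale 1/beta gives rate beta.
    precision = rng.gammavariate(alpha, 1.0 / beta)
    sigma2 = 1.0 / precision
    mu = rng.gauss(delta, math.sqrt(sigma2 / gamma))
    return mu, sigma2

rng = random.Random(1)
alpha, beta, gamma, delta = 3.0, 2.0, 2.0, 1.0   # illustrative values
draws = [nig_sample(alpha, beta, gamma, delta, rng) for _ in range(20_000)]
mean_mu = sum(m for m, _ in draws) / len(draws)
mean_sigma2 = sum(s for _, s in draws) / len(draws)

print(mean_mu, mean_sigma2, nig_pdf(0.4, 0.8, alpha, beta, gamma, delta))
```

Each draw is a (mean, variance) pair, so every sample defines one univariate normal distribution. With these values the sampled means should average close to δ and the sampled variances close to β/(α − 1).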
3.7 Multivariate normal distribution
The multivariate normal or Gaussian distribution models D-dimensional variables x where each of the D elements x_{1}…x_{D} is continuous and lies in the range (−∞, +∞) (Figure 3.7). As such the univariate normal distribution is a special case of the multivariate normal where the number of elements D is one. In machine vision the multivariate normal might model the joint distribution of the intensities of D pixels within a region of the image. The state of the world might also be described by this distribution. For example, the multivariate normal might describe the joint uncertainty in the 3D position (x, y, z) of an object in the scene.
Figure 3.7 The multivariate normal distribution models D-dimensional variables x = [x_{1}…x_{D}]^{T} where each dimension x_{d} is continuous and real. It is defined by a D × 1 vector μ defining the mean of the distribution and a D × D covariance matrix Σ which determines the shape. The iso-contours of the distribution are ellipsoids where the center of the ellipsoid is determined by μ and the shape by Σ. This figure depicts a bivariate distribution, where the covariance is illustrated by drawing one of these ellipsoids.
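The multivariate normal distribution has two parameters: the mean μ and covariance Σ. The mean μ is a D × 1 vector that describes the mean of the distribution. The covariance Σ is a symmetric D × D positive definite matrix so that z^{T}Σz is positive for any nonzero real vector z. The probability density function has the following form

$$Pr(\mathbf{x}) = \frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}}\exp\left[-0.5\,(\mathbf{x}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right], \tag{3.15}$$

or for short

$$Pr(\mathbf{x}) = \text{Norm}_{\mathbf{x}}[\boldsymbol{\mu},\boldsymbol{\Sigma}]. \tag{3.16}$$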
The multivariate normal distribution will be used extensively throughout this book, and we devote the whole of Chapter 5 to describing its properties.
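The multivariate normal density, (2π)^(−D/2) |Σ|^(−1/2) exp[−0.5 (x − μ)ᵀ Σ⁻¹ (x − μ)], can be evaluated without forming Σ⁻¹ explicitly. The sketch below is one possible pure-Python approach based on a Cholesky factorization Σ = LLᵀ, which also fails cleanly when Σ is not positive definite. The use of Cholesky, the function names, and the example values are choices made for this illustration rather than something prescribed by the text.

```python
import math

def cholesky(S):
    """Lower-triangular L with S = L L^T; raises if S is not positive definite."""
    D = len(S)
    L = [[0.0] * D for _ in range(D)]
    for i in range(D):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0:
                    raise ValueError("covariance is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L

def mvn_pdf(x, mu, S) -> float:
    """Multivariate normal density for lists x, mu and nested-list covariance S."""
    D = len(x)
    L = cholesky(S)
    # Forward substitution solves L y = x - mu, so that
    # y^T y = (x - mu)^T S^{-1} (x - mu), the squared Mahalanobis distance.
    y = []
    for i in range(D):
        r = x[i] - mu[i] - sum(L[i][k] * y[k] for k in range(i))
        y.append(r / L[i][i])
    maha2 = sum(v * v for v in y)
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(D))  # log |S|
    return math.exp(-0.5 * (D * math.log(2 * math.pi) + log_det + maha2))

S = [[2.0, 1.0], [1.0, 2.0]]              # illustrative 2D covariance
p = mvn_pdf([1.0, 0.0], [0.0, 0.0], S)
p_1d = mvn_pdf([0.5], [0.0], [[2.0]])     # D = 1 reduces to the univariate normal
print(p, p_1d)
```

For this covariance |Σ| = 3 and the squared Mahalanobis distance of the point [1, 0] from the origin is 2/3, so the first value can be checked by hand against exp(−1/3)/(2π√3).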
3.8 Normal inverse Wishart distribution
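The normal inverse Wishart distribution defines a distribution over a D × 1 vector μ and a D × D positive definite matrix Σ. As such it is suitable for describing uncertainty in the parameters of a multivariate normal distribution. The normal inverse Wishart has four parameters α, Ψ, γ, δ, where α and γ are positive scalars, δ is a D × 1 vector and Ψ is a positive definite D × D matrix. It has pdf

$$Pr(\boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{\gamma^{D/2}\,|\boldsymbol{\Psi}|^{\alpha/2}\exp\left[-0.5\left(\text{Tr}\left[\boldsymbol{\Psi}\boldsymbol{\Sigma}^{-1}\right]+\gamma(\boldsymbol{\mu}-\boldsymbol{\delta})^{T}\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}-\boldsymbol{\delta})\right)\right]}{2^{\alpha D/2}\,(2\pi)^{D/2}\,|\boldsymbol{\Sigma}|^{(\alpha+D+2)/2}\,\Gamma_{D}[\alpha/2]}, \tag{3.17}$$

where Γ_{D}[•] is the multivariate gamma function and Tr[Ψ] returns the trace of the matrix Ψ (see Appendix C.2.4). For short we will write

$$Pr(\boldsymbol{\mu},\boldsymbol{\Sigma}) = \text{NorIWis}_{\boldsymbol{\mu},\boldsymbol{\Sigma}}[\alpha,\boldsymbol{\Psi},\gamma,\boldsymbol{\delta}]. \tag{3.18}$$

Figure 3.8 Sampling from 2D normal inverse Wishart distribution. a) Each sample consists of a mean vector and covariance matrix, here visualized with 2D ellipses illustrating the iso-contour of the associated Gaussian at a Mahalanobis distance of 2. b) Changing α modifies the dispersion of covariances observed. c) Changing Ψ modifies the average covariance. d) Changing γ modifies the dispersion of mean vectors observed. e) Changing δ modifies the average value of the mean vectors.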
The mathematical form of the normal inverse Wishart distribution is rather opaque. However, it is just a function that produces a positive value for any valid mean vector μ and covariance matrix Σ, such that when we integrate over all possible values of μ and Σ, the answer is one. It is hard to visualize the normal inverse Wishart, but easy to draw samples and examine them: each sample is the mean and covariance of a normal distribution (Figure 3.8).
3.9 Conjugacy
We have argued that the beta distribution can represent probabilities over the parameters of the Bernoulli. Similarly the Dirichlet defines a distribution over the parameters of the categorical, and there are analogous relationships between the normal-scaled inverse gamma and univariate normal and the normal inverse Wishart and the multivariate normal.
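These pairs were carefully chosen because they have a special relationship. In each case, the former distribution is conjugate to the latter: the beta is conjugate to the Bernoulli, the Dirichlet is conjugate to the categorical, and so on. When we multiply a distribution with its conjugate, the result is proportional to a new distribution which has the same form as the conjugate. For example

$$\text{Bern}_{x}[\lambda]\cdot\text{Beta}_{\lambda}[\alpha,\beta] = \kappa(x,\alpha,\beta)\cdot\text{Beta}_{\lambda}[\tilde{\alpha},\tilde{\beta}], \tag{3.19}$$

where κ is a scaling factor that is constant with respect to the variable of interest, λ. It is important to realize that this was not necessarily the case: if we had picked any distribution other than the beta, then this product would not have retained the same form. For this case, the relationship in Equation 3.19 is easy to prove

$$\begin{aligned} \text{Bern}_{x}[\lambda]\cdot\text{Beta}_{\lambda}[\alpha,\beta] &= \lambda^{x}(1-\lambda)^{1-x}\,\frac{\Gamma[\alpha+\beta]}{\Gamma[\alpha]\Gamma[\beta]}\,\lambda^{\alpha-1}(1-\lambda)^{\beta-1} \\ &= \frac{\Gamma[\alpha+\beta]}{\Gamma[\alpha]\Gamma[\beta]}\,\lambda^{x+\alpha-1}(1-\lambda)^{1-x+\beta-1} \\ &= \frac{\Gamma[\alpha+\beta]}{\Gamma[\alpha]\Gamma[\beta]}\,\frac{\Gamma[x+\alpha]\,\Gamma[1-x+\beta]}{\Gamma[x+\alpha+1-x+\beta]}\,\text{Beta}_{\lambda}[x+\alpha,\,1-x+\beta] \\ &= \kappa(x,\alpha,\beta)\cdot\text{Beta}_{\lambda}[\tilde{\alpha},\tilde{\beta}], \end{aligned} \tag{3.20}$$

where in the third line we have both multiplied and divided by the constant associated with Beta_{λ}[α̃, β̃], so that α̃ = x + α and β̃ = 1 − x + β.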
The conjugate relationship is important because we take products of distributions during both learning (fitting distributions) and evaluating the model (assessing the probability of new data under the fitted distribution). The conjugate relationship means that these products can both be computed neatly in closed form.
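The conjugate relation between the Bernoulli and the beta can be confirmed numerically. The Python sketch below checks, for several values of λ, that the product of a Bernoulli and a beta equals a constant κ times a new beta with parameters α + x and β + 1 − x, where κ = Γ[α+β]Γ[x+α]Γ[1−x+β] / (Γ[α]Γ[β]Γ[α+β+1]). The function names and the values α = 2, β = 3 are illustrative assumptions.

```python
import math

def bern_pmf(x: int, lam: float) -> float:
    """Bernoulli probability of x in {0, 1} with success probability lam."""
    return lam ** x * (1.0 - lam) ** (1 - x)

def beta_pdf(lam: float, alpha: float, beta: float) -> float:
    """Beta density at lam in (0, 1)."""
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return math.exp(log_norm + (alpha - 1) * math.log(lam)
                    + (beta - 1) * math.log(1 - lam))

def kappa(x: int, alpha: float, beta: float) -> float:
    """Scaling factor: ratio of the old and new beta normalizing constants."""
    return math.exp(math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
                    + math.lgamma(x + alpha) + math.lgamma(1 - x + beta)
                    - math.lgamma(alpha + beta + 1))

alpha, beta = 2.0, 3.0   # illustrative prior parameters
max_gap = 0.0
for x in (0, 1):
    for lam in (0.1, 0.5, 0.9):
        lhs = bern_pmf(x, lam) * beta_pdf(lam, alpha, beta)
        rhs = kappa(x, alpha, beta) * beta_pdf(lam, alpha + x, beta + 1 - x)
        max_gap = max(max_gap, abs(lhs - rhs))

print(max_gap, kappa(1, alpha, beta), kappa(0, alpha, beta))
```

Because Γ[z + 1] = zΓ[z], the constant simplifies to α/(α + β) when x = 1 and β/(α + β) when x = 0, which is the probability of the observation under the original beta.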
Summary
We use probability distributions to describe both the world state and the image data. We have presented four distributions (Bernoulli, categorical, univariate normal, multivariate normal) that are suited to this purpose. We also presented four other distributions (beta, Dirichlet, normal-scaled inverse gamma, normal inverse Wishart) that can be used to describe the uncertainty in the parameters of the first four; they can hence describe the uncertainty in the fitted model. These four pairs of distributions have a special relationship: each distribution from the second set is conjugate to one from the first set. As we shall see, the conjugate relationship makes it easier to fit these distributions to observed data and evaluate new data under the fitted model.
Notes
Throughout this book, I use rather esoteric terminology for discrete distributions. I distinguish between the binomial distribution (probability of getting M successes in N binary trials) and the Bernoulli distribution (the binary trial itself or probability of getting a success or failure in one trial) and talk exclusively about the latter distribution. I take a similar approach to discrete variables which can take K values. The multinomial distribution assigns a probability to observing the values {1, 2,…,K} with frequency {M_{1}, M_{2},…,M_{K}} given N trials. The categorical distribution is a special case of this with N = 1. Most other authors do not make this distinction and would term this “multinomial” as well.
A more complete list of common probability distributions and details of their properties are given in Appendix B of Bishop (2006). Further information about conjugacy can be found in Chapter 2 of Bishop (2006) or any textbook on Bayesian methods, such as that of Gelman et al. (2004). Much more information about the normal distribution is provided in Chapter 5 of this book.
Problems
3.1 Consider a variable x which is Bernoulli distributed with parameter λ. Show that the mean E[x] is λ and the variance E[(x − E[x])^{2}] is λ(1 − λ).
3.2 Calculate an expression for the mode (position of the peak) of the beta distribution with α, β > 1 in terms of the parameters α and β.
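3.3 The mean and variance of the beta distribution are given by the expressions

$$\begin{aligned} E[\lambda] &= \mu = \frac{\alpha}{\alpha+\beta} \\ E[(\lambda-\mu)^{2}] &= \sigma^{2} = \frac{\alpha\beta}{(\alpha+\beta)^{2}(\alpha+\beta+1)}. \end{aligned}$$

We may wish to choose the parameters α and β so that the distribution has a particular mean μ and variance σ^{2}. Derive suitable expressions for α and β in terms of μ and σ^{2}.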
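3.4 All of the distributions in this chapter are members of the exponential family and can be written in the form

$$Pr(x|\boldsymbol{\theta}) = a[x]\exp\left[b[\boldsymbol{\theta}]\,c[x]-d[\boldsymbol{\theta}]\right],$$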
where a[x] and c[x] are functions of the data and b[θ] and d[θ] are functions of the parameters. Find the functions a[x], b[θ], c[x] and d[θ] that allow the beta distribution to be represented in the generalized form of the exponential family.
3.5 Use integration by parts to prove that if

$$\Gamma[z] = \int_{0}^{\infty}t^{z-1}e^{-t}\,dt,$$

then

$$\Gamma[z+1] = z\,\Gamma[z].$$
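3.6 Consider a restricted family of univariate normal distributions where the variance is always 1, so that

$$Pr(x|\mu) = \frac{1}{\sqrt{2\pi}}\exp\left[-0.5\,(x-\mu)^{2}\right].$$

Show that a normal distribution over the parameter μ

$$Pr(\mu) = \text{Norm}_{\mu}[\mu_{p},\sigma_{p}^{2}]$$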
has a conjugate relationship to the restricted normal distribution.
3.7 For the normal distribution, find the functions a[x], b[θ], c[x], and d[θ] that allow it to be represented in the generalized form of the exponential family (see Problem 3.4).
3.8 Calculate an expression for the mode (position of the peak in μ, σ^{2} space) of the normal-scaled inverse gamma distribution in terms of the parameters α, β, γ, δ.
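3.9 Show that the more general form of the conjugate relation in which we multiply I Bernoulli distributions by the conjugate beta prior is given by

$$\prod_{i=1}^{I}\text{Bern}_{x_{i}}[\lambda]\cdot\text{Beta}_{\lambda}[\alpha,\beta] = \kappa\cdot\text{Beta}_{\lambda}[\tilde{\alpha},\tilde{\beta}],$$

where

$$\begin{aligned} \kappa &= \frac{\Gamma[\alpha+\beta]\,\Gamma\left[\alpha+\sum_{i}x_{i}\right]\Gamma\left[\beta+\sum_{i}(1-x_{i})\right]}{\Gamma[\alpha+\beta+I]\,\Gamma[\alpha]\,\Gamma[\beta]} \\ \tilde{\alpha} &= \alpha+\sum_{i}x_{i} \\ \tilde{\beta} &= \beta+\sum_{i}(1-x_{i}). \end{aligned}$$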
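3.10 Prove the conjugate relation

$$\prod_{i=1}^{I}\text{Cat}_{x_{i}}[\lambda_{1\ldots K}]\cdot\text{Dir}_{\lambda_{1\ldots K}}[\alpha_{1\ldots K}] = \kappa\cdot\text{Dir}_{\lambda_{1\ldots K}}[\tilde{\alpha}_{1\ldots K}],$$

where

$$\begin{aligned} \kappa &= \frac{\Gamma\left[\sum_{j=1}^{K}\alpha_{j}\right]}{\Gamma\left[I+\sum_{j=1}^{K}\alpha_{j}\right]}\cdot\frac{\prod_{j=1}^{K}\Gamma[\alpha_{j}+N_{j}]}{\prod_{j=1}^{K}\Gamma[\alpha_{j}]} \\ \tilde{\alpha}_{1\ldots K} &= [\alpha_{1}+N_{1},\,\alpha_{2}+N_{2},\ldots,\alpha_{K}+N_{K}], \end{aligned}$$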
and N_{k} is the total number of times that the variable took the value k.
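3.11 Show that the conjugate relation between the normal and normal inverse gamma is given by

$$\prod_{i=1}^{I}\text{Norm}_{x_{i}}[\mu,\sigma^{2}]\cdot\text{NormInvGam}_{\mu,\sigma^{2}}[\alpha,\beta,\gamma,\delta] = \kappa\cdot\text{NormInvGam}_{\mu,\sigma^{2}}[\tilde{\alpha},\tilde{\beta},\tilde{\gamma},\tilde{\delta}],$$

where

$$\begin{aligned} \kappa &= \frac{1}{(2\pi)^{I/2}}\,\frac{\sqrt{\gamma}\,\beta^{\alpha}}{\sqrt{\tilde{\gamma}}\,\tilde{\beta}^{\tilde{\alpha}}}\,\frac{\Gamma[\tilde{\alpha}]}{\Gamma[\alpha]} \\ \tilde{\alpha} &= \alpha+I/2 \\ \tilde{\beta} &= \frac{\sum_{i}x_{i}^{2}}{2}+\beta+\frac{\gamma\delta^{2}}{2}-\frac{\left(\gamma\delta+\sum_{i}x_{i}\right)^{2}}{2(\gamma+I)} \\ \tilde{\gamma} &= \gamma+I \\ \tilde{\delta} &= \frac{\gamma\delta+\sum_{i}x_{i}}{\gamma+I}. \end{aligned}$$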
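3.12 Show that the conjugate relationship between the multivariate normal and the normal inverse Wishart is given by

$$\prod_{i=1}^{I}\text{Norm}_{\mathbf{x}_{i}}[\boldsymbol{\mu},\boldsymbol{\Sigma}]\cdot\text{NorIWis}_{\boldsymbol{\mu},\boldsymbol{\Sigma}}[\alpha,\boldsymbol{\Psi},\gamma,\boldsymbol{\delta}] = \kappa\cdot\text{NorIWis}_{\boldsymbol{\mu},\boldsymbol{\Sigma}}[\tilde{\alpha},\tilde{\boldsymbol{\Psi}},\tilde{\gamma},\tilde{\boldsymbol{\delta}}],$$

where

$$\begin{aligned} \kappa &= \frac{1}{\pi^{ID/2}}\,\frac{\gamma^{D/2}\,|\boldsymbol{\Psi}|^{\alpha/2}\,\Gamma_{D}[\tilde{\alpha}/2]}{\tilde{\gamma}^{D/2}\,|\tilde{\boldsymbol{\Psi}}|^{\tilde{\alpha}/2}\,\Gamma_{D}[\alpha/2]} \\ \tilde{\alpha} &= \alpha+I \\ \tilde{\boldsymbol{\Psi}} &= \boldsymbol{\Psi}+\gamma\boldsymbol{\delta}\boldsymbol{\delta}^{T}+\sum_{i=1}^{I}\mathbf{x}_{i}\mathbf{x}_{i}^{T}-\frac{1}{\gamma+I}\left(\gamma\boldsymbol{\delta}+\sum_{i=1}^{I}\mathbf{x}_{i}\right)\left(\gamma\boldsymbol{\delta}+\sum_{i=1}^{I}\mathbf{x}_{i}\right)^{T} \\ \tilde{\gamma} &= \gamma+I \\ \tilde{\boldsymbol{\delta}} &= \frac{\gamma\boldsymbol{\delta}+\sum_{i=1}^{I}\mathbf{x}_{i}}{\gamma+I}. \end{aligned}$$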
You may need to use the relation Tr[zz^{T}A^{−1}] = z^{T} A ^{−1}z.
___________________________________
^{1} The gamma function is defined as Γ[z] = ∫_{0}^{∞} t^{z−1} e^{−t} dt and is closely related to factorials, so that for positive integers Γ[z] = (z − 1)! and Γ[z + 1] = zΓ[z].