Computer Vision: Models, Learning, and Inference (2012)

Part I

Probability

Chapter 4

Fitting probability models

This chapter concerns fitting probability models to data {xi}Ii=1. This process is referred to as learning because we learn about the parameters θ of the model.1 It also concerns calculating the probability of a new datum x* under the resulting model. This is known as evaluating the predictive distribution. We consider three methods: maximum likelihood, maximum a posteriori, and the Bayesian approach.

4.1 Maximum likelihood

As the name suggests, the maximum likelihood (ML) method finds the set of parameters equation under which the data {xi}Ii=1 are most likely. To calculate the likelihood function Pr(xi|θ) at a single data point xi, we simply evaluate the probability density function at xi. Assuming each data point was drawn independently from the distribution, the likelihood function Pr(x1…I|θ) for a set of points is the product of the individual likelihoods. Hence, the ML estimate of the parameters is

images

where argmaxθ f[θ] returns the value of θ that maximizes the argument f[θ].

To evaluate the predictive distribution for a new data point x* (compute the probability that x* belongs to the fitted model), we simply evaluate the probability density function Pr(x*|images) using the ML fitted parameters images.

4.2 Maximum a posteriori

In maximum a posteriori (MAP) fitting, we introduce prior information about the parameters θ. From previous experience we may know something about the possible parameter values. For example, in a time-sequence the values of the parameters at time t tell us a lot about the possible values at time t+1, and this information would be encoded in the prior distribution.

As the name suggests, maximum a posteriori estimation maximizes the posterior probability Pr(θ|x1…I) of the parameters

images

where we have used Bayes’ rule between the first two lines and subsequently assumed independence. In fact, we can discard the denominator as it is constant with respect to the parameters and so does not affect the position of the maximum, and we get

images

Comparing this to the maximum likelihood criterion (Equation 4.1), we see that it is identical except for the additional prior term; maximum likelihood is a special case of maximum a posteriori where the prior is uninformative.

The predictive density (probability of a new datum x* under the fitted model) is again calculated by evaluating the pdf Pr(x*|images) using the new parameters.

4.3 The Bayesian approach

In the Bayesian approach we stop trying to estimate single fixed values (point estimates) of the parameters θ and admit what is obvious; there may be many values of the parameters that are compatible with the data. We compute a probability distribution Pr(θ|x1…I) over the parameters θ based on data {xi}Ii=1 using Bayes’ rule so that

images

Evaluating the predictive distribution is more difficult for the Bayesian case since we have not estimated a single model but have instead found a probability distribution over possible models. Hence, we calculate

images

which can be interpreted as follows: the term Pr(x*|θ) is the prediction for a given value of θ. So, the integral can be thought of as a weighted sum of the predictions given by different parameters θ, where the weighting is determined by the posterior probability distribution Pr(θ|x1…I) over the parameters (representing our confidence that different parameter values are correct).

The predictive density calculations for the Bayesian, MAP, and ML cases can be unified if we consider the ML and MAP estimates to be special probability distributions over the parameters where all of the density is at images. More formally, we can consider them as delta functions centered at images. A delta function δ[z] is a function that integrates to one, and that returns zero everywhere except at z = 0. We can now write

images

which is exactly the calculation we originally prescribed: we simply evaluate the probability of the data under the model with the estimated parameters.

4.4 Worked example 1: Univariate normal

To illustrate the above ideas, we will consider fitting a univariate normal model to scalar data {xi}Ii=1. Recall that the univariate normal model has pdf

images

and has two parameters, the mean μ and the variance σ2. Let us generate I independent data points {xi}Ii=1 from a univariate normal with μ = 1 and σ2 = 1. Our goal is to reestimate these parameters from the data.

4.4.1 Maximum likelihood estimation

The likelihood Pr(x1…I|μσ2) of the parameters {μσ2} for observed data {xi}Ii=1 is computed by evaluating the pdf for each data point separately and taking the product:

informalequation

images

Figure 4.1 Maximum likelihood fitting. The likelihood of the parameters for a single datapoint is the height of the pdf evaluated at that point (blue vertical lines). The likelihood of a set of independently sampled data is the product of the individual likelihoods. a) The likelihood for this normal distribution is low because the large variance means the height of the pdf is low everywhere. b) The likelihood for this normal distribution is even lower as the left-most datum is very unlikely under the model. c) The maximum likelihood solution is the set of parameters for which the likelihood is maximized.

images

Figure 4.2 The likelihood function for a fixed set of observed data is a function of the mean μ and variance σ2 parameters. The plot shows that there are many parameter settings which might plausibly be responsible for the ten data points from Figure 4.1. A sensible choice for the “best” parameter setting is the maximum likelihood solution (green cross), which corresponds to the maximum of this function.

images

Obviously, the likelihood for some sets of parameters {μσ2} will be higher than others (Figure 4.1), and it is possible to visualize this as a 2D function of the mean μ and variance σ2 (Figure 4.2). The maximum likelihood solution imagesimages will occur at the peak of this surface so that

images

In principle we can maximize this by taking the derivative of Equation 4.8 with respect to μ and σ2, equating the result to zero and solving. In practice, however, the resulting equations are messy. To simplify things, we work instead with the logarithm of this expression (the log likelihood, L). Since the logarithm is a monotonic function (Figure 4.3), the position of the maximum in the transformed function remains the same. Algebraically, the logarithm turns the product of the likelihoods of the individual data points into a sum and so decouples the contribution of each. The ML parameters can now be calculated as

images

To maximize, we differentiate this log likelihood L with respect to μ and equate the result to zero

images

images

Figure 4.3 The logarithm is a monotonic transformation. If one point is higher than another, then it will also be higher after transformation by the logarithmic function. It follows that if we transform the surface in Figure 4.2 through the logarithmic function, the maximum will remain in the same position.

and rearranging, we see that

images

By a similar process, the expression for the variance can be shown to be

images

These expressions are hardly surprising, but the same idea can be used to estimate parameters in other distributions where the results are less familiar.

Figure 4.1 shows a set of data points and three possible fits to the data. The mean of the maximum likelihood fit is the mean of the data. The ML fit is neither too narrow (giving very low probabilities to the furthest data points from the mean) nor too wide (resulting in a flat distribution and giving low probability to all points).

Least squares fitting

As an aside, we note that many texts discuss fitting in terms of least squares. Consider fitting just the mean parameter μ of the normal distribution using maximum likelihood. Manipulating the cost function so that

images

leads to a formulation where we minimize the sum of squares. In other words, least squares fitting is equivalent to fitting the mean parameter of a normal distribution using the maximum likelihood method.

4.4.2 Maximum a posteriori estimation

Returning to the main thread, we will now demonstrate maximum a posteriori fitting of the parameters of the normal distribution. The cost function becomes

images

where we have chosen normal inverse gamma prior with parameters αβγδ (Figure 4.4) as this is conjugate to the normal distribution. The expression for the prior is

images

The posterior distribution is proportional to the product of the likelihood and the prior (Figure 4.5), and has the highest density in regions that both agree with the data and were a priori plausible.

Like the ML case, it is easier to maximize the logarithm of Equation 4.15:

images

images

Figure 4.4 Prior over normal parameters. a) A normal inverse gamma with αβγ = 1 and δ = 0 gives a broad prior distribution over univariate normal parameters. The magenta cross indicates the peak of this prior distribution. The blue crosses are five samples randomly drawn from the distribution. b) The peak and the samples can be visualized by plotting the normal distributions that they represent.

images

Figure 4.5 MAP inference for normal parameters. a) The likelihood function is multiplied by b) the prior probability to give a new function c) that is proportional to the posterior distribution. The maximum a posteriori (MAP) solution (cyan cross) is found at the peak of the posterior distribution. It lies between the maximum likelihood (ML) solution (green cross) and the maximum of the prior distribution (MP, magenta cross).

To find the MAP parameters, we substitute in the expressions, differentiate with respect to μ and σ, equate to zero, and rearrange to give

images

The formula for the mean can be more easily understood if we write it as

images

This is a weighted sum of two terms. The first term is the data mean images and is weighted by the number of training examples I. The second term is δ, the value of μ favored by the prior, and is weighted by γ.

This gives some insight into the behavior of the MAP estimate (Figure 4.6). With a large amount of data, the first term dominates, and the MAP estimate images is very close to the data mean (and the ML estimate). With intermediate amounts of data, images is a weighted sum of the prediction from the data and the prediction from the prior. With no data at all, the estimate is completely governed by the prior. The hyperparameter (parameter of the prior) γ controls the concentration of the prior with respect to μ and determines the extent of its influence. Similar conclusions can be drawn about the MAP estimate of the variance.

Where there is a single data point (Figure 4.6e–f), the data tells us nothing about the variance and the maximum likelihood estimate images2 is actually zero; the best fit is an infinitely thin and infinitely tall normal distribution centered on the one data point. This is unrealistic, not least because it accords the datum an infinite likelihood. However, MAP estimation is still valid as the prior ensures that sensible parameter values are chosen.

images

Figure 4.6 Maximum a posteriori estimation. a) Posterior distribution over parameters μ and σ2. MAP solution (cyan cross) lies between ML (green cross) and the peak of the prior (purple cross). b) Normal distributions corresponding to MAP solution, ML solution and peak of prior. c–d) With fewer data points, the prior has a greater effect on the final solution. e–f) With only one data point, the maximum likelihood solution cannot be computed (you cannot calculate the variance of a single point). However, the MAP solution can still be calculated.

4.4.3 The Bayesian approach

In the Bayesian approach, we calculate a posterior distribution Pr(μσ2|x1…I) over possible parameter values using Bayes’ rule,

images

where we have used the conjugate relationship between likelihood and prior (Section 3.9) and κ is the associated constant. The product of the normal likelihood and normal inverse gamma prior creates a posterior over μ and σ2, which is a new normal inverse gamma distribution and can be shown to have parameters

images

images

Figure 4.7 Bayesian predictions. a) Posterior probability distribution over parameters. b) Samples from posterior probability distribution correspond to normal distributions. c) The predictive distribution for the Bayesian case is the average of an infinite set of samples. Alternately, we can think of choosing the parameters from a uniform distribution and computing a weighted average where the weights correspond to the posterior distribution.

Note that the posterior (left-hand side of Equation 4.20) must be a valid probability distribution and sum to one, so the constant κ from the conjugate product and the denominator from the right-hand side must exactly cancel to give

images

Now we see the major advantage of using a conjugate prior: we are guaranteed a closed form expression for the posterior distribution over the parameters.

This posterior distribution represents the relative plausibility of various parameter settings μ and σ2 having created the data. At the peak of the distribution is the MAP estimate, but there are many other plausible configurations (Figure 4.6).

When data are plentiful (Figure 4.6a), the parameters are well specified, and the probability distribution is concentrated. In this case, placing all of the probability mass at the MAP estimate is a good approximation to the posterior. However, when data are scarce (Figure 4.6c), many possible parameters might have explained the data and the posterior is broad. In this case approximation with a point mass is inadequate.

Predictive density

For the maximum likelihood and MAP estimates, we evaluate the predictive density (probability that a new data point x* belongs to the same model) by simply evaluating the normal pdf with the estimated parameters. For the Bayesian case, we compute a weighted average of the predictions for each possible parameter set, where the weighting is given by the posterior distribution over parameters (Figures 4.6a–c and 4.7),

images

Here we have used the conjugate relation for a second time. The integral contains a constant with respect to μ and σ2 multiplied by a probability distribution. Taking the constant outside the integral, we get

images

images

Figure 4.8 a–c) Predictive densities for MAP and Bayesian approaches with 50, 5, and 1 training examples. As the training data decreases, the Bayesian prediction becomes less certain but the MAP prediction is erroneously overconfident. d–f) This effect is even more clear on a log scale.

which follows because the integral of a pdf is one. It can be shown that the constant is given by

images

where

images

Here, we see the second advantage of using the conjugate prior; it means that the integral can be computed, and so we get a nice closed form expression for the predictive density.

Figure 4.8 shows the predictive distribution for the Bayesian and MAP cases, for varying amounts of training data. With plenty of training data, they are quite similar but as the amount of data decreases, the Bayesian predictive distribution has a significantly longer tail. This is typical of Bayesian solutions: they are more moderate (less certain) in their predictions. In the MAP case, erroneously committing to a single estimate of μ and σ2 causes overconfidence in our future predictions.

images

Figure 4.9 a) Categorical probability distribution over six discrete values with parameters {λk}6k=1 where ∑6k=1 λk=1. This could be the relative probability of a biased die landing on its six sides. b) Fifteen observations {xi}Ii=1 randomly sampled from this distribution. We denote the number of times category k was observed by Nk so that here the total observations ∑6k=1 Nk = 15.

4.5 Worked example 2: Categorical distribution

As a second example, we consider discrete data {xi}Ii=1 where xi ∈ {1,2,…,6} (Figure 4.9). This could represent observed rolls of a die with unknown bias. We will describe the data using a categorical distribution (normalized histogram) where

images

For the ML and MAP techniques, we estimate the six parameters {λk}6k=1. For the Bayesian approach, we compute a probability distribution over the parameters.

4.5.1 Maximum Likelihood

To find the maximum likelihood solution, we maximize the product of the likelihoods for each individual data point with respect to the parameters λ1…6.

images

where Nk is the total number of times we observed bin k in the training data. As before, it is easier to maximize the log probability, and we use the criterion

images

where the second term uses the Lagrange multiplier ν to enforce the constraint on the parameters ∑6k=1λk = 1. We differentiate L with respect to λk and ν, set the derivatives equal to zero, and solve for λk to obtain

images

In other words, λk is the proportion of times that we observed bin k.

4.5.2 Maximum a posteriori

To find the maximum a posteriori solution we need to define a prior. We choose the Dirichlet distribution as it is conjugate to the categorical likelihood. This prior over the six categorical parameters is hard to visualize but samples can be drawn and examined (Figure 4.10a–e). The MAP solution is given by

images

which is again subject to the constraint that ∑6k=1 λk = 1. As in the maximum likelihood case, this constraint is enforced using a Lagrange multiplier. The MAP estimate of the parameters can be shown to be

images

where Nk is the number of times that observation k occurred in the training data. Note that if all the values αk are set to one, the prior is uniform and this expression reverts to the maximum likelihood solution (Equation 4.30).

4.5.3 Bayesian Approach

In the Bayesian approach, we calculate a posterior over the parameters

images

where imagesk = Nk + αk. We have again exploited the conjugate relationship to yield a posterior distribution with the same form as the prior. The constant κ must again cancel with the denominator to ensure a valid probability distribution on the left-hand side. Samples from this distribution are shown in Figure 4.10f–j.

Predictive Density

For the ML and MAP estimates we compute the predictive density (probability that a new data point x* belongs to the same model) by simply evaluating the categorical pdf with the estimated parameters. With the uniform prior (α1…6 = 1) the MAP and ML predictions are identical (Figure 4.11a) and both are exactly proportional to the frequencies of the observed data.

images

Figure 4.10 a–e) Five samples drawn from Dirichlet prior with hyperparameters α1…6 = 1. This defines a uniform prior, so each sample looks like a random unstructured probability distribution. f–j) Five samples from Dirichlet posterior. The distribution favors histograms where bin three is larger and bin four is small as suggested by the data.

For the Bayesian case, we compute a weighted average of the predictions for each possible parameter set, where the weighting is given by the posterior distribution over parameters so that

images

Here, we have again exploited the conjugate relationship to yield a constant multiplied by a probability distribution and the integral is simply the constant as the integral of the pdf is one. For this case, it can be shown that

images

This is illustrated in Figure 4.11b. It is notable that once more the Bayesian predictive density is less confident than the ML/MAP solutions. In particular, it does not allot zero probability to observing x* = 4 despite the fact that this value was never observed in the training data. This is sensible; just because we have not drawn a 4 in 15 observations does not imply that it is inconceivable that we will ever see one. We may have just been unlucky. The Bayesian approach takes this into account and allots this category a small amount of probability.

images

Figure 4.11 Predictive distributions with α1…6 = 1 for a) maximum likelihood / maximum a posteriori approaches and b) Bayesian approach. The ML/MAP approaches predict the same distribution that exactly follows the data frequencies. The Bayesian approach predicts a more moderate distribution and allots some probability to the case x = 4 despite having seen no training examples in this category.

Summary

We presented three ways to fit a probability distribution to data and to predict the probability of new points. Of the three methods discussed, the Bayesian approach is the most desirable. Here it is not necessary to find a point estimate of the (uncertain) parameters, and so errors are not introduced because this point estimate is inaccurate.

However, the Bayesian approach is only tractable when we have a conjugate prior, which makes it easy to calculate the posterior distribution over the parameters Pr(θ|x1…I) and also to evaluate the integral in the predictive density. When this is not the case, we will usually have to rely on maximum a posteriori estimates. Maximum likelihood estimates can be thought of as a special case of maximum a posteriori estimates in which the prior is uninformative.

Notes

For more information about the Bayesian approach to fitting distributions consult Chapter 3 of Gelman et al. (2004). More information about Bayesian model selection (Problem 4.6), including an impassioned argument for its superiority as a method of hypothesis testing can be found in Mackay (2003).

Problems

4.1 Show that the maximum likelihood solution for the variance σ2 of the normal distribution is given by

informalequation

4.2 Show that the MAP solution for the mean μ and variance σ2 of the normal distribution are given by

informalequation

when we use the conjugate normal-scaled inverse gamma prior

informalequation

4.3 Taking Equation 4.29 as a starting point, show that the maximum likelihood parameters for the categorical distribution are given by

informalequation

where Nk is the number of times that category K was observed in the training data.

4.4 Show that the MAP estimate for the parameters {λ}Kk=1 of the categorical distribution is given by

informalequation

under the assumption of a Dirichlet prior with hyperparameters {αk}Kk=1.The terms Nk again indicate the number of times that category k was observed in the training data.

4.5 The denominator of Bayes’ rule

informalequation

is known as the evidence. It is a measure of how well the distribution fits regardless of the particular values of the parameters. Find an expression for the evidence term for (i) the normal distribution and (ii) the categorical distribution assuming conjugate priors in each case.

4.6 The evidence term can be used to compare models. Consider two sets of data inlinefigure1 = { 0.1,−0.5,0.2,0.7} and inlinefigure2 = {1.1,2.0,1.4,2.3}. Let us pose the question of whether these two data sets came from the same normal distribution or from two different normal distributions.

Let model M1 denote the case where all of the data comes from the one normal distribution. The evidence for this model is

informalequation

where θ = {μσ2} contains the parameters of this normal distribution. Similarly, we will let M2 denote the case where the two sets of data belong to different normal distributions

informalequation

where θ1 = {μ1σ21} and θ2 = {μ2σ22}.

Now it is possible to compare the probability of the data under each of these two models using Bayes’ rule

informalequation

Use this expression to compute the posterior probability that the two datasets came from the same underlying normal distribution. You may assume normal-scaled inverse gamma priors over θθ1, and θ2 with parameters α = 1, β = 1, γ = 1, δ = 0.

Note that this is (roughly) a Bayesian version of the two-sample t-test, but it is much neater – we get a posterior probability distribution over the two hypotheses rather than the potentially misleading p value of the t-test. The process of comparing evidence terms in this way is known as Bayesian model selection or the evidence framework. It is rather clever in that two normal distributions fitted with maximum likelihood will always explain the data better than one; the additional parameters simply make the model more flexible. However because we have marginalized these parameters away here, it is valid to compare these models in the Bayesian case.

4.7 In the Bernoulli distribution, the likelihood Pr(x1…I|λ) of the data {xi}Ii=1 given parameter λ where xi ∈ {0, 1} is

informalequation

Find an expression for the maximum likelihood estimate of the parameter λ.

4.8 Find an expression for the MAP estimate of the Bernoulli parameter λ (see Problem 4.7) assuming a beta distributed prior

informalequation

4.9 Now consider the Bayesian approach to fitting Bernoulli data, using a beta distributed prior. Find expressions for (i) the posterior probability distribution over the Bernoulli parameters given observed data {xi}Ii=1 and (ii) the predictive distribution for new data x*.

4.10 Staying with the Bernoulli distribution, consider observing data 0,0,0,0 from four trials. Assuming a uniform beta prior (α = 1, β = 1), compute the predictive distribution using the (i) maximum likelihood, (ii) maximum a posteriori, and (iii) Bayesian approaches. Comment on the results.

___________________________________

1 Here we adopt the notation θ to represent a generic set of parameters when we have not specified the particular probability model.