Computer Vision: Models, Learning, and Inference (2012)

Part III

Connecting local models

The models in chapters 6–9 describe the relationship between a set of measurements and the world state. They work well when the measurements and the world state are both low dimensional. However, there are many situations where this is not the case, and these models are unsuitable.

For example, consider the semantic image labeling problem in which we wish to assign a label that denotes the object class to each pixel in the image. For example, in a road scene we might wish to label pixels as ‘road’, ‘sky’, ‘car’, ‘tree’, ‘building’ or ‘other’. For an image with N = 10000 pixels, this means we need to build a model relating the 10000 measured RGB triples to 6^10000 possible world states. None of the models discussed so far can cope with this challenge: the number of parameters involved (and hence the amount of training data and the computational requirements of the learning and inference algorithms) is far beyond what current machines can handle.

One possible solution to this problem would be to build a set of independent local models: for example, we could build models that relate each pixel label separately to the nearby RGB data. However, this is not ideal as the image may be locally ambiguous. For example, a small blue image patch might result from a variety of semantically different classes: sky, water, a car door or a person’s clothing. In general, it is insufficient to build independent local models.

The solution to this problem is to build local models that are connected to one another. Consider again the semantic labeling example: given the whole image, we can see that when the image patch is blue and is found above trees and mountains and alongside similar patches across the top of the image, then the correct class is probably sky. Hence, to solve this problem, we still model the relationship between the label and its local image region, but we also connect these models so that nearby elements can help to disambiguate one another.

In chapter 10 we introduce the idea of conditional independence, which is a way of characterizing redundancies in the model (i.e., the lack of direct dependence between variables). We show how conditional independence relations can be visualized with graphical models. We distinguish between directed and undirected graphical models. In chapter 11, we discuss models in which the local units are combined together to form chains or trees. In chapter 12, we extend this to the case where they have more general connections.

Chapter 10

Graphical models

The previous chapters discussed models that relate the observed measurements to some aspect of the world that we wish to estimate. In each case, this relationship depended on a set of parameters and for each model we presented a learning algorithm that estimated these parameters.

Unfortunately, the utility of these models is limited because every element of the model depends on every other. For example, in generative models we model the joint probability of the observations and the world state. In many problems both of these quantities may be high-dimensional. Consequently, the number of parameters required to characterize their joint density accurately is very large. Discriminative models suffer from the same pathology: if every element of the world state depends on every element of the data, a large number of parameters will be required to characterize this relationship. In practice, the required amount of training data and the computational burden of learning and inference reach impractical levels.

The solution to this problem is to reduce the dependencies between variables in the model by identifying (or asserting) some degree of redundancy. To this end, we introduce the idea of conditional independence, which is a way of characterizing these redundancies. We then introduce graphical models which are graph-based representations of the conditional independence relations. We discuss two different types of graphical models – directed and undirected – and we consider the implications for learning, inference, and drawing samples.

This chapter does not develop specific models or discuss vision applications. The goal is to provide the theoretical background for the models in subsequent chapters. We will illustrate the ideas with probability distributions where the constituent variables are discrete; however, almost all of the ideas transfer directly to the continuous case.

10.1 Conditional independence

When we first discussed probability distributions, we introduced the notion of independence (Section 2.6). Two variables x1 and x2 are independent if their joint probability distribution factorizes as Pr(x1,x2)=Pr(x1)Pr(x2). In layman’s terms, one variable provides no information about the other if they are independent.

With more than two random variables, independence relations become more complex. The variable x1 is said to be conditionally independent of variable x3 given variable x2 when x1 and x3 are independent for fixed x2 (Figure 10.1). In mathematical terms, we have


Figure 10.1 Conditional independence. a) Joint pdf of three discrete variables x1,x2,x3, which take 4, 3, and 2 possible values, respectively. All 24 probability values sum to one. b) Marginalizing, we see that variables x1 and x2 are dependent; the conditional distribution of x1 is different for different values of x2 (the elements in each row are not in the same proportions) and vice versa. c) Variables x1 and x3 are also dependent. d) Variables x2 and x3 are also dependent. e–g) However, x1 and x3 are conditionally independent given x2. For fixed x2, x1 tells us nothing more about x3 and vice versa.

Pr(x1|x2,x3) = Pr(x1|x2).

Note that conditional independence relations are always symmetric; if x1 is conditionally independent of x3 given x2, then it is also true that x3 is conditionally independent of x1 given x2.

Confusingly, the conditional independence of x1 and x3 given x2 does not mean that x1 and x3 are themselves independent. It merely implies that if we know variable x2, then x1 provides no further information about x3 and vice versa. One way that this can occur is in a chain of events: if event x1 causes event x2 and x2 causes x3, then the dependence of x3 on x1 might be entirely mediated by x2.

Now consider decomposing the joint probability distribution Pr(x1,x2,x3) into the product of conditional probabilities. When x1 is independent of x3 given x2, we find that

Pr(x1,x2,x3) = Pr(x1)Pr(x2|x1)Pr(x3|x1,x2) = Pr(x1)Pr(x2|x1)Pr(x3|x2).

The conditional independence relation means that the probability distribution factorizes in a certain way (and is hence redundant). This redundancy implies that we can describe the distribution with fewer parameters and so working with models with large numbers of variables becomes more tractable.
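This factorization can be checked numerically. The sketch below (assuming NumPy is available; all table entries are invented for illustration) builds a joint distribution over three binary variables from the factorization Pr(x1)Pr(x2|x1)Pr(x3|x2) and confirms that Pr(x1|x2,x3) is the same for every value of x3:

```python
import numpy as np

# Invented conditional tables for a three-variable chain.
p1 = np.array([0.4, 0.6])                    # Pr(x1)
p2_g1 = np.array([[0.7, 0.3],                # Pr(x2|x1), row = value of x1
                  [0.2, 0.8]])
p3_g2 = np.array([[0.9, 0.1],                # Pr(x3|x2), row = value of x2
                  [0.5, 0.5]])

# joint[i, j, k] = Pr(x1=i) Pr(x2=j|x1=i) Pr(x3=k|x2=j)
joint = p1[:, None, None] * p2_g1[:, :, None] * p3_g2[None, :, :]
assert np.isclose(joint.sum(), 1.0)

# Conditioning on x2 and x3, the Pr(x3|x2) factor cancels top and bottom,
# so the result is identical for both values of x3.
cond = joint / joint.sum(axis=0, keepdims=True)    # Pr(x1|x2,x3)
assert np.allclose(cond[:, :, 0], cond[:, :, 1])
```

The cancellation that makes the final assertion pass is exactly the algebraic step worked through in Section 10.2.2.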

Throughout this chapter, we will explore the relationship between factorization of the distribution and conditional independence relations. To this end, we will introduce graphical models. These are graph-based representations that make both the factorization and the conditional independence relations easy to establish. In this book we will consider two different types of graphical model – directed and undirected graphical models – each of which corresponds to a different type of factorization.


Figure 10.2 Example 1. A directed graphical model has one node per term in the factorization of the joint probability distribution. A node xn with no incoming connections represents the term Pr(xn). A node xn with incoming connections xpa[n] represents the term Pr(xn|xpa[n]). Variable xn is conditionally independent of all of the others given its Markov blanket. This comprises its parents, its children, and other parents of its children. For example, the Markov blanket for variable x8 is indicated by the shaded region.

10.2 Directed graphical models

A directed graphical model or Bayesian network represents the factorization of the joint probability distribution into a product of conditional distributions, and takes the form of a directed acyclic graph (DAG), so that

Pr(x1…N) = ∏Nn=1 Pr(xn|xpa[n]),

where {xn}Nn=1 represent the constituent variables of the joint distribution and the function pa[n] returns the indices of variables that are parents of variable xn.

We can visualize the factorization as a directed graphical model (Figure 10.2) by adding one node per random variable and drawing an arrow to each variable xn from each of its parents xpa[n]. This directed graphical model should never contain cycles. If it does, then the original factorization was not a valid probability distribution.

To retrieve the factorization from the graphical model, we introduce one factorization term per variable in the graph. If variable xn is independent of all others (has no parents), then we write Pr(xn). Otherwise, we write Pr(xn|xpa[n]) where the parents xpa[n] consist of the set of variables with arrows that point to xn.
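As a small illustration of evaluating such a factorization, the sketch below computes a joint probability as the product of one conditional table per node; the three-node chain and every table entry are invented for illustration:

```python
# Evaluating Pr(x1...N) = prod_n Pr(xn | x_pa[n]) for a toy chain.
parents = {1: (), 2: (1,), 3: (2,)}
tables = {                         # tables[n][parent values][value of x_n]
    1: {(): [0.5, 0.5]},
    2: {(0,): [0.8, 0.2], (1,): [0.3, 0.7]},
    3: {(0,): [0.9, 0.1], (1,): [0.4, 0.6]},
}

def joint(x):
    """x maps node index to value; returns Pr(x) under the factorization."""
    p = 1.0
    for n, pa in parents.items():
        p *= tables[n][tuple(x[q] for q in pa)][x[n]]
    return p
```

Because each table is normalized, the products over all assignments sum to one, so this really is a valid joint distribution.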

10.2.1 Example 1

The graphical model in Figure 10.2 represents the factorization

images

The graphical model (or factorization) implies a set of independence and conditional independence relations between the variables. Some statements about these relations can be made based on a superficial look at the graph. First, if there is no directed path between two variables following the arrow directions and they have no common ancestors, then they are independent. So, variable x3 in Figure 10.2 is independent of all of the other variables, and variables x1 and x2 are independent of each other. Variables x4 and x5 are not independent as they share an ancestor. Second, any variable is conditionally independent of all the other variables given its parents, children, and the other parents of its children (its Markov blanket). So, for example, variable x8 in Figure 10.2 is conditionally independent of the remaining variables given those in the shaded area.

For vision applications, these rules are usually sufficient to gain an understanding of the main properties of a graphical model. However, occasionally we may wish to test whether one arbitrary set of nodes is independent of another given a third. This is not easily established by looking at the graph, but can be tested using the following criterion:

The variables in set A are conditionally independent of those in set B given set C if all routes from A to B are blocked. A route is blocked at a node if (i) this node is in C and the arrows meet head to tail or tail to tail, or (ii) neither this node nor any of its descendants are in C and the arrows meet head to head.

See Koller and Friedman (2009) for more details of why this is the case.
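For small graphs, this blocking criterion can be applied by brute force: enumerate every route between the two nodes in the undirected skeleton and test each intermediate node against rules (i) and (ii). The sketch below (function names our own) handles single nodes rather than sets, and is practical only for small graphs:

```python
def descendants(node, children):
    """All nodes reachable from `node` by following arrows forwards."""
    out, stack = set(), [node]
    while stack:
        n = stack.pop()
        for c in children.get(n, []):
            if c not in out:
                out.add(c)
                stack.append(c)
    return out

def d_separated(a, b, given, edges):
    """True if every route from node a to node b is blocked given `given`.
    `edges` is a list of (parent, child) pairs defining the DAG."""
    children, parents, neigh = {}, {}, {}
    for p, c in edges:
        children.setdefault(p, []).append(c)
        parents.setdefault(c, []).append(p)
        neigh.setdefault(p, set()).add(c)
        neigh.setdefault(c, set()).add(p)
    # enumerate all simple routes from a to b in the undirected skeleton
    routes, stack = [], [[a]]
    while stack:
        path = stack.pop()
        if path[-1] == b:
            routes.append(path)
            continue
        for n in neigh.get(path[-1], ()):
            if n not in path:
                stack.append(path + [n])
    for route in routes:
        blocked = False
        for i in range(1, len(route) - 1):
            prev, node, nxt = route[i - 1], route[i], route[i + 1]
            head_to_head = (prev in parents.get(node, []) and
                            nxt in parents.get(node, []))
            if head_to_head:
                # rule (ii): blocked if neither node nor descendants observed
                if node not in given and not (descendants(node, children) & given):
                    blocked = True
            elif node in given:       # rule (i): head-to-tail or tail-to-tail
                blocked = True
        if not blocked:
            return False              # an unblocked route exists
    return True
```

On the chain x1→x2→x3, conditioning on x2 blocks the route; on the collider x1→x3←x2, the two parents are independent until we condition on x3, which unblocks the route.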

10.2.2 Example 2

Figure 10.3 tells us that

Pr(x1,x2,x3) = Pr(x1)Pr(x2|x1)Pr(x3|x2).

In other words, this is the graphical model corresponding to the distribution in Figure 10.1.

If we condition on x2, the only route from x1 to x3 is blocked at x2 (the arrows meet head to tail here) and so x1 must be conditionally independent of x3 given x2. We could have reached the same conclusion by noticing that the Markov blanket for variable x1 is just variable x2.

In this case, it is easy to prove this conditional independence relation algebraically. Writing out the conditional probability of x1 given x2 and x3

Pr(x1|x2,x3) = Pr(x1,x2,x3)/Pr(x2,x3) = Pr(x1)Pr(x2|x1)Pr(x3|x2) / ∑x1 Pr(x1)Pr(x2|x1)Pr(x3|x2) = Pr(x1)Pr(x2|x1) / ∑x1 Pr(x1)Pr(x2|x1),


Figure 10.3 Example 2. Directed graphical model relating variables x1,x2,x3 from Figure 10.1. This model implies that the joint probability can be broken down as Pr(x1,x2,x3) = Pr(x1)Pr(x2|x1)Pr(x3|x2).


Figure 10.4 Example 3. Graphical models for a) mixture of Gaussians, b) t-distribution, and c) factor analysis. A node (black circle) represents a random variable. In a graphical model a bullet • represents a variable whose value is considered to be fixed. Each variable may be repeated many times, and this is indicated by a plate (blue rectangle) where the number of copies is indicated in the lower right corner. For example, in a) there are I data examples {xi}Ii=1 and I hidden variables {hi}Ii=1. Similarly, there are K sets of parameters {μk,Σk}Kk=1, but just one weight vector λ.

we see that the final expression does not depend on x3 and so we deduce that x1 is conditionally independent of x3 given x2 as required.

Notice that the factorized distribution is more efficient to represent than the full version. The original distribution Pr(x1,x2,x3) (Figure 10.1a) contains 4 × 3 × 2 = 24 entries. However, the terms Pr(x1), Pr(x2|x1), and Pr(x3|x2) contain 4, 4 × 3 = 12, and 3 × 2 = 6 entries, respectively, giving a total of 22 entries. In this case, this is not a dramatic reduction, but in more practical situations it would be. For example, if each variable took ten possible values, the full joint distribution would have 10×10×10 = 1000 values, but the factorized distribution would have only 10+100+100=210 values. For even larger systems, this can make a huge saving. One way to think about conditional independence relations is to consider them as redundancies in the full joint probability distribution.
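A quick sketch of this parameter counting, for the case where all three variables take the same number K of values:

```python
# Storage for the full joint table versus the factorized form
# Pr(x1)Pr(x2|x1)Pr(x3|x2) when every variable takes K values.
def table_sizes(K):
    full = K ** 3                    # one entry per joint state
    factored = K + K * K + K * K     # Pr(x1), Pr(x2|x1), Pr(x3|x2)
    return full, factored

assert table_sizes(10) == (1000, 210)   # the example from the text
```

With longer chains the gap widens further: the full table grows as K^N while the factorized form grows only linearly in N.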

10.2.3 Example 3

Finally, in Figure 10.4 we present graphical models for the mixture of Gaussians, t-distribution and factor analysis models from Chapter 7. These depictions immediately demonstrate that these models have very similar structures.

They also add several new features to the graphical representation. First, they include multidimensional variables. Second, they include variables that are considered to be fixed, and these are marked by a bullet •. We condition on the fixed variables, but do not define a probability distribution over them. Figure 10.4c depicts the factorization Pr(hi,xi) = Pr(hi)Pr(xi|hi,μ,Φ,Σ).

Finally, we have also used plate notation. A plate is depicted as a rectangle with a number in the corner. It indicates that the quantities inside the rectangle should be repeated the given number of times. For example, in Figure 10.4c there are I copies {xi,hi}Ii=1 of the variables x and h but only one set of parameters μ, Φ, and Σ.

10.2.4 Summary

To summarize, we can think about the structure of the joint probability distribution in three ways. First, we can consider the way that the probability distribution factorizes. Second, we can examine the directed graphical model. Third, we can think about the conditional independence relations.

There is a one-to-one mapping between directed graphical models (acyclic directed graphs of conditional probability relations) and factorizations. However, the relationship between the graphical model (or factorization) and the conditional independence relations is more complicated. A directed graphical model (or its equivalent factorization) determines a set of conditional independence relations. However, as we shall see later in this chapter, there are some sets of conditional independence relations that cannot be represented by directed graphical models.

10.3 Undirected graphical models

In this section we introduce a second family of graphical models. Undirected graphical models represent probability distributions over variables {xn}Nn=1 that take the form of a product of potential functions φc[x1…N] so that

Pr(x1…N) = (1/Z) ∏Cc=1 φc[x1…N],     (10.7)

where the potential function φc[x1…N] always returns a positive number. Since the probability increases when φc[x1…N] increases, each of these functions modulates the tendency for the variables x1…N to take certain values. The probability is greatest where all of the functions φ1…C return high values. However, it should be emphasized that potential functions are not the same as conditional probabilities, and there is not usually a clear way to map from one to the other.

The term Z is known as the partition function and normalizes the product of these positive functions so that the total probability is one. In the discrete case, it would be computed as

Z = ∑x1 ∑x2 … ∑xN ∏Cc=1 φc[x1…N].

For realistically sized systems, this sum will be intractable; we will not be able to compute Z and hence will only be able to compute the overall probability up to an unknown scale factor.

We can equivalently write Equation 10.7 as

Pr(x1…N) = (1/Z) exp[−∑Cc=1 ψc[x1…N]],

where ψc[x1…N] = −log[φc[x1…N]]. When written in this form, the probability is referred to as a Gibbs distribution. The terms ψc[x1…N] are functions that may return any real number and can be thought of as representing a cost for every combination of labels x1…N. As the cost increases, the probability decreases. The total cost ∑Cc=1 ψc[x1…N] is sometimes known as the energy, and the process of fitting the model (increasing the probability) is hence sometimes termed energy minimization.
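The following sketch makes the cost/energy/probability relationship concrete for three binary variables with two invented cost functions, one per clique. Z is found by brute-force enumeration, which is feasible only because the state space here is tiny:

```python
import itertools
import math

def psi1(x1, x2):                 # cost on clique {x1, x2} (invented)
    return 0.0 if x1 == x2 else 1.0

def psi2(x2, x3):                 # cost on clique {x2, x3} (invented)
    return 0.0 if x2 == x3 else 2.0

def energy(x):                    # total cost of a configuration
    return psi1(x[0], x[1]) + psi2(x[1], x[2])

states = list(itertools.product([0, 1], repeat=3))
Z = sum(math.exp(-energy(s)) for s in states)   # partition function

def prob(x):                      # lower energy -> higher probability
    return math.exp(-energy(x)) / Z
```

Configurations with agreeing neighbors incur zero cost and so receive the highest probability under this model.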

When each potential function φc[•] (or alternatively each cost function ψc[•]) addresses all of the variables x1…N, the undirected graphical model is known as a product of experts. However, in computer vision it is more common for each potential function to operate on a subset of the variables Sc ⊂ {xn}Nn=1. These subsets are called cliques, and it is the choice of these cliques that determines the conditional independence relations. Denoting the cth clique by Sc, we can rewrite Equation 10.7 as

Pr(x1…N) = (1/Z) ∏Cc=1 φc[Sc].

In other words, the probability distribution is factorized into a product of terms, each of which only depends on a subset of variables. In this situation, the model is sometimes referred to as a Markov random field.

To visualize the undirected graphical model, we draw one node per random variable. Then, for every clique Sc we draw a connection from every member variable xi ∈ Sc to every other member variable.

Moving in the opposite direction, we can take a graphical model and establish the underlying factorization using the following method. We add one term to the factorization per maximal clique (see Figure 10.6). A maximal clique is a fully connected subset of nodes (i.e., a subset where every node is connected to every other) where it is not possible to add another node and remain fully connected.
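For a small graph, maximal cliques can be found by brute force: enumerate every fully connected subset of nodes and discard those contained in a larger clique. The sketch below does this for the graph of Figure 10.6, with the edge list inferred from the caption and the independence relations quoted in the text:

```python
from itertools import combinations

# Edges of the five-node graph (inferred; see lead-in above).
nodes = {1, 2, 3, 4, 5}
edges = {frozenset(e) for e in [(1, 2), (1, 3), (2, 3), (2, 4), (3, 5), (4, 5)]}

def is_clique(subset):
    """Every pair of nodes in the subset must be connected."""
    return all(frozenset(p) in edges for p in combinations(subset, 2))

cliques = [set(s) for r in range(1, len(nodes) + 1)
           for s in combinations(sorted(nodes), r) if is_clique(s)]
# a clique is maximal if no strictly larger clique contains it
maximal = [c for c in cliques if not any(c < d for d in cliques)]
```

This enumeration is exponential in the number of nodes, so it is only a pedagogical device; it does, however, recover exactly the four maximal cliques discussed in Section 10.3.2.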

It is much easier to establish the conditional independence relations from an undirected graphical model than for directed graphical models. They can be found using the following property:

One set of nodes is conditionally independent of another given a third if the third set separates them (i.e., it blocks every path from the first set to the second).

It follows that a node is conditionally independent of all other nodes given its set of immediate neighbors, and so these neighbors form the Markov blanket.

10.3.1 Example 1

Consider the graphical model in Figure 10.5. This represents the factorization

Pr(x1,x2,x3) = (1/Z) φ1[x1,x2] φ2[x2,x3].

We can immediately see that variable x1 is conditionally independent of variable x3 given x2 because x2 separates the other two variables: it blocks the path from x1 to x3. In this case, the conditional independence relation is easy to prove:


Figure 10.5 Example 1. Undirected graphical model relating variables x1, x2, and x3. This model implies that the joint probability can be factorized as Pr(x1,x2,x3) = (1/Z)φ1[x1,x2]φ2[x2,x3].


Figure 10.6 Example 2. Undirected graphical model representing variables {xi}5i=1. The associated probability distribution factorizes into a product of one potential function per maximal clique. The clique S45 = {x4,x5} is a maximal clique as there is no other node that we can add that connects to every node in the clique. The clique S23 = {x2,x3} is not a maximal clique as it is possible to add node x1, and all three nodes in the new clique are connected to each other.

Pr(x1|x2,x3) = Pr(x1,x2,x3)/Pr(x2,x3) = φ1[x1,x2]φ2[x2,x3] / ∑x1 φ1[x1,x2]φ2[x2,x3] = φ1[x1,x2] / ∑x1 φ1[x1,x2].

The final expression does not depend on x3 and so we conclude that x1 is conditionally independent of x3 given x2.
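As with the directed case, this cancellation can be verified numerically. The sketch below (assuming NumPy) fills the two potentials of Figure 10.5 with arbitrary positive numbers and checks that Pr(x1|x2,x3) does not change with x3:

```python
import numpy as np

# Arbitrary positive potential values for binary x1, x2, x3.
phi1 = np.array([[2.0, 1.0],      # phi1[x1, x2]
                 [1.0, 3.0]])
phi2 = np.array([[4.0, 1.0],      # phi2[x2, x3]
                 [1.0, 2.0]])

unnorm = phi1[:, :, None] * phi2[None, :, :]   # phi1[x1,x2] * phi2[x2,x3]
joint = unnorm / unnorm.sum()                  # dividing by Z

cond = joint / joint.sum(axis=0, keepdims=True)   # Pr(x1|x2,x3)
assert np.allclose(cond[:, :, 0], cond[:, :, 1])  # no x3 dependence
```

The φ2 factor cancels in the ratio, exactly as in the algebra above, so the assertion holds for any choice of positive potentials.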

10.3.2 Example 2

Consider the graphical model in Figure 10.6. There are four maximal cliques in this graph, and so it represents the factorization

Pr(x1…5) = (1/Z) φ1[x1,x2,x3] φ2[x2,x4] φ3[x3,x5] φ4[x4,x5].

We can deduce various conditional independence relations from the graphical representation. For example, variable x1 is conditionally independent of variables x4 and x5 given x2 and x3, and variable x5 is independent of variables x1 and x2 given x3 and x4, and so on.

Note also that the factorization

Pr(x1…5) = (1/Z) φ1[x1,x2] φ2[x2,x3] φ3[x1,x3] φ4[x2,x4] φ5[x3,x5] φ6[x4,x5]

creates the same graphical model: there is a many-to-one mapping from factorizations to undirected graphical models (as opposed to the one-to-one mapping for directed graphical models). When we compute a factorization from the graphical model based on the maximal cliques we do so in a conservative way. It is possible that there are further redundancies which were not made explicit by the undirected graphical model.


Figure 10.7 Directed versus undirected graphical models. a) Directed graphical model with three nodes. There is only one conditional independence relation implied by this model: the node x3 is the Markov blanket of node x2 (shaded area) and so x2 ╨ x1|x3, where the notation ╨ can be read as “is independent of”. b) This undirected graphical model implies the same conditional independence relation. c) Second directed graphical model. The relation x2 ╨ x1|x3 is no longer true, but x1 and x2 are independent if we don’t condition on x3 so we can write x2 ╨ x1. There is no undirected graphical model with three variables that has this pattern of independence and conditional independence.

10.4 Comparing directed and undirected graphical models

In Sections 10.2 and 10.3 we have discussed directed and undirected graphical models, respectively. Each graphical model represents a factorization of the probability distribution. We have presented methods to extract the conditional independence relations from each type of graphical model. The purpose of this section is to argue that these representations are not equivalent. There are patterns of conditional independence that can be represented by directed graphical models but not undirected graphical models and vice versa.

Figures 10.7a–b show an undirected and directed graphical model that do represent the same conditional independence relations. However, Figure 10.7c shows a directed graphical model for which there is no equivalent undirected graphical model. There is simply no way to induce the same pattern of independence and conditional independence with an undirected graphical model.

Conversely, Figure 10.8a shows an undirected graphical model that induces a pattern of conditional independence relations that cannot be replicated by any directed graphical model. Figure 10.8b shows a directed graphical model that is close, but still not equivalent; the Markov blanket of x2 is different in each model and so are its conditional independence relations.

We conclude from this brief argument that directed and undirected graphical models do not represent the same subset of independence and conditional independence relations, and so we cannot eliminate one or the other from our consideration. In fact, there are other patterns of conditional independence that cannot be represented by either type of model. However, these will not be considered in this book. For further information concerning the families of distributions that can be represented by different types of graphical model, consult Barber (2012) or Koller and Friedman (2009).

10.5 Graphical models in computer vision

We will now introduce a number of common vision models and look at their associated graphical models. We will discuss each of these in detail in subsequent chapters. However, it is instructive to see them together.


Figure 10.8 Directed versus undirected models. a) This undirected graphical model induces two conditional independence relations. However, there is no equivalent directed graphical model that produces the same pattern. b) This directed graphical model also induces two conditional independence relations, but they are not the same. In both cases, the shaded region represents the Markov blanket of variable x2.

Figure 10.9a shows the graphical model for a hidden Markov model or HMM. We observe a sequence of measurements {xn}Nn=1 each of which tells us something about the corresponding discrete world state {wn}Nn=1. Adjacent world states are connected together so that the previous world state influences the current one and potentially resolves situations where the measurements are ambiguous. A prototypical application would be tracking sequences of sign language gestures (Figure 10.9b). There is information at each frame about which gesture is present, but it may be ambiguous. However, we can impose prior knowledge that certain signs are more likely to follow others using the HMM and get an improved result.

Figure 10.9c represents a Markov tree. Again we observe a number of measurements, each of which provides information about the associated discrete world state. However, the world states are now connected in a tree structure. A prototypical application would be human body fitting (Figure 10.9d) where each unknown world state represents a body part. The parts of the body naturally have a tree structure and so it makes sense to build a model that exploits this.

Figure 10.9e illustrates the use of a Markov random field or MRF as a prior. The MRF here describes the world state as a grid of undirected connections. Each node might correspond to a pixel. There is also a measurement variable associated with each world state variable. These pairs are connected with directed links, so overall this is a mixed model (partly directed and partly undirected). A prototypical application of an MRF in vision would be for semantic labeling (Figure 10.9f). The measurements constitute the RGB values at each position. The world state at each pixel is a discrete variable that determines the class of object present (i.e., cow versus grass). The Markov random field prior ties together all of the individual classifiers to help yield a solution that makes global sense.

Finally, Figure 10.9g shows the Kalman filter. This has the same graphical model as the hidden Markov model but in this case the world state is continuous rather than discrete. A prototypical application of the Kalman filter is for tracking objects through a time sequence (Figure 10.9h). At each time, we might want to know the 2D position, size, and orientation of the hand. However, in a given frame the measurements might be poor: the frame may be blurred or the object may be temporarily occluded. By building a model that connects the estimates from adjacent frames, we can increase the robustness to these factors; earlier frames can resolve the uncertainty in the current ambiguous frame.


Figure 10.9 Commonly used graphical models in computer vision. a) Hidden Markov model. b) One possible application of the HMM is interpreting sign language sequences. The choice of sign at time n depends on the sign at time n – 1. c) Markov tree. d) An example application is fitting a tree-structured body model. e) Markov random field prior with independent observations. f) The MRF is often used as a prior distribution in semantic labeling tasks. Here the goal is to infer a binary label at each pixel determining whether it belongs to the cow or the grass. g) Kalman filter. h) An example application is tracking an object through a sequence. The Kalman filter has the same graphical model as the HMM, but the unknown quantities are continuous as opposed to discrete.

Notice that all of these graphical models have directed links from the world w to the data x that indicate a relationship of the form Pr(x|w). Hence, they all construct a probability distribution over the data and are generative models. We will also consider discriminative models, but, historically speaking, generative models of this kind have been more important. Each model is quite sparsely connected: each data variable x connects only to one world state variable w, and each world state variable connects to only a few others. The result of this is that there are many conditional independence relations in the model. We will exploit these redundancies to develop efficient algorithms for learning and inference.

We will return to all of these models later in the book. We investigate the hidden Markov model and the Markov tree in Chapter 11. We discuss the Markov random field in Chapter 12, and we will present the Kalman filter in Chapter 19. The remaining part of this chapter answers two questions: (i) How can we perform inference when there are a large number of unknown world variables? (ii) What are the implications of using a directed graphical model versus an undirected one?

10.6 Inference in models with many unknowns

We will now consider inference in these models. Ideally, we would compute the full posterior distribution Pr(w1…N|x1…N) using Bayes’ rule. However, the world state in the preceding models is generally of much higher dimension than previously considered in this book, and this makes inference challenging.

For example, consider the space of world states in the HMM example. If we are given 1000 frames of video and there are 500 common signs in the sign language, then there are 500^1000 possible states. It is clearly not practical to compute and store the posterior probability associated with each. Even when the world states are continuous, computing and storing the parameters of a high-dimensional probability model is still problematic. Fortunately, there are alternative approaches to inference, which we now consider in turn.

10.6.1 Finding the MAP solution

One obvious possibility is to find the maximum a posteriori solution:

ŵ1…N = argmaxw1…N [Pr(w1…N|x1…N)].

This is still far from trivial. The number of world states is extremely large so we cannot possibly explore every one and take the maximum. We must employ intelligent and efficient algorithms that exploit the redundancies in the distribution to find the correct solution where possible. However, as we shall see, for some models there is no known polynomial algorithm to find the MAP solution.
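Exhaustive search makes the scaling problem vivid. The sketch below finds the MAP state of a tiny model by visiting all K^N states; the posterior function is an invented stand-in for Pr(w1…N|x1…N) that favors smooth label sequences (an unnormalized score is fine for the argmax):

```python
import itertools

def posterior(w):
    """Invented unnormalized posterior: reward equal adjacent labels."""
    score = 1.0
    for a, b in zip(w, w[1:]):
        score *= 2.0 if a == b else 1.0
    return score

K, N = 3, 4                       # K labels, N world state variables
w_map = max(itertools.product(range(K), repeat=N), key=posterior)
```

Here the loop visits only 3^4 = 81 states; at the scale of the HMM example above (500^1000 states) it is hopeless, which is why the efficient algorithms of the following chapters matter.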

10.6.2 Finding the marginal posterior distribution

An alternative strategy is to find the marginal posterior distributions:

Pr(wn|x1…N) = ∑w1 … ∑wn−1 ∑wn+1 … ∑wN Pr(w1…N|x1…N).

Since each of these distributions is over a single label, it is not implausible to compute and store each one separately. Obviously it is not possible to do this by directly computing the (extremely large) joint distribution and marginalizing it directly. We must use algorithms that exploit the conditional independence relations in the distribution to efficiently compute the marginals.

10.6.3 Maximum marginals

If we want a single estimate of the world state, we could return the maximum values of the marginal distributions, giving the criterion

ŵn = argmaxwn [Pr(wn|x1…N)].

This produces estimates of each world state that are individually most probable, but which may not reflect the joint statistics. For example, world state wn = 4 might be the most probable value for the nth world state and wm = 6 might be the most probable value for the mth world state, but it could be that the posterior probability for this configuration is zero: although the states are individually probable, they never co-occur (Figure 10.10).
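The disagreement between the two criteria is easy to reproduce. In the sketch below (assuming NumPy; the posterior table is invented), the individually most probable values form a pair that is less probable than the MAP state:

```python
import numpy as np

# Invented two-variable posterior where max marginals and MAP disagree.
post = np.array([[0.4, 0.0],      # post[w1, w2]
                 [0.3, 0.3]])

w_map = np.unravel_index(post.argmax(), post.shape)            # joint peak
w_mm = (post.sum(axis=1).argmax(), post.sum(axis=0).argmax())  # per-variable

assert tuple(w_map) == (0, 0)
assert w_mm == (1, 0)
assert post[w_mm] < post[tuple(w_map)]   # individually likely, jointly worse
```

Here the marginals favor w1 = 1 (probability 0.6) and w2 = 0 (probability 0.7), yet the pair (1, 0) has joint probability 0.3, less than the MAP configuration (0, 0) at 0.4.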

10.6.4 Sampling the posterior

For some models, it is intractable to compute either the MAP solution or the marginal distributions. One possibility in this circumstance is to draw samples from the posterior distribution. Methods based on sampling the posterior would fall under the more general category of approximate inference as they do not normally return the true answer.


Figure 10.10 MAP solution versus max marginals solution. The main figure shows the joint posterior distribution Pr(w1,w2|x1,x2). The MAP solution is at the peak of this distribution at w1 = 2, w2 = 4 (highlighted in green). The figure also shows the two marginal distributions Pr(w1|x1,x2) and Pr(w2|x1,x2). The maximum marginals solution is computed by individually finding the maximum of each marginal distribution, which gives the solution w1 = 4, w2 = 2 (highlighted in red). For this distribution, this is very unrepresentative; although these labels are individually likely, they rarely co-occur and the joint posterior for this combination has low probability.

Having drawn a number of samples from the posterior, we can approximate the posterior probability distribution as a mixture of delta functions, with one delta function at each of the sample positions. Alternatively, we could estimate marginal statistics such as the mean and variance from the sampled values, or select the sample with the highest posterior probability as an estimate of the MAP state. This latter approach has the advantage of being consistent with the full posterior distribution (unlike maximum marginals), even though we cannot be sure that we have found the true maximum.

An alternative way to compute a point estimate from a set of samples from the posterior is to compute the empirical max-marginals. We estimate the marginal probability distributions by looking at the marginal statistics of the samples. In other words, we consider one variable wn at a time and look at the distribution of different values observed. For a discrete distribution, this information is captured in a histogram. For a continuous distribution, we could fit a univariate model such as a normal distribution to these values to summarize them.
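For the discrete case, this histogram-based estimate takes only a few lines; the sketch below (function name and sample data are invented for illustration) takes an array of posterior samples and returns the empirical max-marginals point estimate:

```python
import numpy as np

def empirical_max_marginals(samples, n_states):
    """Point estimate from posterior samples: mode of each variable's histogram.

    samples: integer array of shape (n_samples, n_variables)."""
    estimate = []
    for n in range(samples.shape[1]):                            # one variable at a time
        counts = np.bincount(samples[:, n], minlength=n_states)  # marginal histogram
        estimate.append(int(np.argmax(counts)))                  # most frequent state
    return estimate

# Invented samples for two discrete variables, as if drawn from a posterior.
samples = np.array([[0, 1], [0, 2], [0, 1], [3, 1]])
print(empirical_max_marginals(samples, 5))   # [0, 1]
```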

10.7 Drawing samples

We have seen that some of the approaches to inference require us to draw samples from the posterior distribution. We will now discuss how to do this for both directed and undirected models and we will see that this is generally more straightforward in directed models.

10.7.1 Sampling from directed graphical models

Directed graphical models take the form of directed acyclic graphs of conditional probability relations that have the following algebraic form:

\[
\Pr(x_{1\ldots N}) = \prod_{n=1}^{N} \Pr(x_n \,|\, x_{\mathrm{pa}[n]})
\]

It is relatively easy to sample from a directed graphical model using a technique known as ancestral sampling. The idea is to sample each variable in the network in turn, where the order is such that all parents of a node are sampled before the node itself. At each node, we condition on the observed sample values of the parents.

The simplest way to understand this is with an example. Consider the directed graphical model in Figure 10.11 whose probability distribution factorizes as

\[
\Pr(x_1,\ldots,x_5) = \Pr(x_1)\,\Pr(x_2|x_1)\,\Pr(x_3|x_2,x_4)\,\Pr(x_4|x_1,x_2)\,\Pr(x_5|x_3) \tag{10.18}
\]

To sample from this model we first identify x1 as a node with no parents and draw a sample from the distribution Pr(x1). Let us say the observed sample at x1 took value α1.

We now turn to the remaining nodes. Node x2 is the only node in the network where all of the parents have been processed, and so we turn our attention here next. We draw a sample from the distribution Pr(x2|x1=α1) to yield a value α2. We are not yet ready to sample x3 as not all of its parents have been sampled, but we can sample x4 from the distribution Pr(x4|x1=α1,x2=α2) to yield the value α4. Continuing this process, we draw x3 from Pr(x3|x2=α2,x4=α4) to yield α3, and finally x5 from Pr(x5|x3=α3) to yield α5.


Figure 10.11 Ancestral sampling. We work our way through the graph in an order (red number) that guarantees that the parents of every node are visited before the node itself. At each step we draw a sample conditioned on the values of the samples at the parents. This is guaranteed to produce a valid sample from the full joint distribution.

The resulting vector x* = [α1,α2,α3,α4,α5] is guaranteed to be a valid sample from the full joint distribution Pr(x1,x2,x3,x4,x5). An equivalent way to think about this algorithm is to consider it as working through the terms in the factorized joint distribution (right-hand side of Equation 10.18), sampling from each in turn conditioned on the previously sampled values.
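As a concrete sketch, the fragment below performs ancestral sampling for this five-variable factorization, assuming binary variables; every conditional probability table here is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_cat(p):
    """Draw one sample from a categorical distribution with probability vector p."""
    return int(rng.choice(len(p), p=p))

# Invented binary tables for Pr(x1)Pr(x2|x1)Pr(x4|x1,x2)Pr(x3|x2,x4)Pr(x5|x3).
p1 = [0.6, 0.4]
p2 = {0: [0.7, 0.3], 1: [0.2, 0.8]}                  # Pr(x2 | x1)
p4 = {(0, 0): [0.5, 0.5], (0, 1): [0.1, 0.9],
      (1, 0): [0.8, 0.2], (1, 1): [0.3, 0.7]}        # Pr(x4 | x1, x2)
p3 = {(0, 0): [0.9, 0.1], (0, 1): [0.4, 0.6],
      (1, 0): [0.6, 0.4], (1, 1): [0.2, 0.8]}        # Pr(x3 | x2, x4)
p5 = {0: [0.5, 0.5], 1: [0.1, 0.9]}                  # Pr(x5 | x3)

# Ancestral sampling: visit nodes so that parents are always sampled first.
a1 = sample_cat(p1)
a2 = sample_cat(p2[a1])
a4 = sample_cat(p4[(a1, a2)])
a3 = sample_cat(p3[(a2, a4)])
a5 = sample_cat(p5[a3])
print([a1, a2, a3, a4, a5])   # one valid sample from the joint distribution
```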

10.7.2 Sampling from undirected graphical models

Unfortunately, it is much harder to draw samples from undirected models except in certain special cases (e.g., where the variables are continuous and Gaussian or where the graph structure takes the form of a tree). In general graphs, we cannot use ancestral sampling because (i) there is no sense in which any variable is a parent to any other, so we don't know which order to sample in, and (ii) the potential functions φ[•] in the factorization are not probability distributions anyway.

One way to generate samples from any complex high-dimensional probability distribution is to use a Markov chain Monte Carlo (MCMC) method. The principle is to generate a series (chain) of samples from the distribution, so that each sample depends directly on the previous one (hence “Markov”). However, the generation of the sample is not completely deterministic (hence “Monte Carlo”).

One of the simplest MCMC methods is Gibbs sampling, which proceeds as follows. First, we randomly choose the initial state x[0] using any method. We generate the next sample in the chain x[1] by updating each of the N dimensions x1,x2,…,xN in turn (in any order). To update the nth dimension xn, we fix the other N − 1 dimensions and draw from the conditional distribution Pr(xn|x1…N\n), where x1…N\n denotes all of the N variables x1,x2,…,xN except xn. Having modified every dimension in this way, we obtain the second sample in the chain. This idea is illustrated in Figure 10.12 for the multivariate normal distribution.

When this procedure is repeated a very large number of times, so that the initial conditions are forgotten, a sample from this sequence can be considered as a draw from the distribution Pr(x1…N). Although this is not immediately obvious (and a proof is beyond the scope of this book), the procedure does clearly have some sensible properties: since we are sampling from the conditional probability distribution at each variable, we are more likely to change the current value to one that has an overall higher probability. However, the stochastic update rule provides the possibility of (infrequently) visiting less probable regions of the space.


Figure 10.12 Gibbs sampling. We generate a chain of samples by cycling through each dimension in turn and drawing a sample from the conditional distribution of that dimension given that the others are fixed. a) For this 2D multivariate normal distribution, we start at a random position x[0]. We alternately draw samples from the conditional distribution of the first dimension keeping the second fixed (horizontal changes) and the second dimension keeping the first fixed (vertical changes). For the multivariate normal, these conditional distributions are themselves normal (Section 5.5). Each time we cycle through both of the dimensions, we create a new sample x[t]. b) Many samples generated using this method.
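A minimal sketch of this procedure for a bivariate normal like that of Figure 10.12 is given below, assuming zero means, unit variances, and an illustrative correlation of 0.8; the conditional distributions used are the standard normal conditionals (Section 5.5):

```python
import numpy as np

rng = np.random.default_rng(2)

rho = 0.8   # assumed correlation of the target bivariate normal

def gibbs_bivariate_normal(n_samples, burn_in=100):
    x = rng.normal(size=2)                     # random initial state x[0]
    chain = []
    for t in range(n_samples + burn_in):
        # x[0] | x[1] ~ Norm(rho * x[1], 1 - rho^2): horizontal move in Fig. 10.12
        x[0] = rng.normal(rho * x[1], np.sqrt(1 - rho ** 2))
        # x[1] | x[0] ~ Norm(rho * x[0], 1 - rho^2): vertical move
        x[1] = rng.normal(rho * x[0], np.sqrt(1 - rho ** 2))
        if t >= burn_in:                       # discard early samples so the
            chain.append(x.copy())             # initial conditions are forgotten
    return np.array(chain)

samples = gibbs_bivariate_normal(5000)
print(np.corrcoef(samples.T)[0, 1])   # empirical correlation, close to rho
```

Successive samples in the chain are correlated with one another, which is one reason the method needs many iterations to explore the distribution.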

For undirected graphical models, the conditional distribution Pr(xn|x1…N\n) can be quite efficient to evaluate because of the conditional independence properties: variable xn is conditionally independent of the rest of the nodes given its immediate neighbors, and so computing this term only involves the immediate neighbors. However, overall this method is extremely inefficient: it requires a large amount of computational effort to generate even a single sample. Sampling from directed graphical models is far easier.

10.8 Learning

In Section 10.7 we argued that sampling from directed graphical models is considerably easier than sampling from undirected graphical models. In this section, we consider learning in each type of model and come to a similar conclusion. Note that we are not discussing the learning of the graph structure here; we are talking about learning the parameters of the model itself. For directed graphical models, these parameters would determine the conditional distributions Pr(xn|xpa[n]) and for undirected graphical models they would determine the potential functions φc[x1…N].

10.8.1 Learning in directed graphical models

Any directed graphical model can be written in the factorized form

\[
\Pr(x_{1\ldots N}|\boldsymbol{\theta}) = \prod_{n=1}^{N} \Pr(x_n \,|\, x_{\mathrm{pa}[n]}, \boldsymbol{\theta})
\]

where the conditional probability relations form a directed acyclic graph, and θ denotes the parameters of the model. For example, in the discrete distributions that we have focused on in this chapter, an individual conditional model might be

\[
\Pr(x_n \,|\, x_{\mathrm{pa}[n]}) = \mathrm{Cat}_{x_n}[\boldsymbol{\lambda}]
\]

where the parameters here are λ1,…,λK. In general, the parameters can be learned using the maximum likelihood method by finding

\[
\hat{\boldsymbol{\theta}} = \mathop{\mathrm{argmax}}_{\boldsymbol{\theta}} \prod_{i=1}^{I}\prod_{n=1}^{N} \Pr(x_{i,n} \,|\, x_{i,\mathrm{pa}[n]}, \boldsymbol{\theta})
\]

\[
\phantom{\hat{\boldsymbol{\theta}}} = \mathop{\mathrm{argmax}}_{\boldsymbol{\theta}} \sum_{i=1}^{I}\sum_{n=1}^{N} \log\!\left[\Pr(x_{i,n} \,|\, x_{i,\mathrm{pa}[n]}, \boldsymbol{\theta})\right]
\]

where xi,n represents the nth dimension of the ith training example. This criterion leads to simple learning algorithms, and often the maximum likelihood parameters can be computed in closed form.
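For categorical variables, the closed-form maximum likelihood solution amounts to normalized co-occurrence counts. The sketch below illustrates this for a single node with one parent; the data and function name are invented:

```python
import numpy as np

# Invented training data: column 0 is a parent variable, column 1 its child.
data = np.array([[0, 0], [0, 1], [0, 0], [1, 1],
                 [1, 1], [1, 0], [0, 0], [1, 1]])

def ml_conditional(data, n_parent_states, n_child_states):
    """Maximum likelihood estimate of Pr(child | parent) by counting."""
    counts = np.zeros((n_parent_states, n_child_states))
    for parent, child in data:
        counts[parent, child] += 1                      # co-occurrence counts
    return counts / counts.sum(axis=1, keepdims=True)   # normalize each row

lam = ml_conditional(data, 2, 2)
print(lam)   # row p holds the parameters of Pr(child | parent = p)
```

Each row of the result is a valid categorical distribution (the parameters λ1,…,λK for one setting of the parent), obtained without any iterative optimization.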

10.8.2 Learning in undirected graphical models

An undirected graphical model is written as

\[
\Pr(\mathbf{x}_{1\ldots I}|\boldsymbol{\theta}) = \prod_{i=1}^{I}\frac{1}{Z(\boldsymbol{\theta})}\prod_{c=1}^{C}\phi_c[\mathbf{x}_i,\boldsymbol{\theta}]
\]

where x = [x1,x2,…, xN] and we have assumed that the training samples are independent. However, in this form we must constrain the parameters to ensure that each potential function φc[•] always returns a positive number. A more practical approach is to reparameterize the undirected graphical model in the form of the Gibbs distribution,

\[
\Pr(\mathbf{x}_{1\ldots I}|\boldsymbol{\theta}) = \prod_{i=1}^{I}\frac{1}{Z(\boldsymbol{\theta})}\exp\!\left[-\sum_{c=1}^{C}\psi_c[\mathbf{x}_i,\boldsymbol{\theta}]\right] \tag{10.23}
\]

so that we do not have to worry about constraints on the parameters.

Given I training examples x1,…,xI, we aim to fit the parameters θ. Assuming that the training examples are independent, the maximum likelihood solution is

\[
\hat{\boldsymbol{\theta}} = \mathop{\mathrm{argmax}}_{\boldsymbol{\theta}} \sum_{i=1}^{I} \log\!\left[\Pr(\mathbf{x}_i|\boldsymbol{\theta})\right] = \mathop{\mathrm{argmax}}_{\boldsymbol{\theta}} \left[-\sum_{i=1}^{I}\sum_{c=1}^{C}\psi_c[\mathbf{x}_i,\boldsymbol{\theta}] - I\log Z(\boldsymbol{\theta})\right]
\]

where as usual we have taken the log to simplify the expression.

To maximize this expression we calculate the derivative of the log likelihood L with respect to the parameters θ:

\[
\frac{\partial L}{\partial \boldsymbol{\theta}} = -I\,\frac{\partial \log Z(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \;-\; \sum_{i=1}^{I}\sum_{c=1}^{C}\frac{\partial \psi_c[\mathbf{x}_i,\boldsymbol{\theta}]}{\partial \boldsymbol{\theta}}
\]

The second term is readily computable but the first term involves an intractable sum over all possible values of the variable x: we cannot compute the derivative with respect to the parameters for reasonable-sized models and so learning is difficult. Moreover, we cannot evaluate the original probability expression (Equation 10.23) as this too contains an intractable sum. Consequently, we can’t compute the derivative using finite differences either.

In short, we can neither find an algebraic solution nor use a straightforward optimization technique as we cannot compute the gradient. The best that we can do is to approximate the gradient.

Contrastive divergence

One possible solution to this problem is the contrastive divergence algorithm. This is a method for approximating the gradient of the log likelihood with respect to parameters θ for functions with the general form,

\[
\Pr(\mathbf{x}|\boldsymbol{\theta}) = \frac{1}{Z(\boldsymbol{\theta})}\,f[\mathbf{x},\boldsymbol{\theta}]
\]

where Z(θ) = ∑x f[x,θ] is the normalizing constant and the derivative of the log likelihood is

\[
\frac{\partial L}{\partial \boldsymbol{\theta}} = -I\,\frac{\partial \log Z(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} + \sum_{i=1}^{I}\frac{\partial \log f[\mathbf{x}_i,\boldsymbol{\theta}]}{\partial \boldsymbol{\theta}}
\]

The main idea behind contrastive divergence follows from some algebraic manipulation of the first term:

\[
\begin{aligned}
\frac{\partial \log Z(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}
&= \frac{1}{Z(\boldsymbol{\theta})}\,\frac{\partial Z(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \\
&= \frac{1}{Z(\boldsymbol{\theta})}\,\frac{\partial}{\partial \boldsymbol{\theta}} \sum_{\mathbf{x}} f[\mathbf{x},\boldsymbol{\theta}] \\
&= \frac{1}{Z(\boldsymbol{\theta})} \sum_{\mathbf{x}} \frac{\partial f[\mathbf{x},\boldsymbol{\theta}]}{\partial \boldsymbol{\theta}} \\
&= \sum_{\mathbf{x}} \frac{f[\mathbf{x},\boldsymbol{\theta}]}{Z(\boldsymbol{\theta})}\, \frac{\partial \log f[\mathbf{x},\boldsymbol{\theta}]}{\partial \boldsymbol{\theta}} \\
&= \sum_{\mathbf{x}} \Pr(\mathbf{x}|\boldsymbol{\theta})\, \frac{\partial \log f[\mathbf{x},\boldsymbol{\theta}]}{\partial \boldsymbol{\theta}}
 = \mathrm{E}_{\mathbf{x}}\!\left[\frac{\partial \log f[\mathbf{x},\boldsymbol{\theta}]}{\partial \boldsymbol{\theta}}\right] \qquad (10.29)
\end{aligned}
\]


Figure 10.13 The contrastive divergence algorithm changes the parameters so that the unnormalized distribution increases at the observed data points (blue crosses) but decreases at sampled data points from the model (green stars). These two components counterbalance one another and ensure that the likelihood increases. When the model fits the data, these two forces will cancel out, and the parameters will remain constant.

where we have used the relation ∂ log f[x,θ]/∂θ = (∂f[x,θ]/∂θ)/f[x,θ] between the third and fourth lines.

The final term in Equation 10.29 is the expectation of the derivative of log f[x,θ]. We cannot compute this exactly, but we can approximate it by drawing J independent samples x* from the distribution to yield

\[
\frac{\partial \log Z(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \approx \frac{1}{J}\sum_{j=1}^{J}\frac{\partial \log f[\mathbf{x}^*_j,\boldsymbol{\theta}]}{\partial \boldsymbol{\theta}}
\]

With I training examples {xi}Ii=1, the gradient of the log likelihood L is hence

\[
\frac{\partial L}{\partial \boldsymbol{\theta}} \approx \sum_{i=1}^{I}\frac{\partial \log f[\mathbf{x}_i,\boldsymbol{\theta}]}{\partial \boldsymbol{\theta}} \;-\; \frac{I}{J}\sum_{j=1}^{J}\frac{\partial \log f[\mathbf{x}^*_j,\boldsymbol{\theta}]}{\partial \boldsymbol{\theta}}
\]

A visual explanation of this expression is presented in Figure 10.13. The gradient points in a direction that (i) increases the logarithm of the unnormalized function at the data points xi but (ii) decreases the same quantity in places where the model believes the density is high (i.e., the samples xj*). When the model fits the data, these two forces cancel out, and the parameters will stop changing.

This algorithm requires us to draw samples x* from the model at each iteration of the optimization procedure in order to compute the gradient. Unfortunately, the only way to draw samples from general undirected graphical models is to use costly Markov chain Monte Carlo methods such as Gibbs sampling (Section 10.7.2), and this is impractically time consuming. In practice it has been found that even approximate samples will do: one method is to restart J = I samples at the data points at each iteration and do just a few MCMC steps. Surprisingly, this works well even with a single step. A second approach is to start with the samples from the previous iteration and perform a few MCMC steps so that the samples are free to wander without restarting. This technique is known as persistent contrastive divergence.
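The loop below sketches this idea on a deliberately tiny model: a one-parameter unnormalized density f[x,θ] = exp(−θx²), with a single Metropolis step standing in for the MCMC sampler and samples restarted at the data points each iteration (the CD-1 scheme described above). All constants and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy unnormalized model f[x, theta] = exp(-theta * x^2), so that
# d log f / d theta = -x^2.  For N(0,1) data the true theta is 0.5.
def dlogf_dtheta(x):
    return -x ** 2

def metropolis_step(x, theta, step=0.5):
    """One Metropolis MCMC step per element, targeting f[x, theta]."""
    prop = x + rng.normal(0.0, step, size=x.shape)
    log_alpha = -theta * prop ** 2 + theta * x ** 2      # log acceptance ratio
    accept = np.log(rng.uniform(size=x.shape)) < log_alpha
    return np.where(accept, prop, x)

data = rng.normal(0.0, 1.0, size=500)
theta, lr = 2.0, 0.02
for it in range(3000):
    samples = metropolis_step(data.copy(), theta)        # CD-1: restart at the data
    # Approximate gradient: raise log f at the data, lower it at the samples.
    grad = dlogf_dtheta(data).mean() - dlogf_dtheta(samples).mean()
    theta += lr * grad
print(theta)   # compare with the true value of 0.5
```

When the model fits the data, the one-step samples look like the data, the two terms cancel, and θ stops changing, mirroring the counterbalancing forces in Figure 10.13.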


Figure 10.14 a) Graphical model for Problem 10.2. b) Graphical model for Problem 10.4

Discussion

In this chapter, we introduced directed and undirected graphical models. Each represents a different type of factorization of the joint distribution. A graphical model implies a set of independence and conditional independence relations. There are some sets that can only be represented by directed graphical models, others that can only be represented by undirected graphical models, some that can be represented by both, and some that cannot be represented by either.

We presented a number of common vision models and examined their graphical models. Each had sparse connections and hence many conditional independence relations. In subsequent chapters, we will exploit these redundancies to develop efficient learning and inference algorithms. Since the world state in these models is usually very high dimensional, we discussed alternative forms of inference including maximum marginals and sampling.

Finally, we looked at the implications of choosing directed or undirected graphical models for sampling and for learning. We concluded that it is generally more straightforward to draw samples from directed graphical models. Moreover, it is also easier to learn directed graphical models. The best-known learning algorithm for general undirected graphical models requires us to draw samples, which is itself challenging.

Notes

Graphical models: For a readable introduction to graphical models, consult Jordan (2004) or Bishop (2006). For a more comprehensive overview, I would recommend Barber (2012). For an even more encyclopaedic resource, consult Koller and Friedman (2009).

Contrastive divergence: The contrastive divergence algorithm was introduced by Hinton (2002). Further information about this technique can be found in Carreira-Perpiñán and Hinton (2005) and Bengio and Delalleau (2009).

Problems

10.1 The joint probability model between variables x1,…,x7 factorizes as

images

Draw a directed graphical model relating these variables. Which variables form the Markov blanket of variable x2?

10.2 Write out the factorization corresponding to the directed graphical model in Figure 10.14a.

10.3 An undirected graphical model has the form

images

Draw the undirected graphical model that corresponds to this factorization.

10.4 Write out the factorization corresponding to the undirected graphical model in Figure 10.14b.

10.5 Consider the undirected graphical model over binary variables x1, x2, x3, x4 ∈ {0, 1} defined by

images

where the potential function φ is defined by

images

Compute the probability of each of the 16 possible states of this system.

10.6 What is the Markov blanket for each of the variables in Figures 10.7 and 10.8?

10.7 Show that the stated patterns of independence and conditional independence in Figures 10.7 and 10.8 are true.

10.8 A factor graph is a third type of graphical model that depicts the factorization of a joint probability. As usual, it contains a single node per variable, but it also contains one node per factor (usually indicated by a solid square). Each factor node is connected by undirected links to all of the variables contained in the associated term of the factorization. For example, the factor node corresponding to the term Pr(x1|x2,x3) in a directed model would connect to all three variables x1, x2, and x3. Similarly, the factor node corresponding to the term φ12[x1,x2] in an undirected model would connect variables x1 and x2. Figure 10.15 shows two examples of factor graphs.


Figure 10.15 Factor graphs contain one node (square) per factor in the joint pdf as well as one node (circle) per variable. Each factor node is connected to all of the variables that belong to that factor. This type of graphical model can distinguish between the undirected graphical models a) Pr(x1,x2,x3) = (1/Z)φ123[x1,x2,x3] and b) Pr(x1,x2,x3) = (1/Z)φ12[x1,x2]φ23[x2,x3]φ13[x1,x3].

Draw the factor graphs corresponding to the graphical models in Figures 10.7 and 10.8. You must first establish the factorized joint distribution associated with each graph.

10.9 What is the Markov blanket of variable w2 in Figure 10.9c?

10.10 What is the Markov blanket of variable w8 in Figure 10.9e?