And so, when you actually put it together, the general formula for the acceptance probability gives us this expression, which, if you look at it, you will see gives the right values both for x to x prime and for x prime to x. So that solves the problem of picking A given Q. The question then is, what about Q? How do I pick Q?
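Before turning to the choice of Q, the acceptance rule itself can be sketched in a few lines. This is a minimal sketch; `pi` and `q` are stand-ins for the unnormalized stationary density and the proposal density, and are assumptions of this illustration, not code from the lecture:

```python
import random

def mh_accept(pi, q, x, x_prime):
    """Metropolis-Hastings acceptance probability:
    A(x -> x') = min(1, pi(x') q(x' -> x) / (pi(x) q(x -> x'))).
    Assumes Q is reversible, so the denominator is nonzero
    whenever the numerator is."""
    return min(1.0, (pi(x_prime) * q(x_prime, x)) / (pi(x) * q(x, x_prime)))

def mh_step(pi, q, sample_q, x):
    """One Metropolis-Hastings transition: propose from Q, then accept or reject."""
    x_prime = sample_q(x)                      # propose x' ~ Q(x -> .)
    if random.random() < mh_accept(pi, q, x, x_prime):
        return x_prime                         # accept the move
    return x                                   # reject: stay at x
```

With a symmetric q the q-terms cancel, so a move toward higher pi is always accepted, while the reverse move is accepted with probability pi(x prime) / pi(x).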

Well, that turns out to be the $64 million question. Picking Q is an art, and it's not an easy choice in many applications. But let's think about some conditions that Q must satisfy before we think about how to pick one.

So, one condition on Q is derived simply by looking at the form of the expression for A. We can see that it involves a ratio, and if you have a ratio, then one condition is that you had better not have a denominator equal to zero.

And so, one of the things we must guarantee regarding Q, in order for A to be a valid set of acceptance probabilities, is that Q itself must be reversible in the following sense: the numerator is greater than zero if and only if the denominator is greater than zero. That is, if you can propose a step from x to x prime, you'd better be able to propose a step from x prime to x with nonzero probability, which guarantees that this ratio is always well-defined. Now, the problem with these constraints on Q is that they induce a tension in the design of Q.
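For a discrete space, that reversibility condition can be checked mechanically. A sketch, assuming `Q` is represented as a matrix of proposal probabilities (this representation is an assumption of the illustration):

```python
def q_is_reversible(Q):
    """Check that Q(x -> x') > 0 if and only if Q(x' -> x) > 0,
    so that every acceptance-probability ratio is well-defined."""
    n = len(Q)
    return all((Q[i][j] > 0) == (Q[j][i] > 0)
               for i in range(n) for j in range(n) if i != j)
```

A proposal that can step from x prime to x but never back from x to x prime fails the check.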

On the one hand, you'd like Q to be very broad. So if you were at this state over here, you'd like Q to go really far away from the current state, because the further away you can go, the faster you move around in the state space. You don't want to stay local. You don't want to make the same mistake that the Gibbs chain does by staying very near to the current assignment. You want to move around quickly, and that improves mixing.

On the other hand, look at the implications this formula has when the proposal distribution is very broad, so that one state has, for example, a low value of pi and the other state has a very high value of pi. This is what you get when you move from a hilly part of the space to a low part of the space: very big differences in the height of the stationary distribution at the two states. So what I'm drawing here is the height of pi, and this is my state space x. If I move around very far, I'm liable to get into situations where pi of one state x might be considerably larger than pi of x prime. And if you think about what that implies regarding the acceptance probability, the acceptance probability can get very, very low, which in turn implies slow mixing again, because you might be taking very global steps, but you're taking very few of them. And so this is a real tension in the design of proposal distributions Q.

Let's now take an example of a Markov chain where the proposal distribution can make a very big difference. This is a problem that you wouldn't necessarily immediately think of as a graphical model, but it can be formulated as one: it's a matching problem, and we'll see it in other settings as well.

So here we have a bunch of red dots, as we see over here, and a bunch of blue dots, and we want to match the red dots to the blue dots. There's some sort of preference function: these edges over here, which I'm going to mark in green, are annotated with some kind of weights that tell us how happy this red dot is to be matched with this blue dot.

And this happens a lot, for example, in various correspondence problems where we're trying to match sensor readings to objects, say in a tracking application. It also happens in image problems where you want to match features in one image to features in another, so it's a very common problem. So, one formulation of the matching problem as a graphical model is that we have a variable Xi for each red dot. The red dots are going to be the variables, and the blue dots are going to represent values. And we're going to have that Xi is equal to j if I match red dot i to blue dot j.

So that can now be written as a probability distribution over a space. We can consider a probability distribution over assignments of values to variables, and we're going to say, first of all, that the assignment has to be a matching. If not every red dot is matched to a different blue dot, that assignment has probability zero; this would be the case if you had two red dots matched to the same blue dot. Otherwise, the probability is going to be an exponential of the sum of the qualities of the matches between the red dots and the blue dots. So, we're summing up the quality of the edges that participate in the matching. So for example, take this matching over here. Well, actually here I'm going to match all of the red dots. So now we have a matching: for each red dot, we've assigned a blue dot as its value. And now the overall quality is the total sum of the qualities of the matched edges, which can be exponentiated to define a probability distribution.
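This distribution can be written down directly. A sketch, where the `quality` matrix of edge weights is a hypothetical input and `assignment[i]` is the blue dot assigned to red dot `i`:

```python
import math

def unnormalized_prob(assignment, quality):
    """P(x) is proportional to exp(sum of matched-edge qualities),
    and zero when the assignment is not a matching."""
    if len(set(assignment)) != len(assignment):   # two red dots share a blue dot
        return 0.0
    return math.exp(sum(quality[i][j] for i, j in enumerate(assignment)))
```

This is unnormalized, as is typical for such models; the partition function would sum over all legal matchings.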

And we're going to represent these qualities visually in this example as distances, so we'll assume, for example, that this red dot prefers to be matched to this blue dot more than to this further-away blue dot. So one could imagine running a Gibbs sampling chain on this model by taking a variable, say one of the red dots, and picking a new value for it, that is, a new blue dot that it wasn't matched

to. This is going to change things very, very

slowly, because most of the blue dots are going to be taken by a different red dot. And so, those assignments where a red dot is matched to an already-taken blue dot are going to have very low probability, zero probability in fact, so those moves are impossible. The only thing I can do is make the red dot match one of the unmatched blue dots, and so it's going to take a very long time for things to change.
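A single-site Gibbs update on this model makes the problem concrete. A sketch, using the same hypothetical `quality` matrix and `assignment` representation as before:

```python
import math
import random

def gibbs_step(assignment, quality, i, rng=random):
    """Resample X_i given all the other variables: only blue dots not
    taken by another red dot have nonzero conditional probability."""
    taken = {b for r, b in enumerate(assignment) if r != i}
    free = [b for b in range(len(quality[i])) if b not in taken]
    weights = [math.exp(quality[i][b]) for b in free]
    assignment[i] = rng.choices(free, weights=weights)[0]
    return assignment
```

When every blue dot is already matched, `free` contains only X_i's current value, so the step cannot change anything, which is exactly the slow behavior just described.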

Here is a different way of doing this, which uses Metropolis-Hastings with a particular way of constructing the proposal distribution, called an augmenting path proposal distribution. So what that does is,

let's assume that this is my current matching,

and I'm going to define the following proposal.
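One version of this proposal can be sketched in code before walking through it on the figure. This is an illustration under assumptions, not necessarily the lecture's exact construction: `assignment[i]` is red dot `i`'s blue partner, and for simplicity each displaced red dot picks its new blue dot uniformly, where the lecture picks probabilistically:

```python
import random

def augmenting_path_proposal(assignment, n_blue, rng=random):
    """Pick a red dot and give it a fresh blue dot as if all were free;
    each red dot displaced by that choice picks a new partner in turn,
    until the chain of reassignments closes and the matching is legal."""
    new = list(assignment)
    i = rng.randrange(len(new))      # red dot whose value we resample first
    start_blue = new[i]              # the blue dot freed up at the start
    while True:
        b = rng.randrange(n_blue)    # pick a blue dot, ignoring conflicts
        owner = next((r for r, v in enumerate(new) if v == b and r != i), None)
        new[i] = b
        if owner is None or b == start_blue:
            return new               # no one is displaced: legal matching again
        i = owner                    # the displaced red dot proposes next
```

Starting from a legal matching, the cascade always terminates in a legal matching.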

I'm going to pick one of the variables, Xi,

say this one. And I'm going to pick another value for it, pretending that everything is available. So I'm going to ignore the fact that these blue dots are already matched. I'm going to pick this one. I'm picking it probabilistically, so I could have picked a different one. I could have picked the same one that I was matched to before, or a further-away one; let's say I pick this one. Well, now, this poor red guy over here is unmatched, so he has to pick a new partner

and so, say it picks this one. Now this red guy is unmatched, and say it picks this one. That leaves me with this red guy, and say this red guy picks this one. Notice that now I have a new legal matching: I've closed the loop and defined what's called an alternating cycle, where basically I can switch all of the black edges to green edges, and that gives me a new legal matching, and that's my proposal. With a little bit of numerical

calculation, one can figure out the acceptance probability for this proposal

distribution and it turns out that it's actually really a good proposal

distribution. So let me show you some examples on the

next slide. So these are the results of the chains if I just apply Gibbs sampling. You can see four chains over here; the four colors represent the four chains. What you see on the x-axis is the number of steps, and on the y-axis is some

kind of statistic that I'm tracking in order to gauge whether the chain is

mixing. As it happens, this is the probability

that the first red dot is matched to the first blue dot,

but it doesn't matter, any statistic would have illustrated the

point here. And you can see that there are a lot of long-range fluctuations in the chain; that is, the probabilities change over time. The probabilities, by the way, are

computed over windows of size 500, from the samples in that window.
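The windowed statistic used in these plots can be sketched directly; `event` here is a hypothetical indicator function over sampled states:

```python
def windowed_estimate(samples, event, window=500):
    """Estimate P(event) in consecutive windows of samples, the
    diagnostic used to see whether a chain is mixing: for a
    well-mixed chain the estimates are flat across windows."""
    return [sum(map(event, samples[k:k + window])) / window
            for k in range(0, len(samples) - window + 1, window)]
```

Large swings between consecutive windows are the long-range fluctuations visible in the Gibbs plot.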

And you can see that the chain takes a long time to move from one region of the space to the other. The proposal distribution that I just showed you is, by contrast, totally different.

You can see that the probabilities are almost totally flat, all four chains are almost the same, and there

is very little change in windows over time.

And there's an even better proposal distribution that I didn't show you that

modifies the augmenting path by just a little bit and that gives you effectively

perfect mixing across the entire time. If we look at the other metrics that we

defined for evaluating mixing, we can compare the estimates of different statistics across two chains. So, for example, this might be the red chain and this might be the green chain.

And what we see here are the estimates that you get for different probabilities, for example the probability that this dot is matched to that dot.

And you can see that for the Gibbs chains, the estimates that you get are very noisy, and the different chains give you divergent estimates. The first of the Metropolis-Hastings proposals

gives you things that are almost on the diagonal, and here, things are

effectively exactly on the diagonal, perfect mixing.

But to summarize, Metropolis-Hastings is a very general framework for building

Markov chains that are designed to have a particular stationary

distribution. It requires that you come up with a proposal distribution, and that proposal distribution needs to have certain properties in order to be valid. But once you give me such a proposal

distribution, the detailed balance equation, which is this nice simple

equation, tells me how I can construct the acceptance distribution so as to enforce that we get the right stationary distribution.
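That construction can be checked numerically on a tiny discrete chain; `pi` and `Q` below are made-up numbers for illustration:

```python
def mh_transition_matrix(pi, Q):
    """Combine a reversible proposal Q with the Metropolis-Hastings
    acceptance probabilities into a transition matrix whose stationary
    distribution is pi."""
    n = len(pi)
    T = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for xp in range(n):
            if xp != x:
                accept = min(1.0, pi[xp] * Q[xp][x] / (pi[x] * Q[x][xp]))
                T[x][xp] = Q[x][xp] * accept
        T[x][x] = 1.0 - sum(T[x])    # rejected proposals stay put
    return T

pi = [0.2, 0.3, 0.5]                 # made-up stationary distribution
Q = [[0.2, 0.4, 0.4],
     [0.4, 0.2, 0.4],
     [0.4, 0.4, 0.2]]                # made-up reversible proposal
T = mh_transition_matrix(pi, Q)
# Detailed balance: pi[x] * T[x][xp] equals pi[xp] * T[xp][x] for every pair.
```

By construction, both sides of the detailed balance equation equal the minimum of pi(x) Q(x to x prime) and pi(x prime) Q(x prime to x), so the chain has pi as its stationary distribution.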

So, this is great in some ways because it gives us this huge flexibility in

designing proposal distributions that explore the space quickly and move around

to far away parts of the space. But as we saw, picking the proposal

distribution is actually an important design point and it makes a big

difference to performance. And there isn't really a science of how you should do this effectively, so this is something that is really an art and requires a nontrivial amount of thought and engineering.