0:00

One of the most generally useful classes of sampling methods, and one that's very commonly used in practice, is the class of Markov chain Monte Carlo methods. These are methods that allow us to design an intuitive sampling process that, through a sequence of steps, generates a sample from a desired target distribution that might be intractable to sample from directly.

So what are these Markov chains, and how do we use them? A Markov chain is a method for sampling from a distribution P that is intractable to sample from directly. And we have seen examples of such distributions P: if you'd like to sample from, say, a Bayesian network where we have some evidence, we don't really know how to do that; and if you'd like to sample from a Markov network, we don't really know how to do that either, in general. So here we have examples of distributions P, and we'd like to come up with a way of generating even one sample from such a distribution P. The Markov chain gives us a general mechanism for doing that. What the Markov chain does is define an iterative process by which the first sample that you generate is not going to be from the distribution P, but ultimately, as you move along, you get closer and closer to generating a sample from P.

So let's understand what that means. A Markov chain is defined over a state space, which we're going to use x's to denote. And here is an example of such a state space. It's a very simple state space: it has the zero state in the middle, four negative states to the left, and four positive states to the right.

A Markov chain defines a probabilistic transition model T(x → x') which, given that I'm at a given state x, tells me how likely I am to transition to a different state x'. For any x, that is a probability distribution, as indicated in this formula: the sum of the transition probabilities over the states you can move to is exactly one, that is, Σ_x' T(x → x') = 1.
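As a concrete sketch, the transition model of a finite chain can be written as a matrix whose rows each sum to one. The code below encodes the nine-state grasshopper chain described next (states -4 through 4 mapped to indices 0 through 8); the encoding is my own illustration, not part of the lecture.

```python
import numpy as np

# Transition matrix for the 9-state grasshopper chain
# (states -4..4 mapped to indices 0..8).
n = 9
T = 0.5 * np.eye(n)            # stay put with probability 0.5
T += 0.25 * np.eye(n, k=1)     # hop right with probability 0.25
T += 0.25 * np.eye(n, k=-1)    # hop left with probability 0.25
T[0, 0] += 0.25                # left wall: a blocked hop means staying
T[n - 1, n - 1] += 0.25        # right wall: likewise

# Each row is a distribution over next states: sum_x' T(x -> x') = 1.
assert np.allclose(T.sum(axis=1), 1.0)
```

Note that the two end states stay put with total probability 0.75, which matches the wall-bouncing behavior described below.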

So, for example, in this case we have our little grasshopper that starts out at state zero. We can see that it has a probability of 0.25 of transitioning to the right, a probability of 0.25 of transitioning to the left, and a probability of 0.5 of not making any progress and staying exactly where it is. In fact, that same transition distribution is replicated across the different states in the chain, with the exception of the states at the ends: if the poor grasshopper tries to go to the left when it's at negative four, it's going to hit the wall and end up staying where it is regardless. So the probability of staying, in that case, is actually 0.75, corresponding to the two different cases that we just talked about. But anyway, this is a Markov chain, and you can imagine simulating a random process by which a grasshopper traverses the chain. Initially, it starts out at, say, state zero. Then, as it moves, it selects to move left with probability a quarter, to move right with probability a quarter, and to stay in the same place with probability 0.5. Once it has moved, say, to the left, it does exactly the same thing from its new state.
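The random process just described is easy to simulate directly. Here is a minimal sketch; the function name and the clamping at the walls are my own encoding of the chain.

```python
import numpy as np

rng = np.random.default_rng(0)

def grasshopper_step(x):
    """One transition of the chain on states -4..4: move left or right
    with probability 0.25 each, stay with probability 0.5."""
    move = rng.choice([-1, 0, 1], p=[0.25, 0.5, 0.25])
    # A hop into a wall leaves the grasshopper where it is, which is
    # why the end states stay put with total probability 0.75.
    return max(-4, min(4, x + move))

x = 0                        # start at state zero
trajectory = [x]
for _ in range(20):
    x = grasshopper_step(x)
    trajectory.append(x)
```

Each call depends only on the current state, which is exactly the Markov property of the chain.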

So let's think about the temporal dynamics of this type of process. We can ask: what is the probability that, at time step t + 1, the state is equal to some value x'? We can get that via a recurrence relation that looks at the state at time t. If we have previously computed the probability distribution over where the grasshopper might be at time t, we can sum over all possible states x where the grasshopper might have been, and ask: if it was at state x at time t, what is the probability that it ends up going to x'? That is,

P(X(t+1) = x') = Σ_x P(X(t) = x) T(x → x').

The term inside the sum gives me a distribution over pairs (x, x'), which measures the probability that at time t the grasshopper is at state x and at time t + 1 it is at x'. And since I'm only interested now in asking about time t + 1, I sum out, or marginalize, the time-t state x. So if we go back to our grasshopper example, we can simulate this process; here are the first three steps of it.

At time zero, the grasshopper is at state zero with probability one. At time one, it's at negative one with probability a quarter, at positive one with probability a quarter, and with probability a half it's stuck at the same state. And now we can simulate the next step. At the next time step, the probability that it manages to move all the way to negative two is 0.25 squared, because that requires two successive moves to the left; likewise, over here, you have two successive moves to the right. At state zero you have a sum of different events: the probability that it stayed in the same state twice, which is 0.5 squared, plus the two events that correspond to moving to the left once and then back to the right, or moving to the right and then back to the left, each of which happens with probability 0.25 squared. So this is an example of how you would do this. Now, it turns out that for many of these chains (and we'll describe conditions in a moment), ultimately, as the process evolves, the probability distribution kind of equalizes, which means that the probability that at time t you're at state x' is almost the same as the probability that at time t + 1 you're at x'. So the distribution over where you might be tends to equalize. And so we can then consider what's called the limiting process, that is, the limiting distribution that you would get as you simulate the process for more and more steps.

That limiting distribution is typically denoted by π, which is called the stationary distribution. It also has a bunch of other names, but stationary is the most common. And if you plug it in, basically taking this approximate equality and making it exact, you can see that what we have is a condition on π: the probability of x' in the stationary distribution needs to be equal to the summation

π(x') = Σ_x π(x) T(x → x'),

where π now appears both on the left-hand side and within the summation on the right-hand side.

8:42

So, for example, π(x1) has to satisfy

π(x1) = 0.25 π(x1) + 0.5 π(x3),

because of the self-transition here plus the incoming transition from x3. That is, you can end up at x1 in one of two ways: either by starting out at x1 and staying there, or by starting out at x3 and moving to x1, which happens with probability 0.5. Similarly for π(x2): you can end up there either by being at x2 and staying there, which is this self-transition, or by being at x3 and moving to x2, which happens with probability 0.5.

And so this is a set of three simultaneous equations in three variables. By itself, this is an underdetermined system, because you could multiply all of the π's by a factor of 100 and it would still be a solution. But we can add one more constraint: the sum of all of the π's has to be one, because π is a probability distribution. And once you do that, you end up with a unique solution to the system of linear equations. It's not difficult to plug those π's into the system and confirm that this is indeed the stationary distribution that satisfies these equations.

By the way, for the grasshopper example that we showed on the previous slide, the stationary distribution is the uniform distribution. So this has to be one of the dumbest ways of generating a sample from the uniform distribution, but it does in fact eventually generate a sample from something that is very close to the uniform distribution.

So, when does a Markov chain converge to a stationary distribution? I said many of them do, but it turns out that not all of them do. A condition that guarantees convergence to a stationary distribution is something called regularity. A Markov chain is regular if the following condition holds, and notice the order of the quantifiers: there exists some fixed integer k such that, for every pair of states (this is the universal quantifier), the probability of getting from x to x' in exactly k steps is greater than zero. Now, notice what that means. It means that you pick the k first, so you can't have a different k for different pairs of states. You can pick k to be a thousand, or a million; it doesn't matter for this purpose. But you need to pick it first, and then there needs to be a way of getting from x to x' in exactly that number of steps with probability greater than zero. It turns out that this is a sufficient (not a necessary, but a sufficient) condition to guarantee that the Markov chain converges to a unique stationary distribution regardless of the start state. So it converges, and it converges to a single stationary distribution, characterized by the equations that we saw on the previous slide.
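For a finite chain, regularity can be checked mechanically: entry [x, x'] of the matrix power T^k is the probability of going from x to x' in exactly k steps, so the chain is regular iff some T^k is strictly positive everywhere. A sketch for the grasshopper chain, where k = 8 (the number of hops needed to cross the whole chain) works:

```python
import numpy as np

n = 9
T = 0.5 * np.eye(n) + 0.25 * np.eye(n, k=1) + 0.25 * np.eye(n, k=-1)
T[0, 0] += 0.25
T[n - 1, n - 1] += 0.25

# T^8 is strictly positive: every state reaches every state in exactly
# 8 steps with nonzero probability, so the chain is regular.
T8 = np.linalg.matrix_power(T, 8)
assert (T8 > 0).all()

# k = 7 is too small: crossing from -4 to 4 needs at least 8 hops.
assert np.linalg.matrix_power(T, 7)[0, n - 1] == 0
```

The self-transitions are what make a single k work for all pairs at once, which is exactly the sufficient condition discussed next.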

12:10

Now, what are some sufficient conditions for regularity? Because this one is a little bit hard to check. One sufficient condition for regularity that people often use in practice is, first, that every pair of states x and x' is connected by a path of probability greater than zero, and second, that every state has a self-transition, so you can always transition from a state to itself. If you think about that, it means that if you take k to be the diameter of this graph, that is, if you set k to be the distance between the furthest pair of states, then you can get between every pair of states using exactly k steps. Because you can take fewer than k steps and then just stick around for a while until you hit k, using the self-transitions. So this is a way of guaranteeing that, for this value k, you can get from every state to every state. And that's what people typically do: they typically add these self-transitions to guarantee regularity. So, to summarize, we've defined the notion

of a Markov chain, which defines a general dynamical system, or an iterative sampling process, from which we can sample trajectories that traverse the state space of the chain. Under certain conditions, such as the regularity condition that we defined (which is sufficient, although not necessary), this iterative sampling process is guaranteed to converge to a stationary distribution in the limit, that is, after we generate enough samples. And so this provides us with a general approach to sample from a distribution indirectly: if we have a distribution from which it's intractable to sample directly, this provides us with an alternative mechanism.