So, let's summarize what we learned so far.

We discussed Markov Chain Monte Carlo methods,

and the Monte Carlo part is about approximating expected values by sampling.

And the main question here is how to sample.

How can we generate samples from

a complicated distribution that we may know only up to a normalization constant,

which is usually the case in probabilistic modeling?

To do that we discussed two main methods.

Both of them are coming from the Markov Chain Monte Carlo family,

and the idea of Markov Chain Monte Carlo is to build a dynamic system,

such that if you simulate it long enough,

its states start to look like samples from the desired distribution.

This dynamic system is called a Markov Chain,

and we discussed two ways to build a Markov Chain that

converges to the distribution you want to sample from.

The first method here is Gibbs sampling,

which reduces the problem of sampling from

a multidimensional distribution to a sequence of one-dimensional samplings.

And one-dimensional samplings are usually much easier,

because, first of all, we have some efficient

all-purpose methods for one-dimensional sampling, like rejection sampling.
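To make the rejection sampling idea concrete, here is a minimal sketch in Python with NumPy. The function name, the uniform proposal on a fixed interval, and the bound on the unnormalized density are all illustrative assumptions, not from the lecture; note that the normalization constant is never needed.

```python
import numpy as np

def rejection_sample(log_p_tilde, n_samples, rng, low=-5.0, high=5.0, bound=1.0):
    """Rejection sampling for a 1-D density known only up to a constant.

    Uses a Uniform(low, high) proposal and assumes the unnormalized
    density never exceeds `bound` on that interval.
    """
    samples = []
    while len(samples) < n_samples:
        x = rng.uniform(low, high)
        # accept x with probability p_tilde(x) / bound
        if np.log(rng.uniform()) < log_p_tilde(x) - np.log(bound):
            samples.append(x)
    return np.array(samples)

rng = np.random.default_rng(0)
# unnormalized N(0, 1) log-density: the constant 1/sqrt(2*pi) is dropped
samples = rejection_sample(lambda x: -0.5 * x ** 2, 10000, rng)
print(samples.mean(), samples.std())  # roughly 0 and 1
```

Accepted points are distributed according to the target density, because a candidate survives with probability proportional to the unnormalized density at that point.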

Second of all, sometimes you can even compute the normalization constant,

because integrating things in one dimension is much easier.

And finally, sometimes you can even find

analytical solutions to this one-dimensional sampling problem,

if your original multidimensional distribution is structured enough.
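Here is a small sketch of Gibbs sampling on a case where the one-dimensional conditionals are known analytically: a standard bivariate normal with correlation rho. The function name and parameters are my own choices for illustration.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples, seed=0):
    """Gibbs sampling for a standard bivariate normal with correlation rho.

    Each full conditional is one-dimensional and Gaussian:
        x | y ~ N(rho * y, 1 - rho^2),  y | x ~ N(rho * x, 1 - rho^2)
    """
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0  # arbitrary starting state
    sd = np.sqrt(1.0 - rho ** 2)
    samples = np.empty((n_samples, 2))
    for i in range(n_samples):
        x = rng.normal(rho * y, sd)  # resample the first coordinate
        y = rng.normal(rho * x, sd)  # resample the second coordinate
        samples[i] = x, y
    return samples

samples = gibbs_bivariate_normal(rho=0.9, n_samples=20000)
# after discarding burn-in, the empirical correlation should be near 0.9
print(np.corrcoef(samples[1000:].T)[0, 1])
```

Each update only ever draws from a one-dimensional Gaussian, which is exactly the reduction described above.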

But we also discussed a downside of Gibbs sampling:

it generates highly correlated samples.

This means that it converges slowly, and that you have to use lots

and lots of samples to approximate your expected value accurately enough.

The second method we discussed here is Metropolis Hastings,

and it gives you a whole family of Markov Chains,

which all converge to the desired distribution you want to sample from,

and it kind of gives you more freedom,

you can choose, right?

But with this freedom comes great responsibility. You have to choose wisely.

If you choose a Markov Chain that is not well suited for your purpose,

for example one that converges slowly or, again, produces highly correlated samples,

then you will waste lots of resources and

the method will not be very efficient.
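A minimal Metropolis Hastings sketch, using the common random-walk variant with a symmetric Gaussian proposal (the function name, the step size, and the burn-in length are my own illustrative choices):

```python
import numpy as np

def metropolis_hastings(log_p, step, n_samples, x0=0.0, seed=0):
    """Random-walk Metropolis: Gaussian proposal with scale `step`.

    `log_p` only needs to be the log-density up to an additive constant,
    because the acceptance test depends on differences of log_p.
    """
    rng = np.random.default_rng(seed)
    x = x0
    samples = np.empty(n_samples)
    for i in range(n_samples):
        proposal = x + step * rng.normal()
        # symmetric proposal, so the acceptance ratio is just p(proposal)/p(x)
        if np.log(rng.uniform()) < log_p(proposal) - log_p(x):
            x = proposal
        samples[i] = x
    return samples

# unnormalized log-density of N(2, 1); the normalization constant is ignored
samples = metropolis_hastings(lambda x: -0.5 * (x - 2.0) ** 2,
                              step=1.0, n_samples=50000)
print(samples[5000:].mean())  # should be close to 2
```

The freedom mentioned above shows up in the `step` parameter: too small a step gives highly correlated samples, too large a step gets most proposals rejected, and either way the chain becomes inefficient.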

So let's put these Markov Chain Monte Carlo methods in context.

Let's compare it to the variational inference we covered in the previous week.

Again, Monte Carlo is about approximating expected values,

that is, substituting them with averages over

samples obtained from your distribution.

And the main property here is that this approximation is unbiased.

So, the longer you wait,

the more accuracy you get,

and you can make the error as small as you want if you're willing to wait long enough.

If you want to express this property a little bit more mathematically,

you can write it down like this.
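One way to write the property down (the notation here is my own, and the slide's may differ): with samples $x_1, \dots, x_N$ drawn from $p(x)$ and the Monte Carlo average $\hat{F}_N$ of a function $f$,

```latex
\hat{F}_N = \frac{1}{N}\sum_{i=1}^{N} f(x_i),
\qquad
\mathbb{E}\!\left[\hat{F}_N\right] = \mathbb{E}_{p(x)}\!\left[f(x)\right],
\qquad
\hat{F}_N \xrightarrow{\;N \to \infty\;} \mathbb{E}_{p(x)}\!\left[f(x)\right].
```

The middle equation is the unbiasedness property, and the last one says the average converges to the true expected value as the number of samples grows.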

So, this average, this approximation is a random variable itself, right?

So if you repeat this process of obtaining samples and

then averaging them to get an approximation,

you'll get another approximation.

So it's a random variable,

taking different values on different sets of samples.

And an estimate like this is called unbiased,

if its expected value is equal to the thing you want to approximate.

So, on average your approximation is correct.

And if you wait long enough this approximation will converge to the right answer.
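The two properties, unbiasedness and convergence, can be checked numerically on a toy example. Here the true expected value is $E[x^2] = 1$ for a standard normal; the setup (the sample sizes and the number of repetitions) is my own choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(n):
    """Monte Carlo estimate of E[x^2] = 1 for x ~ N(0, 1) from n samples."""
    x = rng.normal(size=n)
    return (x ** 2).mean()

# Repeating the estimation gives a different value each time: the estimator
# is a random variable. Its mean stays on the true value (unbiasedness),
# while its spread shrinks roughly as 1/sqrt(n).
stats = {}
for n in (10, 1000, 100000):
    estimates = np.array([mc_estimate(n) for _ in range(200)])
    stats[n] = (estimates.mean(), estimates.std())
    print(n, *stats[n])
```

For every sample size the average of the repeated estimates sits near 1, while the scatter around it shrinks as the sample size grows.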

And this is something that is not true for Variational Inference.

So, in Variational Inference,

we're trying to approximate the distribution we

want to sample from or we want to work with,

with a distribution from some simple family,

for example the family of factorized distributions.

And of course, over iterations, Variational Inference will become better and better,

because it finds a better and better approximation of

your distribution within this class of factorized distributions.

But you can't get your error to zero,

because you can never approximate

your complicated distribution arbitrarily well with a factorized one.

So, a typical picture describing this situation looks like this.

You have your problem and you are trying to solve it with Markov Chain Monte Carlo,

and it works, giving you a solution that gets

better and better with time, with the error approaching zero.

Then you use Variational Inference,

it usually gives you a nice solution much faster, but it stagnates.

It can't get past some critical error value and never gets to zero.

So, to put this Markov Chain Monte Carlo even more into context,

we can recall the table we had at the end of week three about Variational Inference:

the table of approximate methods for solving

the probabilistic modeling problems we have covered so far.

So first of all, there is Full Bayesian Inference,

where you model all your parameters and latent variables as latent variables.

So, you don't train anything in the usual sense.

You're not trying to find the best values for the parameters,

but instead you're trying to marginalize out everything you don't know.

This one theoretically gives the best results.

But for most practical models,

this Full Bayesian Inference isn't feasible,

and you have to do some approximations.

And one of them is to do mean-field variational inference,

Where you assume that your posterior distribution is factorized,

and you're trying to approximate

your posterior distribution within this factorized family as well as you can.

Now we have just added one more method here.

We can approximate this Full Bayesian inference

by sampling from the posterior distribution,

and using these samples to

estimate the values you care about.

So, we can use Gibbs sampling or Metropolis Hastings here to sample a few points

from the posterior distribution over

all the latent variables, and then use them for the approximation.

Then, if you can't do that, or if you don't want to because the process is slow,

you can use the expectation maximization algorithm.

Which basically tries to find the maximum likelihood estimate of the parameters theta.

So it doesn't treat theta as a latent variable,

but instead treats theta as parameters,

and tries to find the best theta by using some iterative scheme.

And this expectation maximization algorithm is also sometimes intractable.

So, sometimes it's very hard to do E or M step.

And in these cases you can again apply mean-field

variational inference to expectation maximization.

So basically on the E step,

you can approximate your posterior,

your variational distribution q, with a factorized one,

or again, you can do Markov Chain Monte Carlo for EM.

So on the E step,

instead of working with

the distribution q directly

and trying to compute expected values with respect to this distribution,

you can draw samples from this posterior over the latent variables,

and then use the samples to approximate the expected value on the E step,

and maximize this approximation instead of the original expected value.
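Here is a toy sketch of that Monte Carlo EM idea on a two-component Gaussian mixture with known unit variances and equal weights, where only the component means are unknown. Everything here (the data, the single posterior sample per point replacing the full E-step expectation, the iteration count) is an illustrative assumption, not the lecture's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data: two Gaussian clusters (unit variances, equal weights assumed known)
data = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 1.0, 500)])

mu = np.array([-1.0, 1.0])  # initial guesses for the unknown component means
for _ in range(50):
    # Monte Carlo E-step: sample a component assignment z_i for every point
    # from the posterior p(z_i = k | x_i, mu), instead of keeping the full
    # distribution over assignments
    logits = -0.5 * (data[:, None] - mu[None, :]) ** 2
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    z = (rng.uniform(size=len(data)) < probs[:, 1]).astype(int)
    # M-step: maximize the sampled-completion likelihood -> per-component means
    for k in (0, 1):
        if np.any(z == k):
            mu[k] = data[z == k].mean()

print(np.sort(mu))  # should end up near (-2, 3)
```

The sampled assignments stand in for the intractable expectation over latent variables, and the M step then maximizes this sampled approximation exactly as described above.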

As a final summary,

Markov Chain Monte Carlo is a method that allows you

to do training or inference in probabilistic models,

and it's really easy to implement.

It's really easy to parallelize, at least in the sense that if you have 100 computers,

you can run 100 independent Markov Chains, for example one on each computer,

and then combine the samples obtained from all these servers.
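A small sketch of this chain-level parallelism: several independent chains with different seeds, whose post-burn-in samples are simply pooled. Here the chains run sequentially for illustration; in the setting described above, each `run_chain` call would execute on its own machine. The sampler, target, and parameters are my own illustrative choices.

```python
import numpy as np

def run_chain(seed, n=20000, burn_in=2000):
    """One random-walk Metropolis chain targeting N(0, 1)."""
    rng = np.random.default_rng(seed)
    x, out = 0.0, np.empty(n)
    for i in range(n):
        prop = x + rng.normal()
        # log acceptance ratio for the standard normal target
        if np.log(rng.uniform()) < 0.5 * (x * x - prop * prop):
            x = prop
        out[i] = x
    return out[burn_in:]

# four independent chains with different seeds; only their samples are combined
pooled = np.concatenate([run_chain(seed) for seed in range(4)])
print(pooled.mean(), pooled.std())  # roughly 0 and 1
```

Because the chains never communicate, this parallelization needs no synchronization at all, which is what makes it so easy.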

And finally, it gives you unbiased estimates.

So in principle, if you wait long enough you can make the error as small as you may want.

On the negative side,

these Markov Chain Monte Carlo methods are sometimes slower than the others.

But because they're so widely applicable,

sometimes they're just the only choice.

So, in the next videos we will cover

a few practical examples of how you can apply Markov Chain Monte Carlo to your problems.