In this video, we'll discuss a little bit of

details about the M-step or maximization step.

Recall that on the previous video,

we derived variational lower bound for our marginal log likelihood.

So, we wanted to maximize marginal log likelihood,

but instead we are going to maximize this lower bound.

And, on the M-step, we want to maximize

this lower bound with respect to theta or parameters.

And the lower bound looks like this.

So, it's some with respect to the objects in the data set,

then some with respect to the values of the late variable G,

and then some weighted logarithms of the ratios.

Now, we can use the property of the logarithm,

that logarithm of the ratio is difference between logarithms,

and we'll get something like this.

And note that this second term in this expression,

it is constant with respect to theta,

so it doesn't depend on theta at all.

Which means that then we can start

concerned about maximizing this value with respect to theta.

We can throw this term away and nothing will change.

So we can just ignore this and it will not change our M-step,

which means that we can maximize the first term instead of the full lower bound.

And the first term you can rewrite by using the following connotation,

just an expected value of

the Joint Distribution P of X and T with respect to the variational distribution Q.

One more thing to notice here is that

this function which we are trying to maximize

with respect to theta on the M-step is concave.

Maybe not always, but usually you choose your P of X given T and P of T yourself,

so you can choose this function to be such that the [inaudible] would fit is a concave function.

Which means that it's easy to maximize and note that

expect that the value of the concave function is still a concave function.

So, for example in the Gaussian [inaudible] case,

this joint probability pure factor T is Gaussian and logarithm of Gaussian is

just some linear function with respect to parameters minus-

at least you can prove that

logarithm of the Gaussian is concave for respect to parameters.

Which means that first of all,

this function has just unique global optimum.

Well, at least all local maximas are the same as

global maxima and maybe really easy to maximize.

And usually you can even derive the analytical expression of this M-step because

this program is so much easier than the original one in most cases.

So, to summarize the expectation maximization algorithm

repeats the following two steps in iterations until convergence.

So, first of all, the E-step where we are trying to minimize

the KL distance between

the variation distribution Q and

the posterior distribution P of T U in the data and the parameters.

And this is the same as just setting Q to be the posterior distribution itself.

And then the M-step,

where we're maximizing the expected value of

the logarithm of the joint distribution with respect to parameters theta.

So, now let's say a few words about the convergence properties of this EM algorithm.

It turns out that we can prove that it converges to

a local maximum or maybe at least to a critical point.

So, to see it, let's look at the following picture.

On the previous situation theta K,

we have chosen among all the family of lower bounds.

We have chosen the best lower bound at the current point,

which is accurate at the current point.

So, on the next M-step,

we went to the maximum of this lower bound,

of this red curve, to get theta K plus one.

Let's look how the marginal log likelihood

changed after switching from theta K to theta K plus one.

So, marginal log likelihood at the point theta K plus one is greater than or equal to

the lower bound at the point theta K plus one just because its lower bound,

so it should always hold for any point, right?

Then we can say that the value of

this lower bound at the current point, the theta K plus one,

is greater than or equal to the value of the lower bound at the previous point,

just because theta K plus one is the maximum of the lower bound with respect to theta.

So it's greater than or equal to the value of the lower bound at any other point, right?

And finally, because we have chosen Q to be such that the lower bound is optimal in

the point theta K then this lower bound at the point theta K

just equals to the log likelihood at the point theta K. So,

finally, we obtained such an equation that tells us that

the marginal log likelihood never decreases during an iteration of the EM algorithm.

Which basically means that the EM algorithm never make things worse.

And so there are two outcomes of this observation.

First of all, it's a really useful debugging tool.

So if you ever see in your implementation of the EM,

that the marginal log likelihood decreases,

then you have a bug and you have to fix it somehow.

And second of all, by using this kind of considerations,

you can prove that the EM algorithm always converge.

It may be a local maximum,

it maybe some other critical point,

but at least it's not worse than the gradient descent in this respect.

So, it's always converses at some critical point.