0:00

[MUSIC]

We finally have all the tools we may need to build the general form of

the expectation maximization algorithm, so

let's start with the formulation of the program.

Say we have a latent variable model, so

we have latent variables t which are not observed, which are latent.

And variables x which we observe, and

we have marginal likelihood p of x given parameters.

Which can be expressed in terms of the model, the likelihood,

and the parameter, as summing the full joint distribution p of x and

t given theta, with respect to all the values of t.

So marginalizing out t, and we can write down this full

joint likelihood as a product of the p of x and t and the prior p of t.

So this is how the marginal likelihood looks like, and we would like to maximize.

So our problem here is to maximize the likelihood of our data set.

The density of the data set, the parameters with respect to parameters.

And this is marginal likelihood because we don't have zero latent variables,

so we have to marginalize the amount.

First step, as it usually happens in these kind of problems,

let's take the logarithm of our likelihood.

Because it will churn products into summations, and

make life easier because of that.

Next step is to assume that our dataset consists of n objects,

which are independent given the parameters,

which is kind of standard in all machine learning problems.

And finally,

we can churn product into summation by using the property of the logarithm.

And we have a function like this,

which we want to maximize with respect to parameters, right?

One more step, we can substitute the marginal likelihood of

the data object xi by its definition,

which is sum of the joint distribution respect to the values of ti.

[COUGH] And finally, we have a function like this which you want to maximize.

And as we have discussed in the previous module, on the Gauss [INAUDIBLE] model

case, it can be in principle maximized with stochastic gradient descent.

But sometimes it's not the optimal choice to do,

so we're going to do something else, something better.

Well, the general idea would be as follows, let's build a lower bound for

this thing by using the Jensen's Inequality.

I don't give you details here, they will come in the next few slides, but

let's assume that we can lower down this thing

by some function which is easy to optimize.

In this case, instead of maximizing this original margin of likelihood,

we can maximize its lower bound instead, and we can get the picture like this.

So this is our margin of likelihood, let's imagine that it looks like this,

and here we assume that it has just one parameter theta.

So the parameters are one dimensional, which in practice is usually not the case,

so you can have millions of parameters.

But it's much easier for demonstration purposes, so

it's much easier to plot some graphs in one dimension.

And the marginal likelihood has the optimal parameter theta star, which we

would like to obtain, but as we have already discussed, it's going to be hard.

So we cannot hope to obtain this theta star in reasonable time,

so we'll be happy to find at least a local maximum of this function.

Now we can lower bound this function, so we can find a function l,

which is less than or equal to our module of likelihood at any given point,

and it will build it in such a way that it's easy to maximize.

So we can find its maximum theta star, and use that maximum as kind of a proxy for

the maximum of the original likelihood of the blue curve.

Well, it's kind of reasonable, but

it turns out that actually we don't have any guarantees.

The theta hat, the optional of the lower bound,

can be arbitrarily [INAUDIBLE] to the curve.

Here, for example, the theta hat almost appears

in the local minimum of the local likelihood,

and it's really bad, right.

Like, we wanted to maximize the blue curve, and we found something

which we'd like to use as its approximate maximum, it's a local minimum.

So we have to do something better, have to do something else, well,

let's revise this idea of lower-bounding.

Instead of one lower bound, let's try to build a family of lower bounds.

And then we'll be able to choose among these lower bounds, the one that suits

us best at this current iteration or moment, or whatever, so how to do it?

Well, let's introduce some weights,

some lever which we'll be able to change, to change our lower bound.

So let's introduce some some any q on the latent variable t we want,

and let's multiply and divide our function by this q.

It doesn't change anything, right, because we just like multiplied this thing by 1.

But now we can treat this q as weights in the Jensen's inequality.

So we can treat this q as alphas, and we can treat the joint

distribution v of x and t, joined by q, as the points as v.

And now we can apply Jensen's inequality for

the logarithm to this function to build a lower bound.

Which will look like this, and notice that it now depends on q, right, so

we kind of succeed.

We built a lower bond, which depends both on the theta along q,

which we can now change to obtain different lower bounds.

To choose the lower bound we would like to use on this current iteration,

or whatever we have.

So now the picture looks like this, we have, again,

our marginal likelihood of the blue curve.

And we also have a family of lower bounds, like the orange, the red, and

the green one, and we can choose among them to find the best one.

For example, we can choose the red one and maximize it with respect to theta.

And its theta hat will be a reasonably good approximation of the maximum

of the blue curve.

It will be somewhere in a good position, so

how can we use this, this family of lower bounds, in practice?

Well, let's use this kind of iterative approach, so

let's start with any point theta k.

So the point from the previous iteration, or

maybe from initialization, and let's try to prove it in one iteration.

So we have a family of lower bounds, let's, among this family,

find the lower bound which is maximum at this current point theta k.

Which means, let's find q such that the value of a lower

bound l of the point theta k and q is maximum.

And ideally, if we are able to maximize this thing to the optimum,

we'll get something like this.

Because this red curve, the red lower bound,

cannot be higher than the blue curve.

So the best possible thing we could hope for is that they attach,

that the lower bond becomes accurate at the current point theta k.

And now we can go the maximum of this lower bound,

so to the point theta k plus 1, which is maximum of the red curve.

And we hope that the lower bound is easier to maximize, so

it can easily do the steps and iterations.

And also, you can interpret the same thing we have just

discussed in kind of sense manner.

So on the first sub step, we're fixing theta k, and

we're maximizing our lower bound to respect of q with respect to distribution.

On the next half step, we fix this best q and

we maximize our lower bound with respect to theta.

So we fix one, we optimize with respect to the rest, that makes sense, right?

Because we have lower bound that depends on theta and q, so

let's find its maximum with respect to both sets of parameters.

And how to do it, well, why not to use block coordinate intercept,

why not to do iterations like fix one, optimize the rest?

And also note that this thing, this l, this lower bound,

is sometimes called variational lower bound, because you can vary it.

It depends on some parameter q, which we didn't have in the original model.

You can do another step of this iteration, so now we're at the point theta k plus 1.

And we can also find the best lower bound at this particular point, and

hopefully we'll find a lower bound which touches the blue curve at this point.

Then we can go to the maximum of this red curve, so

to obtain the next point theta k plus 2.

So to summarize, expectation maximization suggests you, so

first of all, we built a lower bound on the local likelihood.

Which depends both on the theta which one to maximize, the blue curve,

the local likelihood, and the parameter q which is something we just introduced.

And it's also sometimes called the variational parameters, and

it suggests you to optimize this lower bound in iterations by repeating the two

steps until convergence.

On the E-step, fix theta and maximize with respect to q,

maximize the lower bound with respect to q.

And on the M-step, fix q and maximize the lower bound with respect of theta.

So this is the general view of the expectation maximization.

In the next videos, we will discuss how to obtain these E and M steps efficiently.

[MUSIC]