0:00

[MUSIC]

We finally have all the tools we need to build the general form of the expectation maximization algorithm, so let's start with the formulation of the problem.

Say we have a latent variable model: we have latent variables t, which are not observed (which is why they are latent), and variables x, which we do observe. We have the marginal likelihood p of x given the parameters, which can be expressed in terms of the model by summing the full joint distribution p of x and t given theta over all the values of t, so marginalizing out t. And we can write this full joint distribution as the product of the likelihood p of x given t and the prior p of t. So this is what the marginal likelihood looks like, and this is what we would like to maximize.
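In symbols, the marginal likelihood just described is (writing a single data point x for brevity; in some models the prior over t may also depend on theta):

```latex
p(x \mid \theta) \;=\; \sum_{t} p(x, t \mid \theta) \;=\; \sum_{t} p(x \mid t, \theta)\, p(t)
```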

So our problem here is to maximize the likelihood of our data set, the density of the data set given the parameters, with respect to the parameters. And this is the marginal likelihood, because we don't observe the latent variables, so we have to marginalize them out.

First step, as usually happens in these kinds of problems: let's take the logarithm of our likelihood, because it will turn products into summations, and make life easier because of that.

Next step is to assume that our data set consists of n objects which are independent given the parameters, which is fairly standard in machine learning problems. And then we can turn the product into a summation by using the property of the logarithm.

And we have a function like this, which we want to maximize with respect to the parameters. One more step: we can substitute the marginal likelihood of each data object xi by its definition, which is the sum of the joint distribution over the values of ti. And finally, we have a function like this, which we want to maximize.
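Putting these steps together, the objective just described is (assuming n objects x_1, ..., x_n that are independent given theta):

```latex
\log p(X \mid \theta)
\;=\; \sum_{i=1}^{n} \log p(x_i \mid \theta)
\;=\; \sum_{i=1}^{n} \log \sum_{t_i} p(x_i, t_i \mid \theta)
\;\longrightarrow\; \max_{\theta}
```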

And as we discussed in the previous module, in the Gaussian mixture model case, this can in principle be maximized with stochastic gradient descent. But sometimes that's not the optimal choice, so we're going to do something else, something better.

Well, the general idea is as follows: let's build a lower bound for this thing by using Jensen's inequality. I won't give you the details here, they will come in the next few slides, but let's assume that we can lower-bound this thing by some function which is easy to optimize. In this case, instead of maximizing the original marginal likelihood, we can maximize its lower bound instead, and we get a picture like this.

So this is our marginal likelihood; let's imagine that it looks like this, and here we assume that it has just one parameter theta. So the parameters are one-dimensional, which in practice is usually not the case, since you can have millions of parameters. But one dimension is much easier for demonstration purposes, because it's much easier to plot graphs there. And the marginal likelihood has the optimal parameter theta star, which we would like to obtain, but as we have already discussed, that's going to be hard. So we cannot hope to obtain this theta star in reasonable time, and we'll be happy to find at least a local maximum of this function.

Now we can lower-bound this function: we can find a function L which is less than or equal to our marginal likelihood at any given point, and we will build it in such a way that it's easy to maximize. So we can find its maximum, theta hat, and use that maximum as a kind of proxy for the maximum of the original likelihood, the blue curve.

Well, it sounds reasonable, but it turns out that we actually don't have any guarantees. The theta hat, the optimum of the lower bound, can be an arbitrarily bad approximation of the maximum of the blue curve. Here, for example, the theta hat lands almost in a local minimum of the marginal likelihood, and that's really bad, right? We wanted to maximize the blue curve, and the point we found, which we'd like to use as its approximate maximum, is near a local minimum.

So we have to do something better, something else. Well, let's revise this idea of lower-bounding: instead of one lower bound, let's try to build a family of lower bounds. Then we'll be able to choose, among these lower bounds, the one that suits us best at the current iteration. So how do we do it?

Well, let's introduce some weights, some lever which we'll be able to change in order to change our lower bound. So let's introduce any distribution q on the latent variable t we want, and let's multiply and divide our function by this q. It doesn't change anything, right, because we just multiplied this thing by 1.

But now we can treat this q as the weights in Jensen's inequality. So we can treat this q as the alphas, and we can treat the ratio of the joint distribution p of x and t divided by q as the points v. And now we can apply Jensen's inequality for the logarithm to this function to build a lower bound, which will look like this. Notice that it now depends on q, so we kind of succeeded: we built a lower bound which depends both on theta and on q, and we can now change q to obtain different lower bounds, to choose the lower bound we would like to use at the current iteration.
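Written out, the trick just described: multiply and divide by q(t) inside the logarithm, then apply Jensen's inequality for the concave logarithm, with the q(t) playing the role of the weights alpha (they are non-negative and sum to 1):

```latex
\log p(x \mid \theta)
\;=\; \log \sum_{t} q(t)\, \frac{p(x, t \mid \theta)}{q(t)}
\;\ge\; \sum_{t} q(t) \log \frac{p(x, t \mid \theta)}{q(t)}
\;=:\; \mathcal{L}(\theta, q)
```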

So now the picture looks like this: we have, again, our marginal likelihood, the blue curve. And we also have a family of lower bounds, like the orange, the red, and the green one, and we can choose among them to find the best one. For example, we can choose the red one and maximize it with respect to theta, and its theta hat will be a reasonably good approximation of the maximum of the blue curve; it will be somewhere in a good position. So how can we use this family of lower bounds in practice?

Well, let's use a kind of iterative approach. Let's start with any point theta k, the point from the previous iteration, or maybe from initialization, and let's try to improve it in one iteration. So we have a family of lower bounds; among this family, let's find the lower bound which is maximal at this current point theta k. Which means: let's find q such that the value of the lower bound L at the point theta k and q is maximal. And ideally, if we are able to maximize this thing to the optimum, we'll get something like this.

Because this red curve, the red lower bound, cannot be higher than the blue curve, the best possible thing we could hope for is that they touch, that the lower bound becomes exact at the current point theta k. And now we can go to the maximum of this lower bound, to the point theta k plus 1, which is the maximum of the red curve. And we hope that the lower bound is easier to maximize, so we can easily do these steps and iterations.

You can also interpret the thing we have just discussed in a block-coordinate sense. On the first substep, we fix theta k and we maximize our lower bound with respect to q, with respect to the distribution. On the next substep, we fix this best q and we maximize our lower bound with respect to theta. So we fix one and optimize with respect to the rest; that makes sense, right? Because we have a lower bound that depends on theta and q, let's find its maximum with respect to both sets of parameters. And how to do it? Well, why not use block coordinate ascent, why not do iterations like: fix one, optimize the rest?

And also note that this L, this lower bound, is sometimes called the variational lower bound, because you can vary it: it depends on a parameter q which we didn't have in the original model. We can do another step of this iteration: now we're at the point theta k plus 1, and we can again find the best lower bound at this particular point, and hopefully we'll find a lower bound which touches the blue curve at this point. Then we can go to the maximum of this new red curve to obtain the next point, theta k plus 2.

So to summarize what expectation maximization suggests: first of all, we build a lower bound on the marginal likelihood, which depends both on theta, which is what we want to maximize (the blue curve, the marginal likelihood), and on the parameter q, which is something we just introduced; it is sometimes called the variational parameter. And it suggests optimizing this lower bound in iterations, by repeating two steps until convergence. On the E-step, fix theta and maximize the lower bound with respect to q. And on the M-step, fix q and maximize the lower bound with respect to theta.
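As an illustration of these E and M iterations, here is a minimal sketch for a mixture of two 1-D Gaussians. This is not the lecture's code; the synthetic data and all names are made up for the example. The E-step sets q(t_i) to the posterior p(t_i | x_i, theta), which makes the bound touch the marginal likelihood at the current theta, and the M-step re-estimates theta with q fixed:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data drawn from two components (illustrative choice).
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])

# Initial parameters theta_0: mixture weights p(t), means, variances.
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

def component_densities(x, pi, mu, var):
    # pi_t * N(x_i | mu_t, var_t), shape (n, 2)
    return pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def log_likelihood(x, pi, mu, var):
    # log p(X | theta) = sum_i log sum_t p(x_i, t | theta)
    return np.log(component_densities(x, pi, mu, var).sum(axis=1)).sum()

ll_history = []
for _ in range(30):
    # E-step: q(t_i) = p(t_i | x_i, theta), the best lower bound at current theta.
    dens = component_densities(x, pi, mu, var)
    q = dens / dens.sum(axis=1, keepdims=True)
    # M-step: maximize the lower bound over theta with q fixed.
    nk = q.sum(axis=0)
    pi = nk / len(x)
    mu = (q * x[:, None]).sum(axis=0) / nk
    var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    ll_history.append(log_likelihood(x, pi, mu, var))
```

Each iteration is guaranteed not to decrease the marginal log-likelihood, which is exactly the staircase picture from the slides: make the bound touch the blue curve, then walk to the bound's maximum.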

So this is the general view of expectation maximization. In the next videos, we will discuss how to perform these E and M steps efficiently.

[MUSIC]