We described the expectation maximization algorithm and gave the formulas for how to use it, but we didn't talk about how it might perform in practice, so let's talk about that a little bit. So first, let's look at how EM behaves as

a function of the number of iterations. So over here on the x axis we see the

number of iterations going from zero to 50 and on the y axis we have the log

likelihood per training instance, that is, the log likelihood averaged over all the training instances, which is the objective that the algorithm is trying to optimize. Remember, EM is a maximum likelihood

optimization algorithm. So what we're showing here is effectively how the objective improves. And the first observation is one that we effectively stated as a theorem: the log likelihood increases consistently, monotonically, over the iterations t. So that's good; that shows our theorem is correct.
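As a minimal sketch of this monotonic behavior (a toy model, not the one from the figure), here is EM for a two-component one-dimensional Gaussian mixture with unit variances and fixed equal mixing weights, tracking the average log likelihood at each iteration:

```python
import math

def em_gmm_1d(xs, mu, iters=20):
    """EM for a toy 2-component 1-D Gaussian mixture with unit variances
    and fixed equal mixing weights; records the average log likelihood
    (the objective) at each iteration."""
    avg_ll = []
    for _ in range(iters):
        # E step: responsibility of component 0 for each point,
        # accumulating the current log likelihood as we go.
        resp, ll = [], 0.0
        for x in xs:
            p0 = 0.5 * math.exp(-0.5 * (x - mu[0]) ** 2)
            p1 = 0.5 * math.exp(-0.5 * (x - mu[1]) ** 2)
            resp.append(p0 / (p0 + p1))
            ll += math.log((p0 + p1) / math.sqrt(2 * math.pi))
        avg_ll.append(ll / len(xs))
        # M step: each mean becomes the responsibility-weighted average.
        w0 = sum(resp)
        mu = [sum(r * x for r, x in zip(resp, xs)) / w0,
              sum((1 - r) * x for r, x in zip(resp, xs)) / (len(xs) - w0)]
    return mu, avg_ll
```

Because each M step exactly maximizes the expected complete-data log likelihood over the means, the recorded `avg_ll` sequence never decreases, exactly as the theorem states.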

But let's understand what happens now as the algorithm continues to evolve.

We see that in the earlier iterations, up to about, I don't know, iteration eight or ten in this context, the log likelihood exhibits a very sharp improvement. But then it asymptotes, and so we might think that when we get to iteration ten or so, right around here, maybe there's no point continuing to run, because there's no more increase and we've achieved convergence and we might as well stop.

Turns out that's not quite right. What we see here is, for the same model, the same kind of graph: the x axis is iterations, and the y axis is the value of different parameters in the model. And over here we see the same iteration ten, and we see that although the log likelihood no longer continues to improve after iteration ten, as we can see on the left-hand graph, various parameter values actually exhibit extremely sharp changes after iteration ten. Specifically, the blue parameter goes down significantly and the black parameter goes up significantly.

And so we see that convergence in likelihood space, which is what we have on the left, is not the same as convergence in parameter space, which means that if we're trying to gauge convergence, we probably ought to look at convergence in parameter space. However, it's also important to realize that carrying EM through to convergence is not necessarily the best idea.
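One way to act on this observation, as a sketch (the tolerances here are illustrative, not from the lecture): declare convergence only when both the objective and the parameters have stopped moving.

```python
def converged(ll_prev, ll_curr, params_prev, params_curr,
              tol_ll=1e-4, tol_param=1e-3):
    """Declare convergence only when BOTH the log likelihood and the
    parameters have stopped moving; the likelihood alone can plateau
    while parameters are still changing sharply."""
    ll_flat = abs(ll_curr - ll_prev) < tol_ll
    params_flat = max(abs(a - b)
                      for a, b in zip(params_prev, params_curr)) < tol_param
    return ll_flat and params_flat
```

With this check, the situation in the figure, where the likelihood is flat after iteration ten but some parameters are still moving sharply, would not yet count as converged.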

So what we see here is the same graph of the training log likelihood per instance, but on the right-hand side we look at the log likelihood per instance on test instances, that is, instances that the algorithm was not trained on. And what we see is that, while the log likelihood of test instances also increases sharply up to about iteration ten, after that it starts to decrease gradually. And so, in fact, as the

algorithm is continuing to modify its parameters and slightly improve the

training log likelihood, the test log likelihood in this particular case

actually goes down. So this is an example of overfitting. In this case it's numerical overfitting to the specific data instances that we have, or to the noise in those data instances.

And in order to avoid that, there are two main strategies that people adopt. The first is to use a validation set, or cross-validation, to determine the time t at which we should stop doing iterations. So we basically measure the log likelihood on the validation set and say, okay, at this point we can see that it's starting to go down, and so we should just stop doing iterations.
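A sketch of this first strategy, with a hypothetical `patience` window (an assumption of this example, not from the lecture) for deciding that the held-out log likelihood has really started to go down:

```python
def best_stopping_iteration(val_ll_per_iter, patience=3):
    """Pick the iteration with the highest held-out log likelihood,
    stopping the scan once it has failed to improve for `patience`
    consecutive iterations."""
    best_t, best_ll, since_improved = 0, float("-inf"), 0
    for t, ll in enumerate(val_ll_per_iter):
        if ll > best_ll:
            best_t, best_ll, since_improved = t, ll, 0
        else:
            since_improved += 1
            if since_improved >= patience:
                break  # held-out likelihood has started to decrease
    return best_t
```

We would then keep the parameters from the returned iteration rather than the final one.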

This is potentially a computationally costly procedure, however, because evaluating the held-out log likelihood means that we need to run inference on those held-out instances, and that can be expensive.

So a cheaper strategy is to use some kind of MAP estimation rather than maximum likelihood estimation, with the same benefits that we saw when doing estimation for the complete data case.
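As an illustrative sketch of this idea: pseudocount smoothing in the spirit of MAP estimation with a symmetric Dirichlet prior, applied to the (expected) counts that an M step would produce.

```python
def map_smoothed(counts, alpha=1.0):
    """Estimate a discrete distribution from (expected) counts with a
    pseudocount alpha added to each value, in the spirit of a symmetric
    Dirichlet prior. The prior pulls the estimate toward uniform, making
    it less sensitive to noise in the counts."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]
```

For example, with expected counts `[0, 4]` and `alpha=1.0`, the smoothed estimate is `[1/6, 5/6]` rather than the maximum likelihood `[0, 1]`, so no event gets probability zero just because it happened not to appear.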

Basically a prior smooths out the estimates and makes them less susceptible

to noise in the training data. So that's one aspect of the behavior of

the expectation maximization algorithm. And by the way, the same phenomenon will also happen if we use gradient ascent, so this is really orthogonal in many ways to the specifics of the optimization algorithm, except for the very sharp increase at the beginning, which is very characteristic of EM. A very different problem that we already

mentioned before is the fact that both EM and gradient ascent effectively achieve

only a local optimum of the log likelihood function.

So what we see over here is, on the x axis, the number of samples m that we're training on. So this is not iterations anymore, this is the number of samples. And the y axis is the number of distinct local optima that the algorithm found over 25 iterations, sorry, 25 runs of the EM algorithm initialized at different random starting points. So what we see here is we have three

lines. The blue line represents 25% of missing

data. The red line is 50% missing data, so 50% of the entries were omitted at random.

And the black line is when you have a hidden variable.

So in this case, the fraction of missing data is actually not perhaps as large,

but there is one variable that we never get to observe in any of the instances.

And what we see, in both the blue line for the 25% and the red line for the 50%, is that initially we have a very large number of local optima. In fact, initially pretty much every run of EM finds a distinct local optimum. But then, as we have more and more training instances, a smaller number of optima come to the forefront, and the other ones kind of become less dominant or disappear.

Now, the more missing data we have, so in the red line, the slower this process of reducing the number of local optima.

And when we have hidden variables, which is this black line over here, the local

optima never go away. Now, you might say, fine, we have

local optima, but maybe we don't care. Maybe these local optima are all about

the same and it doesn't really matter which one we land at.

Unfortunately, that doesn't turn out to be the case. So here we have, on the x axis, the particular log likelihood value that might be achieved on the training set by a run of expectation maximization, and on the y axis, the percent of runs that achieve a particular log likelihood value.

And we can see that there is huge variation in the log likelihood values that are achieved by different runs of the expectation maximization algorithm. Which brings us to the point that

initialization is absolutely critical to the performance of an algorithm such as

expectation maximization. And there are different strategies that people adopt in order to try and get to a better local optimum.

So the first, and perhaps easiest to implement in some sense, is the use of multiple random restarts. So we randomly sample parameters, for example, and we initialize EM at these different points, run EM to convergence, and then pick the run of EM that has the highest training log likelihood, that is, the objective that we're trying to optimize.
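A sketch of the multiple-restarts strategy; here `run_em` stands in for a hypothetical routine that runs EM from a given initialization and returns the final parameters and training log likelihood.

```python
import random

def em_with_restarts(run_em, num_restarts=25, seed=0):
    """Run EM from several random initializations and keep the run with
    the highest training log likelihood. `run_em` is a hypothetical
    callable mapping an initial parameter vector to
    (final_params, final_log_likelihood)."""
    rng = random.Random(seed)
    best_params, best_ll = None, float("-inf")
    for _ in range(num_restarts):
        # Randomly sample an initialization (range is illustrative).
        init = [rng.uniform(-1.0, 1.0) for _ in range(2)]
        params, ll = run_em(init)
        if ll > best_ll:
            best_params, best_ll = params, ll
    return best_params, best_ll
```

Since each restart is independent, adding restarts can only improve (never hurt) the best training log likelihood found.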

In some cases, we might have some amount of prior knowledge that we might use to initialize the algorithm. For example, we can have some prior knowledge about the parameters, or, if we're initializing in the E step, we might have some prior knowledge about the assignment, say, of instances to clusters, from some other source, say demographic data for people, or something like that.

And finally, a common and quite effective initialization strategy is to use a simpler algorithm rather than full-blown EM, because it turns out that while EM is very powerful, that also makes it more susceptible to getting stuck in certain local optima. So some simpler algorithms, for example in the context of hidden variables, we might use a simpler clustering algorithm that you might have heard about in a different context, k-means or [INAUDIBLE] clustering, which is a simpler algorithm, not as powerful, but also somewhat less susceptible to local optima. And so we might initialize from that and then run EM to get a better output than these simpler algorithms might give rise to.
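As an illustrative sketch of that last strategy: a plain one-dimensional k-means whose cluster means could seed EM's initial component means (toy code, not the lecture's implementation).

```python
def kmeans_1d(xs, k=2, iters=20):
    """Plain 1-D k-means. Its cluster means can seed EM's initial
    component means; k-means is less powerful than EM but often less
    prone to bad local optima than a fully random start."""
    srt = sorted(xs)
    # Spread the initial centers across the sorted data
    # (assumes 2 <= k <= len(xs)).
    centers = [srt[i * (len(srt) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[nearest].append(x)
        # Update step: each center becomes its cluster's mean
        # (an empty cluster keeps its old center).
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return sorted(centers)
```

Running EM from these centers typically starts much nearer a good optimum than random parameter values would.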

So to summarize, we've seen that convergence of the likelihood function is

not the same as convergence of the parameters and so we have to decide which

one we actually care about. We've also seen that running the

algorithm to convergence, all the way to convergence, can lead to numerical

over-fitting. We talked about strategies to avoid this problem. We talked about local optima, and the

fact that they're unavoidable. And, furthermore, they increase with the

amount of missing data. So the more missing data we have, the

more local optima we have. And, furthermore, they decrease with the

total amount of data, but not if we have missing, not if we hidden variables, at

which point you're stuck with them no matter how much data you have.

We also showed that local optima are not a fake problem.

That is, which local optimum you land in can actually be a significant factor in the performance that the algorithm achieves in terms of log likelihood, and, therefore, smart initialization often turns out to be critical for the performance of algorithms that learn with latent variables and with missing data.