We described the expectation maximization algorithm and gave the formulas for how to use it, but we didn't talk about how it might perform in practice, so let's talk about that a little bit. First, let's look at how EM behaves as a function of the number of iterations. Here on the x axis we see the number of iterations, going from zero to 50, and on the y axis we have the average log likelihood per training instance, that is, the log likelihood averaged over all the training instances. This is the objective that the algorithm is trying to optimize; remember, EM is a maximum likelihood optimization algorithm. So what we're showing here is effectively how the objective improves, and the first observation confirms what we stated earlier as a theorem: the log likelihood increases consistently, monotonically, over the iterations t. So that's good; it shows our theorem holds. But let's understand what happens as the algorithm continues to run. We see that in the earlier iterations, up to about iteration eight or ten in this example, the log likelihood exhibits a very sharp improvement. But then it asymptotes, and so we might think that once we get to iteration ten or so, right around here, there's no point continuing to run, because there's no more increase: we've achieved convergence and might as well stop. It turns out that's not quite right. What we see here, for the same model, is another graph where the x axis is again the iteration number, but the y axis is the value of different parameters in the model. And at that same iteration 10, although the log likelihood no longer continues to improve, as we can see in the left-hand graph, various parameter values actually exhibit extremely sharp changes after iteration ten: specifically, the blue parameter goes down significantly and the black parameter goes up significantly.
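To make the monotonic-improvement claim concrete, here is a minimal sketch (not from the lecture) of EM for a two-component 1-D Gaussian mixture on synthetic data, tracking the average training log likelihood at each iteration. The data, initial parameters, and iteration count are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data from two Gaussians (a stand-in for the lecture's dataset).
x = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 1.0, 150)])

def em_gmm(x, n_iter=50):
    """EM for a 2-component 1-D Gaussian mixture; returns the per-iteration
    average training log likelihood, which the EM theorem says never decreases."""
    pi, mu, var = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])
    ll_trace = []
    for _ in range(n_iter):
        # E step: each component's weighted density at each point.
        p0 = pi * np.exp(-0.5 * (x - mu[0])**2 / var[0]) / np.sqrt(2 * np.pi * var[0])
        p1 = (1 - pi) * np.exp(-0.5 * (x - mu[1])**2 / var[1]) / np.sqrt(2 * np.pi * var[1])
        total = p0 + p1
        ll_trace.append(np.mean(np.log(total)))  # average log likelihood per instance
        r0 = p0 / total                          # responsibility of component 0
        # M step: re-estimate parameters from the expected sufficient statistics.
        pi = r0.mean()
        mu = np.array([np.sum(r0 * x) / r0.sum(),
                       np.sum((1 - r0) * x) / (1 - r0).sum()])
        var = np.array([np.sum(r0 * (x - mu[0])**2) / r0.sum(),
                        np.sum((1 - r0) * (x - mu[1])**2) / (1 - r0).sum()])
    return ll_trace

ll = em_gmm(x)
```

Plotting `ll` against the iteration index reproduces the qualitative shape described above: a sharp early rise followed by a plateau.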
And so we see that convergence in likelihood space, which is what we have on the left, is not the same as convergence in parameter space, which means that if we're trying to gauge convergence, we probably ought to look at convergence in parameter space. However, it's also important to realize that carrying EM all the way through to convergence is not necessarily the best idea. What we see here is the same graph of the training log likelihood per instance, but on the right-hand side we look at the log likelihood per instance on test instances, that is, instances that the algorithm was not trained on. What we see is that, while the log likelihood of the test instances also increases sharply up to about iteration ten, after that it starts to decrease gradually. In fact, as the algorithm continues to modify its parameters and slightly improve the training log likelihood, the test log likelihood in this particular case actually goes down. So this is an example of overfitting: in this case, numerical overfitting to the specific data instances that we have, or to the noise in those data instances. To avoid this, there are two main strategies that people adopt. The first is to use a validation set, or cross-validation, to determine the iteration T at which we should stop: we measure the log likelihood on the held-out set and, at the point where we see it starting to go down, we simply stop iterating. This is potentially a computationally costly procedure, however, because evaluating the held-out log likelihood means that we need to run inference on the held-out instances, and that can be expensive. A cheaper strategy is to use some kind of MAP estimation rather than maximum likelihood estimation, with the same benefits that we saw when doing estimation in the complete-data case: basically, a prior smooths out the estimates and makes them less susceptible to noise in the training data.
So that's one aspect of the behavior of the expectation maximization algorithm. And by the way, the same phenomena will also happen if we use gradient ascent, so this is really orthogonal in many ways to the specifics of the optimization algorithm, except for the very sharp increase at the beginning, which is very characteristic of EM. A very different problem, one that we already mentioned before, is the fact that both EM and gradient ascent effectively achieve only a local optimum of the log likelihood function. What we see over here on the x axis is the number of samples M that we're training on, so this is not iterations anymore, this is the number of samples; and on the y axis is the number of distinct local optima that the algorithm found over 25 applications of the EM algorithm initialized at different random starting points. We have three lines. The blue line represents 25% missing data. The red line is 50% missing data, so 50% of the entries were omitted at random. And the black line is the case where you have a hidden variable; in this case, the fraction of missing data is perhaps not as large, but there is one variable that we never get to observe in any of the instances. What we see, in both the blue line for 25% and the red line for 50%, is that initially we have a very large number of local optima; in fact, initially, pretty much every run of EM finds a distinct local optimum. But then, as we have more and more training instances, the algorithm becomes better able to identify the global optimum: a smaller number of optima come to the forefront, and the other ones become less dominant or disappear. The more missing data we have, as in the red line, the slower this process of reducing the number of local optima. And when we have hidden variables, which is the black line over here, the local optima never go away.
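The counting experiment can be mimicked on a much smaller scale. As an illustrative sketch (not from the lecture), the toy objective below is a stand-in for a log likelihood surface with several local maxima; we run crude gradient ascent from 25 random starting points, as in the experiment, and count how many distinct optima the runs converge to.

```python
import numpy as np

def local_maximize(f, theta, lr=0.01, n_iter=2000, eps=1e-6):
    """Crude gradient ascent with numerical gradients; which local optimum it
    reaches depends entirely on the starting point."""
    for _ in range(n_iter):
        grad = (f(theta + eps) - f(theta - eps)) / (2 * eps)
        theta += lr * grad
    return theta

# A stand-in multimodal objective (not a real likelihood).
f = lambda t: np.sin(3 * t) - 0.1 * t**2

rng = np.random.default_rng(0)
# 25 random restarts; rounding merges runs that converged to the same optimum.
optima = {round(local_maximize(f, rng.uniform(-4, 4)), 2) for _ in range(25)}
```

Here `len(optima)` is the analogue of the y axis in the plot: the number of distinct local optima found across the 25 restarts.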
Now, you might say: fine, we have local optima, but maybe we don't care. Maybe these local optima are all about the same, and it doesn't really matter which one we land at. Unfortunately, that turns out not to be the case. Here we have on the x axis a particular log likelihood value that might be achieved on the training set by a run of expectation maximization, and on the y axis, the percent of runs that achieve that particular log likelihood value. And we can see that there is huge variation in the log likelihood values achieved by different runs of the expectation maximization algorithm. Which brings us to the point that initialization is absolutely critical to the performance of an algorithm such as expectation maximization, and there are different strategies that people adopt in order to try to reach a better local optimum. First, and perhaps easiest to implement in some sense, is the use of multiple random restarts: we randomly sample parameters, initialize EM at these different points, run each to convergence, and then pick the run of EM that has the highest training log likelihood, which is, after all, the objective that we're trying to optimize. In some cases, we might have some amount of prior knowledge that we can use to initialize the algorithm. For example, we might have prior knowledge about the parameters, or, if we're initializing on the E step, we might have prior knowledge about the assignment of instances to clusters from some other source, say, demographic data for people or something like that.
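The multiple-random-restarts strategy is simple to sketch. As before, this is an illustrative toy (not from the lecture): the multimodal function stands in for a training log likelihood, and the selection rule, keeping the run with the highest objective value, is exactly the one described above.

```python
import numpy as np

# Stand-in for the training log likelihood as a function of a 1-D parameter.
f = lambda t: np.sin(3 * t) - 0.1 * t**2

def ascend(theta, lr=0.01, n_iter=2000, eps=1e-6):
    """Gradient ascent with numerical gradients, run to (approximate) convergence."""
    for _ in range(n_iter):
        theta += lr * (f(theta + eps) - f(theta - eps)) / (2 * eps)
    return theta

rng = np.random.default_rng(1)
runs = [ascend(rng.uniform(-4, 4)) for _ in range(25)]  # 25 random restarts
best = max(runs, key=f)   # keep the run with the highest training objective
```

Different entries of `runs` end at different local optima with very different objective values, which is exactly why the final `max` step matters.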
And finally, a common and quite effective initialization strategy is to use a simpler algorithm rather than full-scale EM. It turns out that while EM is very powerful, that power also makes it more susceptible to getting stuck in certain local optima, whereas some simpler algorithms are less so. For example, in the context of missing or hidden variables, we might use a simpler clustering algorithm that you might have heard about in a different context, such as k-means or [INAUDIBLE] clustering. These are simpler algorithms, not as powerful, but also somewhat less susceptible to local optima, and so we might initialize from their output and then run EM to get a better result than these simple algorithms alone might give rise to. So, to summarize: we've seen that convergence of the likelihood function is not the same as convergence of the parameters, and we have to decide which one we actually care about. We've also seen that running the algorithm all the way to convergence can lead to numerical overfitting, and we talked about strategies to avoid this problem. We talked about local optima and the fact that they're unavoidable; furthermore, they increase with the amount of missing data, so the more missing data we have, the more local optima we have. They decrease with the total amount of data, but not if we have hidden variables, at which point we're stuck with them no matter how much data we have. We also showed that local optima are not a fake problem: which local optimum you land in can actually be a significant factor in the performance that the algorithm achieves in terms of log likelihood, and therefore smart initialization often turns out to be critical for the performance of algorithms that learn with latent variables and with missing data.
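The k-means initialization idea can be sketched as follows. This is an illustrative assumption about how one might wire it up, not the lecture's recipe: a few iterations of 1-D Lloyd's k-means produce hard cluster assignments, and those assignments seed the mixture weights, means, and variances from which EM would then start.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data from two well-separated Gaussians (illustrative).
x = np.concatenate([rng.normal(-2.0, 1.0, 100), rng.normal(3.0, 1.0, 100)])

def kmeans_init(x, k=2, n_iter=10):
    """A few iterations of 1-D k-means (Lloyd's algorithm); its hard
    assignments seed the mixture parameters that EM would then refine."""
    centers = rng.choice(x, size=k, replace=False)
    for _ in range(n_iter):
        assign = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        centers = np.array([x[assign == j].mean() for j in range(k)])
    # Turn hard clusters into initial mixture parameters for EM.
    pi = np.array([(assign == j).mean() for j in range(k)])   # mixing weights
    var = np.array([x[assign == j].var() for j in range(k)])  # per-cluster variances
    return pi, centers, var

pi0, mu0, var0 = kmeans_init(x)
```

EM initialized at `(pi0, mu0, var0)` starts near a sensible optimum, rather than wherever a random draw of parameters happens to land.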