We previously defined the expectation maximization algorithm, which, as it happens, is one of the algorithms most commonly used in practice, because of its simplicity and because it is so effective at dealing with missing values. But we didn't give much justification for why the algorithm works. So what we're going to do now is give at least a little bit of theoretical intuition as to how the formulas used in the algorithm are arrived at. Let's start by providing a somewhat more formal intuition. In some sense, EM, like gradient ascent, is a local search procedure, but unlike gradient ascent, it uses a less local approximation to the likelihood function. Specifically, this is our likelihood function as we had before, or rather the log likelihood, really, in practice. We're going to take the current point that we're at, because remember this is a local, iterative search, and approximate the function at that point using this blue line over here. Now, gradient ascent approximated the function at this point using a straight line, a linear approximation. EM approximates the function at this point using this nice curvy function, which, as it turns out, and as we'll see, is a log likelihood itself: it is, in fact, the log likelihood with complete data. And having constructed this approximation, EM now jumps to the maximum of this function, which, because it is a log likelihood with complete data, can be maximized in closed form using maximum likelihood estimation. So this is the intuition behind the EM algorithm. Let's see if we can formalize it a little bit. First, some notation: D is the observed data in a given instance, and we're going to consider one instance for the moment.
H is the hidden variables in that instance, and we're going to assume that, somehow, from somewhere, we have a distribution Q over the values of the hidden variables within that instance. There's no distribution over D, because D is observed; but for the hidden variables, we have some kind of distribution over the values that they might take. So now let's remind ourselves what the log likelihood function looks like when we have complete data. The log likelihood for the parameters theta, given a pair (d, h), where h is an assignment to the hidden variables H, is, simply by the definition of the log likelihood and the decomposition of the likelihood via the chain rule for Bayesian networks, a sum over the variables X_i in the network, and a sum over all possible assignments to X_i and its parents U_i, of an indicator function that asks: in this particular pair (d, h), do the variables X_i and U_i take on these particular values, little x_i and little u_i? And if so, we multiply in the log of the parameter corresponding to that entry in the CPD. Now, this is a really complicated way, you might think, to write what is effectively a sum of the log of the parameter of each node given its parents in the assignment (d, h), but there is a reason why we're going to write it this way, which will become clear in a moment. So now let's imagine that what we have is a distribution Q over H, and we're going to look at the expectation of this log likelihood function relative to that distribution. So rather than h being given, we have a distribution over H, and we look at the average, or expectation, of the log likelihood relative to the distribution Q. Now, importantly, because of the linearity of expectation, the expectation pushes its way past the first summation, the sum over i, and past the second summation, the sum over assignments little x_i and little u_i.
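The steps just described can be sketched in the lecture's notation, where (d,h)⟨X_i, U_i⟩ denotes the values that the assignment (d, h) gives to X_i and its parents U_i, and θ_{x_i|u_i} is the corresponding CPD entry:

```latex
% Complete-data log likelihood for a single instance (d, h):
\ell(\theta : d, h)
  = \sum_i \sum_{x_i, u_i}
    \mathbf{1}\!\left\{ (d,h)\langle X_i, U_i \rangle = (x_i, u_i) \right\}
    \log \theta_{x_i \mid u_i}

% Linearity of expectation pushes E_Q past both sums; the term
% \log\theta_{x_i\mid u_i} factors out because it does not depend on h:
\mathbb{E}_{Q}\!\left[ \ell(\theta : d, H) \right]
  = \sum_i \sum_{x_i, u_i}
    \mathbb{E}_{Q}\!\left[ \mathbf{1}\{ (d,H)\langle X_i,U_i\rangle = (x_i,u_i) \} \right]
    \log \theta_{x_i \mid u_i}
```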
And we end up with the following equation, which is the expectation, relative to Q, of the indicator function that we had on the previous slide. And notice that we can take log theta of x_i given u_i out of the expectation, because that term is fixed relative to h, which is exactly why we wrote the formula in this complicated form. The one final step of this derivation is to simplify this expectation relative to Q. When you take the expectation of an indicator function, which is simply the indicator of an event, in this case the event that X_i = x_i and U_i = u_i, that expectation is simply the probability of the event. So this entire expectation turns out to be equal to the probability of little x_i and little u_i under the distribution Q. And so we have now converted this expected log likelihood into a summation that uses probabilities relative to the distribution Q. Working forward from this formula, let's now consider the case where we have multiple data instances. So now we have a set of distributions, one for each data instance, that also depends on the current parameter setting theta t, so we're now in the midst of an iteration. We're going to use the current parameter setting theta t to define the probability distribution over the hidden variables, and that's going to be defined as the probability of the hidden variables given the observed variables and given the current setting of the parameters theta t. We now look at the sum of these expected log likelihoods: the sum, over all the data instances m = 1 to big M, of the expectation of this log likelihood relative to the distribution Q t sub m that we just defined.
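These two steps, replacing the expected indicator by a probability and then summing over instances, can be sketched as:

```latex
% Expectation of an indicator is the probability of the event:
\mathbb{E}_{Q}\!\left[ \mathbf{1}\{ X_i = x_i,\, U_i = u_i \} \right]
  = Q(x_i, u_i),
\quad\text{so}\quad
\mathbb{E}_{Q}\!\left[ \ell(\theta : d, H) \right]
  = \sum_i \sum_{x_i, u_i} Q(x_i, u_i)\, \log \theta_{x_i \mid u_i}.

% With multiple instances, each distribution is defined from the
% current parameters: Q^t_m(H) = P(H \mid d_m, \theta^t), and we sum
% the expected log likelihoods over the instances:
\sum_{m=1}^{M} \mathbb{E}_{Q^t_m}\!\left[ \ell(\theta : d_m, H) \right].
```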
We can now plug in the equation that we had up here, substituting for Q its definition, P of H given d_m and theta t, and that gives us this equation over here: the sum over i, the sum over little x_i and little u_i (all possible assignments to the family), and the sum over the data instances m, of the probability of x_i and u_i given the data instance and the current parameters, times the log of the parameter value. Taking this one step further, and looking at this summation over m as a separate entity, we can see that it is exactly the expected sufficient statistics that we defined in the context of the EM algorithm. And so this expected log likelihood function is the sum over i, and the sum over little x_i and u_i, of the expected sufficient statistic for x_i, u_i, times the log parameter value. The important point here is that this is just like a log likelihood function for complete data, using the expected sufficient statistics. So the function that we're optimizing is this expected log likelihood, and this simple mathematical derivation tells us that optimizing it is the same as optimizing a log likelihood function using these expected sufficient statistics. And we know exactly what that optimum looks like, because it's the same as the standard maximum likelihood estimation problem. And so, going back to the diagram that we started out with: the blue function that we're using here as an approximation is this expected log likelihood, and the optimization that we're doing here basically uses the expected sufficient statistics to do maximum likelihood estimation, which is exactly the M step. So the E step constructs this function by computing the expected sufficient statistics, and the M step gives us its maximum, from which we proceed to the next iteration. Using this kind of intuition, we can actually prove two very important results regarding the expectation maximization algorithm.
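The E step/M step structure just described can be illustrated on a toy model. This is only a sketch, not part of the lecture: a mixture of two biased coins, where each instance is 10 flips of one coin and the hidden variable is which coin was used; all numbers are made up for illustration. The E step computes expected sufficient statistics (expected heads and tails credited to each coin under the posterior), and the M step does closed-form maximum likelihood estimation from those expected counts.

```python
import math

# Hypothetical data: heads counts out of N = 10 flips per instance.
# Which of the two coins produced each instance is hidden.
data = [9, 8, 7, 1, 2, 1]
N = 10

def binom_pmf(k, n, p):
    """Binomial probability of k heads in n flips with bias p."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def em_step(data, pi, p1, p2):
    """One EM iteration: E step builds expected sufficient statistics,
    M step re-estimates parameters in closed form (complete-data MLE)."""
    # E step: Q_m(H) = P(H | d_m, theta_t) for each instance m.
    n1 = h1 = t1 = h2 = t2 = 0.0
    for k in data:
        w1 = pi * binom_pmf(k, N, p1)
        w2 = (1 - pi) * binom_pmf(k, N, p2)
        r1 = w1 / (w1 + w2)       # posterior that coin 1 was used
        n1 += r1
        h1 += r1 * k              # expected heads credited to coin 1
        t1 += r1 * (N - k)        # expected tails credited to coin 1
        h2 += (1 - r1) * k
        t2 += (1 - r1) * (N - k)
    # M step: maximum likelihood estimation from the expected counts.
    return n1 / len(data), h1 / (h1 + t1), h2 / (h2 + t2)

pi, p1, p2 = 0.5, 0.6, 0.4        # arbitrary starting point theta_0
for _ in range(20):
    pi, p1, p2 = em_step(data, pi, p1, p2)
```

On this data, the iterations separate the two coins, pulling p1 toward the high-heads instances and p2 toward the low-heads ones.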
The first is that the likelihood function, or the log likelihood function, improves at every iteration. That is, the log likelihood at theta t plus one is greater than or equal to the log likelihood at theta t, and correspondingly for the likelihood. The second result that we can prove is that at convergence, that is, at the point where theta t plus one is unchanged relative to theta t, we are guaranteed that theta t is a stationary point of the likelihood function. What does a stationary point mean? It means a point where the gradient is zero, which in principle can be either a local minimum, a local maximum, or a saddle point. But it's very difficult, when you're doing hill climbing, to actually land exactly at a local minimum, because that's the only way we would converge to a local minimum. So, looking at the function, it's very difficult to imagine an M step that lands us exactly at this local minimum, because that's the only point at which we wouldn't continue to make progress. And so, de facto, when we're doing this kind of hill climbing and we achieve convergence, it means that we have achieved a local maximum of the likelihood function, which is the same guarantee, unsatisfying as it might be, that we have for gradient ascent.
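The first result, monotone improvement, can be checked numerically. A self-contained sketch on the same kind of toy coin mixture (again with made-up numbers): track the observed-data log likelihood, which marginalizes out the hidden variable, across EM iterations and confirm it never decreases.

```python
import math

# Toy mixture of two coins, 10 flips per instance; the coin is hidden.
data = [9, 8, 7, 1, 2, 1]
N = 10

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def log_likelihood(data, pi, p1, p2):
    """Observed-data log likelihood log P(D | theta), with H summed out."""
    return sum(math.log(pi * binom_pmf(k, N, p1)
                        + (1 - pi) * binom_pmf(k, N, p2))
               for k in data)

pi, p1, p2 = 0.5, 0.55, 0.45
lls = [log_likelihood(data, pi, p1, p2)]
for _ in range(30):
    # E step: expected sufficient statistics under P(H | d_m, theta_t).
    n1 = h1 = t1 = h2 = t2 = 0.0
    for k in data:
        w1 = pi * binom_pmf(k, N, p1)
        w2 = (1 - pi) * binom_pmf(k, N, p2)
        r1 = w1 / (w1 + w2)
        n1 += r1
        h1 += r1 * k
        t1 += r1 * (N - k)
        h2 += (1 - r1) * k
        t2 += (1 - r1) * (N - k)
    # M step: closed-form MLE from the expected counts.
    pi, p1, p2 = n1 / len(data), h1 / (h1 + t1), h2 / (h2 + t2)
    lls.append(log_likelihood(data, pi, p1, p2))

# Theorem 1: the log likelihood never decreases across iterations
# (a small tolerance absorbs floating-point noise).
monotone = all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))
```

This only demonstrates the guarantee on one example, of course; the theorem itself holds for every model and every starting point.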