We talked about the problem of learning models with missing data, and discussed some of the complexities of that learning problem. So now we're going to bite the bullet and try to understand how one might go about doing, say, parameter estimation in the context of missing data nevertheless. Let's remind ourselves of one of the more important difficulties in learning parameters with missing data. Let's draw the likelihood function that we might have. The X-axis here is a pictorial depiction of the parameter vector theta. It's a one-dimensional depiction, but usually, of course, theta will be multi-dimensional. And what we see on the Y-axis is the likelihood function. In the context of complete data, as we might remember, the likelihood function had a really nice form: it was a nice log-concave function, and in the context of Bayesian networks it could actually be optimized in closed form using a simple formula. In the context of missing data, however, the situation becomes considerably more complicated, and the likelihood function looks more like this: a nasty, complicated, multimodal function, which is really impossible to optimize in closed form. So how do you go about optimizing a function that looks like this? There are, it turns out, two general classes of strategies. The first is to use a generic optimization method, such as gradient ascent. So here, for example, we have a pictorial depiction of gradient ascent. We might start out with a particular point in the parameter space, compute the gradient at that point, and go in the direction of steepest ascent, which might take us to the next point, at which point we compute the gradient again, and we continue until we reach what is generally going to be a local optimum of the function that we're trying to optimize. Now, what I described just now is a very simple gradient ascent over parameter space.
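To make the gradient ascent picture concrete, here is a minimal sketch in Python; the toy objective, step size, and iteration count are my own illustrative choices, not anything from the lecture.

```python
# A minimal sketch of plain gradient ascent on a one-dimensional
# objective.  The objective and hyperparameters below are illustrative.

def gradient_ascent(grad, theta0, step=0.1, iters=200):
    """Repeatedly step uphill along the gradient from theta0."""
    theta = theta0
    for _ in range(iters):
        theta = theta + step * grad(theta)
    return theta

# Toy concave objective f(theta) = -(theta - 3)^2, whose gradient is
# -2 * (theta - 3); the unique maximum is at theta = 3.
theta_star = gradient_ascent(lambda t: -2.0 * (t - 3.0), theta0=0.0)
```

On a concave function like this toy one, plain gradient ascent finds the global maximum; on the multimodal likelihood surfaces discussed above, the same loop would only reach a local optimum that depends on the starting point.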
But, in general, there are more efficient methods for improving the rate of convergence of these algorithms — methods such as line search or conjugate gradient — that basically get us faster to whatever local optimum we're going to end up at. So a key question when applying gradient ascent to a function such as this is how difficult it is to compute the gradient in this high-dimensional space. Fortunately, in the context of Bayesian networks we can actually come up with a closed-form solution for that gradient, and that solution takes the following form. Consider the gradient of our log-likelihood function relative to a particular parameter, theta of little x_i given u_i — this is in the context of a standard, say, multinomial Bayesian network, where little x_i is a particular assignment to big X_i and little u_i is an assignment to its parents. That gradient is one over the value of the parameter at the point at which we're computing the gradient, times the summation over data instances m of the probability of that particular joint assignment x_i, u_i to the node and its parents, given all of the evidence in the m-th instance, at the current value of the parameters: (1 / theta_{x_i | u_i}) times the sum over m of P(x_i, u_i | d[m], theta). So we basically look at each and every one of the data instances, compute the probability of this event, x_i comma u_i, given whatever observations we happen to have seen — whether x_i and u_i are themselves observed or not — then add all of those probabilities up and divide by the value of the parameter. So that's a nice, usable formula, except that we have to understand some of its repercussions. Specifically, this requires that we compute this probability over a node and its parents — that is, a family — given the data instance, and we need to compute that for all variables i and for all data instances m. So this effectively means that we have to run inference for every instance.
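Here is a hedged sketch of that gradient formula for a tiny two-variable network U → X with table CPDs. The CPD values and data instances are illustrative assumptions, and the posterior is computed by brute-force enumeration purely for clarity; a real implementation would get these family posteriors from clique tree calibration, as discussed next.

```python
# Sketch: gradient of the log-likelihood w.r.t. one table-CPD entry
# in a toy network U -> X, with U sometimes missing in the data.
import itertools

# Illustrative CPDs: P(U) and P(X | U), both variables binary.
p_u = {0: 0.6, 1: 0.4}
p_x_given_u = {(0, 0): 0.7, (1, 0): 0.3,   # keyed by (x, u)
               (0, 1): 0.2, (1, 1): 0.8}

def joint(x, u):
    return p_u[u] * p_x_given_u[(x, u)]

def posterior_xu(x, u, evidence):
    """P(X=x, U=u | evidence) by enumeration; evidence maps 'X'/'U'
    to observed values, and missing variables are simply absent."""
    num = den = 0.0
    for xx, uu in itertools.product([0, 1], [0, 1]):
        if all({'X': xx, 'U': uu}[v] == val for v, val in evidence.items()):
            p = joint(xx, uu)
            den += p
            if xx == x and uu == u:
                num += p
    return num / den

# Toy data: U is missing in the first instance, observed in the second.
data = [{'X': 1}, {'X': 0, 'U': 0}]

# d log L / d theta_{x=1|u=1} = (1/theta) * sum_m P(X=1, U=1 | d[m])
x, u = 1, 1
grad = sum(posterior_xu(x, u, d) for d in data) / p_x_given_u[(x, u)]
```

Note how the second instance, where X=0 and U=0 are fully observed, contributes zero probability to the event (X=1, U=1), while the first instance contributes a genuine posterior computed by inference — exactly the "run inference for every instance" cost described above.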
And furthermore, we need to compute the probability for every family in the Bayesian network. Now, fortunately, we can at least reduce the second part of this computational cost — the fact that we need to run this for all i — by remembering that if we, say, calibrate a clique tree or cluster graph, then when we're done with the calibration, because of the family preservation property, a node and its parents are all going to be in the same clique or the same cluster. So a single calibration process is going to give us all of these probabilities at once. But that's only true for a given data instance m, which means that effectively we end up with the following computational cost: we need to run inference for each and every data instance at every iteration of the gradient process, which is a very costly computation. Now, what are the pros of gradient ascent as an optimization algorithm? It's a very flexible strategy. Whereas we just showed how the formula applies to table CPDs, it can also be extended to non-table CPDs via the chain rule for derivatives — which is not the same as the chain rule for Bayesian networks. However, it also has some significant cons. First, it turns out to be a little bit tricky — not horrible, but a little bit tricky — to ensure that the parameters define legal CPDs. This gives rise to a constrained optimization problem, where we need to make sure that the entries in each CPD are all greater than or equal to zero and sum to one, and that imposes some additional burden on the optimization problem. And then it turns out that if you want to get any kind of reasonable convergence, you can't just do straight gradient ascent; one needs to do something more clever, like conjugate gradient with line search, which again tends to increase the computational cost.
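One common way to handle the legal-CPD constraint — an assumption on my part, since the lecture only names the constrained optimization problem — is to reparameterize each CPD distribution through a softmax, so that unconstrained gradient steps always map back to entries that are nonnegative and sum to one:

```python
# Sketch: softmax reparameterization of one CPD column, so gradient
# ascent can run over unconstrained real-valued scores.
import math

def softmax(scores):
    """Map unconstrained reals to a legal probability vector."""
    m = max(scores)                       # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

probs = softmax([2.0, 0.0, -1.0])         # a legal distribution
```

The price of this trick is that the gradient with respect to the new scores must be obtained from the CPD-entry gradient via the chain rule for derivatives, in the same spirit as the non-table-CPD extension mentioned above.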
The second strategy, the one one typically uses for optimizing the likelihood function, is a strategy that is specifically geared for likelihood functions. It's a special-purpose algorithm that doesn't optimize any old function, but rather specifically likelihood functions. What's the intuition behind this algorithm, called expectation maximization, or EM for short? The intuition is that we really have a chicken-and-egg problem here. If we had complete data, then parameter estimation would be easy, because we know how to do maximum likelihood parameter estimation in Bayesian networks; in fact, it's a closed-form problem. Conversely, if we had the full set of parameters, then computing the probability of the missing data would also be easy — easy in the sense that it can be done using straight probabilistic inference. That may or may not be computationally easy, but conceptually, computing the probability of the missing data is an easy, well-defined problem. So we have two sets of things that we need to infer or estimate: one is the parameters, and the other is the values of the missing variables — the missing data — and each of those is easy given the other. And so what the EM method does is effectively an iterative process — in fact, one can formally show it's a coordinate ascent process — where we start out with one set of parameters, use that to estimate the values of the missing data, and then use that completion to re-estimate the parameters, and so on and so forth, until we reach convergence. More formally, we pick some starting point for the parameters. In fact, one can also initialize with values for the missing data — it's more standard in many applications to initialize with the parameters, but both are completely legitimate. So say we pick a starting point for the parameters, and then we iterate these two steps.
The first is what's called the E-step — E stands for expectation — in which we complete the data using the current parameters. And the second is the M-step, standing for maximization, which estimates the parameters relative to that data completion; the "maximization" comes from maximum likelihood estimation — that's where the M comes from. And it turns out that this iterative process is guaranteed to improve the likelihood function at each iteration — that is, at each iteration, the likelihood is going to be at least as good as it was at the previous one — so this is an iterative improvement process. Let's make this more precise by considering the E-step and the M-step separately. The expectation, or E-step — remember, this is going from the parameters to the missing variables — works as follows. For each data case m, and each family X, U, we compute the probability of X, U given the observed values in that data instance and given the current parameter setting theta^t. Note that this completion is a soft completion, because it's a probability distribution over the missing variables, not a single assignment to X and U. With that soft completion, we're going to compute what are called the expected sufficient statistics. The expected sufficient statistics are very similar to the sufficient statistics that we had before. When we were computing sufficient statistics, we were counting the number of times that we saw a particular combination x comma u — remember, we had M[x, u] as our sufficient statistics. Now we have a soft completion, and so we have expected, or soft, sufficient statistics, which we denote with an M-bar, subscripted by the parameter vector that gave rise to the completion. And M-bar sub theta^t, for a particular assignment x comma u, is simply the sum, over all the data instances, of the probability that we just computed.
That is, M-bar sub theta^t of x, u equals the sum over m of P(x, u | d[m], theta^t) — the probability of x comma u given the assignment d[m] and the parameters theta^t. That is our notion of expected counts, or expected sufficient statistics, which we're now, in the context of the M-step, going to use as if they were real sufficient statistics. And so we're going to take these expected sufficient statistics and use them to perform maximum likelihood estimation, using the exact same formula that we would have used had these been real sufficient statistics. So specifically, for example, in the context of multinomial Bayesian networks, we would have that the next value of the parameters, theta^{t+1} of x given u, is M-bar sub theta^t of x, u divided by M-bar sub theta^t of u — which, using the soft counts, is the fraction of x, u among the u's. Let's consider a concrete, very simple example: Bayesian clustering. What we see here is your standard Naive Bayes model, where we have a single class variable and a bunch of observed features, and the assumption here is that the class variable is missing — that is, sometimes, or in most cases never, observed in the data. So effectively, we're trying to identify the categories or classes of instances in this data, under the assumption that within each class the feature values are independent — that is, the features are conditionally independent given the class, as in the Naive Bayes model. So, instantiating our formulas from the EM algorithm for this setting, we get the following very natural formulas. The expected sufficient statistic for the class variable is simply the sum, over all of the data instances, of the probability of the class variable given the observed features for that instance and the current value of the parameters.
That is, it's the soft count of how many instances fall into each category, when each instance is allowed to divide itself up across multiple categories — that is, to use a soft assignment. The second set of sufficient statistics, which represents the dependency between one of the features and the class variable, is M-bar sub theta of, say, x_i comma c. That, once again, sums over all the data instances, and simply looks at the probability of a particular class value and a particular feature value given the observed features of that particular instance. Now, with these expected counts, we can go ahead and do maximum likelihood estimation, and that gives rise to the following very simple equations: theta of c is equal to M-bar sub theta of c divided by M, which is the total number of data instances; and theta of little x_i given c is M-bar sub theta of x_i comma c divided by M-bar sub theta of c — which, again, is the standard formula, but using soft, rather than hard, counts. Summarizing the EM algorithm: we have, in this case, the exact same problem that we had in the context of the gradient ascent algorithm, which is that we need to run inference — to compute the probability of x_i, c — for each data instance at each iteration. As for the gradient ascent algorithm, this only requires a single calibration per instance, because the same family preservation property that was useful there applies equally here. What are the pros of the EM algorithm? It's very easy to implement on top of an existing maximum likelihood estimation procedure that one might have for complete data. So, having implemented an MLE procedure, one can easily build on top of that by adding a separate E-step for computing the expected sufficient statistics.
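Putting the E-step and M-step together, here is a minimal sketch of EM for Bayesian clustering with a binary-class, binary-feature Naive Bayes model; the toy data, the symmetry-breaking initialization, and the fixed iteration count are illustrative assumptions rather than anything from the lecture.

```python
# Sketch: EM for Bayesian clustering with a Naive Bayes model.
# C is a hidden binary class; X1..Xn are observed binary features.

def e_step(data, pi, theta):
    """Soft completion: P(C=c | x[m], theta) for every instance m."""
    posts = []
    for x in data:
        unnorm = []
        for c in (0, 1):
            p = pi[c]                       # P(C=c)
            for i, xi in enumerate(x):      # times P(x_i | c)
                p *= theta[c][i] if xi == 1 else (1.0 - theta[c][i])
            unnorm.append(p)
        z = sum(unnorm)
        posts.append([p / z for p in unnorm])
    return posts

def m_step(data, posts):
    """MLE from the expected sufficient statistics (soft counts)."""
    M, n = len(data), len(data[0])
    # M-bar[c]: expected number of instances in class c.
    class_counts = [sum(q[c] for q in posts) for c in (0, 1)]
    pi = [class_counts[c] / M for c in (0, 1)]
    # theta[c][i] = M-bar[x_i=1, c] / M-bar[c].
    theta = [[sum(q[c] for x, q in zip(data, posts) if x[i] == 1)
              / class_counts[c] for i in range(n)] for c in (0, 1)]
    return pi, theta

# Toy data with two evident clusters: mostly-ones vs mostly-zeros.
data = [(1, 1, 1), (1, 1, 0), (1, 1, 1), (0, 0, 0), (0, 0, 1), (0, 0, 0)]
pi = [0.5, 0.5]
theta = [[0.6, 0.6, 0.6], [0.4, 0.4, 0.4]]   # break symmetry slightly
for _ in range(50):
    posts = e_step(data, pi, theta)
    pi, theta = m_step(data, posts)
```

After a few dozen iterations, the soft assignments become nearly hard: the mostly-ones instances concentrate in one class and the mostly-zeros instances in the other. Note that a perfectly symmetric initialization would be a (bad) fixed point, which is why the starting theta breaks the symmetry.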
Empirically, it turns out that EM also makes very rapid progress in terms of converging to a better and better set of parameters, especially in early iterations, and we'll see some empirical results for that in a bit. The con of the EM algorithm is that the convergence generally slows down at later iterations, and so it sometimes requires that later iterations use, say, some kind of conjugate gradient algorithm, because it turns out that that's more effective there.