Perhaps the most common use of the expectation maximization algorithm is for the problem of learning with latent variables. These are situations where there is a certain set of variables that we just never get to observe, but that turn out to be important for capturing some important latent structure in the distribution of the data instances that we have. So let's look at a number of these examples and talk about how EM was applied in those settings. Perhaps the simplest example is in the context of Bayesian clustering, where we have some kind of latent class variable and some set of observed features, and we're trying to basically segment the instances that we have into categories, where we assume that for each category there is a similar distribution over the observed features. Specifically, in many of these applications, that distribution takes the form of a naive Bayes model; that is, the features are independent once we're given the class. So one example application of this is to segmenting a set of users into similarity groups. One cute example of that is the following segmentation of users on the MSNBC website, based on which of the stories on the MSNBC site the users chose to click; this is due to Jack Breese and colleagues at Microsoft Research. Here we're going to see four clusters. The first of these is a cluster that a human gave a name to, because the machine doesn't give a name to the cluster, it just calls it cluster one. It turns out to be readers of commerce and technology stories. These are people for whom the highest probability clicks were on stories such as "E-mail delivery isn't exactly guaranteed" or "Should you buy a DVD player?" So remember, the stories are the features, and they're binary variables: did you click or did you not click on that particular article. And the cluster is the segmentation of the users into these different categories, so this is category one.
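As a concrete sketch of the clustering setup just described, a latent class variable with binary click features that are independent given the class, here is a minimal EM routine for a mixture of Bernoullis. All function and variable names are illustrative; this is not the actual system used in the MSNBC study.

```python
import numpy as np

def em_bernoulli_mixture(X, K, n_iters=50, seed=0):
    """EM for a naive Bayes (mixture-of-Bernoullis) clustering model.
    X: (N, D) binary matrix, e.g. user-by-story click indicators.
    K: assumed number of latent user clusters."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                 # class prior P(C = k)
    theta = rng.uniform(0.25, 0.75, (K, D))  # P(feature d = 1 | C = k)
    for _ in range(n_iters):
        # E-step: responsibilities r[n, k] = P(C = k | x_n), in log space
        log_p = (np.log(pi)[None, :]
                 + X @ np.log(theta).T
                 + (1 - X) @ np.log(1 - theta).T)
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from soft counts (Laplace smoothing
        # keeps theta away from 0/1 so the logs stay finite)
        Nk = r.sum(axis=0)
        pi = Nk / N
        theta = (r.T @ X + 1.0) / (Nk[:, None] + 2.0)
    return pi, theta, r
```

The responsibilities `r` are the soft segmentation of users into categories; reading off the highest-probability features of each learned `theta[k]` is what gives clusters like "commerce and technology readers" their human-assigned names.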
The second cluster are people who primarily clicked on sports stories. The highest probability features for this user group were stories such as "Cowboys are reborn in win over Eagles," or something like that. The third cluster are the people who went to the top page and basically clicked on the top stories; this is 29% of the users. And one can validate this by looking at the stories; at least this one seems pretty perennial. So that's the third category. The last one was the surprising one. These are people who traversed all over the site. For the previous three categories there were actual pages that existed, for sports, for top promoted stories, and for commerce and technology. This last category are people who went all over the site to look for stories that you might call softer news. And so this was an interesting discovery for the MSNBC website about its user population, and it suggested that the website might be better redesigned to capture this kind of structure and give these users a page where they can find these articles. A very different application is to speech recognition, which is one that we've talked about before. In the context of speech recognition, we've already discussed that the standard representation of choice is a hidden Markov model, and in a hidden Markov model we have hidden variables and we have parameters. The parameters are the parameters of the HMM: for example, the transition probabilities between one phone and the other, or the probability over the acoustic signal that we observe for different pieces, or states, within a phone. The latent variables are the segmentation of the initial acoustic signal, that is, which state each piece of it belongs to: which phone, and which part of which phone. Those are the hidden variables. And that's an awful lot of hidden variables, because if we are just given an undifferentiated, unsegmented signal, it's very difficult to assign the pieces.
That is, it is hard to attribute a small slice of that speech to one phone versus another. So in order to train a speech recognition HMM well, the standard strategy is a bottom-up bootstrap training, where one first trains the phone HMMs separately, for each phone, using a phone database. Now, this too is done using EM, because the breakdown within a phone into these little constituent pieces is not labeled. So it still requires that we use EM to address the issue of latent variables, but at least it's now self-contained, because you're only training a model for a single phone, like "puh." With that model trained, one can now take entire words and train the higher-level model, initialized with the models trained on individual phones. And one still retrains the phone HMM parameters in the context of this larger training regime, where one trains on entire words; but the fact that one seeded the model with this much more correct initialization of the parameters allows the segmentation in the E step to be performed moderately correctly, and gives rise to a much better local optimum in the speech recognition problem. A yet different application is that of 3D robot mapping. This is an application due to Thrun et al. Here the input to the problem is a sequence of point clouds, obtained by a laser range finder, that were collected by a moving robot, and the target output is a 3D map of the environment as a set of planes; you will see why we want the planar map when we look at the demo. Here the parameters of the model are the locations and angles of the walls, the planes in the environment, so we have no idea a priori where the walls are or how they're situated. The latent variables are association variables, which assign points to walls.
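To make the parameters-versus-latent-variables split concrete, here is a minimal hard-EM sketch of the plane-fitting idea, under simplifying assumptions not in the original system: hard point-to-wall assignments, a fixed number of planes, and no robot pose uncertainty. All names are illustrative.

```python
import numpy as np

def fit_plane(P):
    """Least-squares plane through 3D points P: returns (centroid, unit normal).
    The normal is the direction of smallest variance (last right-singular vector)."""
    c = P.mean(axis=0)
    _, _, Vt = np.linalg.svd(P - c)
    return c, Vt[-1]

def em_planes(points, n_planes, n_iters=20, seed=0):
    """Alternate between assigning points to planes (E step) and
    refitting each plane to its assigned points (M step)."""
    rng = np.random.default_rng(seed)
    assign = rng.integers(0, n_planes, len(points))  # random initial association
    for _ in range(n_iters):
        # M-step: refit each plane to the points currently assigned to it
        planes = []
        for k in range(n_planes):
            Pk = points[assign == k]
            if len(Pk) >= 3:
                planes.append(fit_plane(Pk))
            else:  # re-seed a starved plane at a random point
                planes.append((points[rng.integers(len(points))],
                               np.array([0.0, 0.0, 1.0])))
        # E-step (hard): assign each point to its nearest plane
        dists = np.stack([np.abs((points - c) @ n) for c, n in planes], axis=1)
        assign = dists.argmin(axis=1)
    return planes, assign
```

A wall only "exists" here to the extent that points are assigned to it, which mirrors the behavior in the demo where walls appear once enough supporting data has accumulated.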
And so the EM algorithm effectively, in the E step, figures out which points go with which wall, and in the M step it figures out how to move the wall around to better fit the points that were assigned to it. So what we see here is the raw data collected by the robot. The red box is the robot moving around in its environment. The red beams that emanate from the robot are the directions that the laser took in order to collect the point cloud, and what we see here is the point cloud that was collected by the robot as it traversed the environment. Even just looking at this image, we can already see that there is a lot of noise in the laser range data, and that is going to give rise, as we can see, to a very noisy map of the environment. If we now look at the model constructed by EM, we see the planar map constructed by the robot, and this is done on the fly, actually, as the robot is moving. We can see that walls are constructed when there is enough data to support them; that is, when enough points are assigned to a wall, that wall gets constructed, and its pose in the environment gets determined by the EM algorithm. And so we can see that EM constructs a much more plausible and realistic map of the environment than just looking at the raw point cloud data. A different application is also in the context of 3D laser range scans. And, you know, we pick these not because they are the most common applications of EM, but because they give rise to some pretty cool movies. So here is the problem of getting, in this case, 3D body scans of a person in different poses, and the goal is to see whether, by collecting a bunch of these poses, one can reconstruct a 3D skeleton of the person, from the front and from the back. So, in this particular case,
the first problem is actually to correspond points in the different scans to each other, and in this case that was done by a different algorithm that I'm not going to talk about now, although it also used graphical models; in fact, it used a belief propagation algorithm. But now let's talk about the clustering problem, that is, the problem of assigning points in each of those scans to body parts. So here we have the notion of a cluster, which in this case corresponds to a part. Each part, in a given scan of the person, has a transformation; that's why we have a plate around each of the parts, because for each of those parts there are multiple instantiations, one for each of the multiple scans that we have of the same person in different poses. Now, a landmark is a particular point on that person, in that part, and if we knew the part, that is, if we had the part assignment, which is a latent, unobserved variable, then we could predict how the point that we see on this scan would translate to the point on this scan, because that would be effectively a deterministic, or close to deterministic, function of the unobserved transformation of the part. That is, the fact that the arm, say, moved from this pose over here to this raised-arm pose over here. So given the transformation, and given that we know which part the point, or landmark, is assigned to, we can predict how the point transformed from its original position to its new position. So this is a way of clustering points into parts, and one can run EM on that. If one does that, one gets pretty much garbage, because it turns out that there is enough muscle deformation to make the actual positions here fairly noisy, and it's very difficult to get correct part assignments from that. But if we now add an additional component into this model.
Specifically, we add continuity in space: we can now consider, say, two points that are adjacent to each other, and impose the soft constraint, not a hard constraint but a soft constraint, that the part assignments of two adjacent points should be softly the same. It doesn't have to be exactly the same, because otherwise, of course, everything would be assigned to the same part; it's a soft constraint, which is actually an MRF constraint. In fact, it's even an MRF constraint which is associative, or regular, as we discussed before. And so this model, which now does the clustering not of each point in isolation but rather assigns points jointly to parts, taking into consideration the geometry of the person's body, gives rise to considerably better outputs. What we see here is the algorithm in action, and we can see that the algorithm converges very nicely to a partition of the points into different parts, and from this one can easily reconstruct the skeleton. A final application that uses EM in a different model is one that was used in this helicopter trajectory alignment. This is work that was done at Stanford by Andrew Ng's group. Here the input to the algorithm is basically different trajectories of the same aerobatic maneuver, flown by different pilots. They were all trying to do the same thing, but each person has their own idiosyncratic way of doing it, and so the exact sequence wasn't, as we'll see, exactly the same. The goal was to produce an output which aligns the trajectories to each other, and at the same time learns a probabilistic model of what the target, or template, trajectory ought to have been. So how does this get represented as a graphical model? Here we have a latent trajectory, which is the intended trajectory that the pilots were aiming for, and the states, the nodes, are labeled as Zs, and we have parameters that tell us how one would move from Z0 to Z1 and so on.
So those are the model parameters of this Markov model, although it's a hidden Markov model, as we'll see. The expert demonstrations, that is, what we saw in terms of what the expert did at each point in time, are noisy observations of the intended trajectory. But we don't actually know which point in the trajectory the expert observation comes from. So we have this additional set of latent variables, which are the time indices, and they basically tell us which of the true states in the intended trajectory the expert was at at time tau zero; tau one tells us which point in the intended trajectory the expert was at at time one, and similarly for time two, and so on. Here K is the index on the expert demonstration. This problem can be solved using an expectation maximization algorithm, where we have two sets of latent variables, the taus and the Zs, and one set of model parameters, which are the transition probabilities. It's also possible to learn the model for the expert demonstrations, or we could simply fix the noise model; both are legitimate strategies. So let's see how this works. Let's first look at the original trajectories. We can see that, initially, they start out aligned, but after a while they start to diverge. So, for example, by the time we're right about here, you can see that the different experts are at very different places in the trajectory. And so the question is, how can we address that limitation? If we look at the aligned model, which is what we see over here on the right, we can see that the algorithm manages to maintain alignment fairly well, and that here the different trajectories are pretty much at exactly the same point in the sequence. And from that, it's much easier now to learn the model
and have the helicopter fly the correct trajectory. So one important issue regarding latent variables is the number of values of the latent variable, and this is something that has been implicit in many of the applications that I've mentioned in this module. How many user clusters do we have? Or how many different pieces of a phone are there? So how do we pick that cardinality for the latent variable? If we use likelihood to evaluate a model with fewer values for the latent variable versus one with more, then, in the same way that likelihood always overfits, we will find that more values is always better, because it is a strictly more expressive model and can therefore achieve a higher likelihood. Now, one can circumvent that in the same way that we circumvented the likelihood-overfitting phenomenon in the context of other structure learning algorithms. We can use, for example, a score that penalizes complexity, such as the BIC score. As we mentioned back then, the BIC score tends to underfit, and so there has been a range of extensions to the Bayesian score for the context of incomplete data. These are always approximations, because we can't actually do the integration required for the Bayesian score with incomplete data. And we have approximations to the Bayesian score that turn out to work much better in practice than BIC in many practical applications. There are also other strategies that people have adopted. For example, we can use metrics of cluster coherence to look at a cluster and say: it seems like this cluster is really incoherent, so maybe there are really two clusters in there, and maybe we should split it up. Or, if these two clusters are very similar to each other, maybe we should merge them. And there are various heuristic exploration algorithms that basically split and add clusters in a way that hopefully converges to the right number of clusters.
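As a small illustration of the likelihood-versus-penalized-score point, here is a sketch that fits 1-D Gaussian mixtures of different sizes with EM and compares them by BIC. The helper names are made up for this example, and the deterministic quantile initialization is just a convenience, not part of the BIC idea.

```python
import numpy as np

def gmm_em_1d(x, K, n_iters=100):
    """Simple EM for a 1-D Gaussian mixture; returns the final log-likelihood."""
    # deterministic init: component means at evenly spaced data quantiles
    mu = np.quantile(x, (np.arange(K) + 0.5) / K)
    sigma = np.full(K, x.std())
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # E-step: responsibilities under the current parameters
        p = (pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2)
             / (sigma * np.sqrt(2 * np.pi)))
        ll = np.log(p.sum(axis=1)).sum()
        r = p / p.sum(axis=1, keepdims=True)
        # M-step: weighted MLE updates
        Nk = r.sum(axis=0)
        pi = Nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk) + 1e-6
    return ll

def bic(log_lik, n_params, n_samples):
    # BIC: log-likelihood minus a complexity penalty (higher is better here);
    # a K-component 1-D mixture has 3K - 1 free parameters
    return log_lik - 0.5 * n_params * np.log(n_samples)
```

On clearly bimodal data, the raw log-likelihood keeps rising as K grows, but the penalty term lets BIC prefer the smaller cardinality once the extra components stop paying for themselves.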
And finally, of great popularity in the last few years, there is a range of methods that are based on Bayesian techniques. The most commonly used class of these methods is called Dirichlet processes, which, instead of picking a single cardinality for the latent variable, basically maintain a distribution over the cardinality. Then, instead of picking a single value for that cardinality, they average over the different cardinalities, often using techniques such as Markov chain Monte Carlo, although there are also other approaches. The problem of picking the latent variable cardinality is often one of the trickier issues that one has to deal with when learning with latent variables. To summarize: latent variables are a really common scenario in the context of learning graphical models, and perhaps the most common setting for incomplete data. It turns out that when we want to construct models for richly structured domains, the introduction of these latent variables can be critical to the success of the model. We've already mentioned that latent variables satisfy the missing at random assumption, and so the expectation maximization algorithm is applicable in this case, and is very commonly used for optimizing latent variable models. However, as we've discussed previously, when we're learning with latent variables we have serious problems both with identifiability of the learned model and with multiple optima: some are symmetric reflections of each other, which is one facet of unidentifiability, but there are also ones that are not just symmetric permutations of each other, and those really necessitate a good initialization strategy. And we mentioned that picking the cardinality of the latent variables is a key question in these latent variable models, and one to which a lot of work has been devoted, especially in the last few years.