We defined the belief propagation algorithm as a message passing algorithm over a fairly general data structure called a cluster graph, and we showed that if the cluster graph has certain properties, then so does the belief propagation algorithm. What we're going to do now is look at a special case of the belief propagation algorithm, where we pass messages not over a general cluster graph, but over a data structure called a clique tree, which, as we'll show, is a special case of a cluster graph. Interestingly, when we pass messages over a clique tree, we can show that the belief propagation algorithm has much stronger performance guarantees than it does over a general graph. Specifically, we can show that not only do we get guaranteed convergence, we get convergence to the exact answers for probabilistic inference.

So let's consider the case of passing messages over a tree, and let's start with the simplest kind of tree, which is just a chain. In this case we have a chain A, B, C, D, E, and we're going to assume this is an undirected MRF or CRF. For simplicity, and without loss of generality, we're only going to consider pairwise potentials. So, for example, in the cluster {A, B}, we're going to have the pairwise potential over A and B, and so on. So here we have psi one, psi two, psi three, and psi four.

Now let's think about the message passing process, and we're going to start passing messages from the left. Cluster one sends a message to cluster two, and since there are no incoming messages at this point, the message that gets passed is simply the sum over A of psi one of A, B. That's a message whose scope is B. Now cluster two takes that message and passes a message to cluster three: it takes its own potential, psi two, and multiplies this incoming message into it.
The result is a factor over B and C, and we're going to sum out over B, because we only want to restrict to the scope C, which is the shared scope between cluster two and cluster three. And so delta two-three is this message over here. Now we pass messages one more time, all the way down to the end: again, we take delta two-three, multiply it with psi three, and that gives us a message over scope D, which gets sent to cluster four.

We could also pass messages in the other direction. This, for example, is what would happen if we passed messages from cluster four to cluster three. Regardless of when that message gets passed, there are no other messages that cluster four would multiply in when sending a message to cluster three, because, whether it got a message from cluster three or not, it's not going to multiply that message in. Remember, you only send a cluster a message that doesn't involve messages you received from it. And so, in this case, the message that you get is the sum over E of the factor psi four, which we get from here. Passing messages in the other direction again, we see that the message that cluster three passes to cluster two, which is delta three-two, involves psi three and this message over here. And finally, exactly the same idea: these are the messages that get passed.

The one important property that you have when you run message passing in the way that we just showed is that, somewhat surprisingly relative to what we've seen before, the resulting beliefs that you get are correct. Let's convince ourselves of that relative to cluster three. In the order that we defined, cluster one passed delta one-two to cluster two, which was then used to define delta two-three; at the same time we had delta four-three. And so the belief that we get here is the product of psi three, delta two-three, and delta four-three. Let's unpack each of those messages.
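The whole chain computation just described can be sketched numerically. This is a minimal sketch, not the lecture's own code: the potential entries are random placeholders over binary variables, but the messages and the belief at cluster three follow the definitions above exactly.

```python
# Sum-product message passing on the chain A-B-C-D-E, with four pairwise
# clusters psi1(A,B), psi2(B,C), psi3(C,D), psi4(D,E); all variables binary,
# potential entries random placeholders for illustration.
import numpy as np

rng = np.random.default_rng(0)
psi1, psi2, psi3, psi4 = (rng.random((2, 2)) for _ in range(4))

# Forward messages (left to right).
d12 = psi1.sum(axis=0)                   # delta_{1->2}(B) = sum_A psi1(A,B)
d23 = (psi2 * d12[:, None]).sum(axis=0)  # delta_{2->3}(C) = sum_B psi2(B,C) d12(B)
d34 = (psi3 * d23[:, None]).sum(axis=0)  # delta_{3->4}(D)

# Backward messages. Note delta_{4->3} never multiplies in delta_{3->4}:
# a cluster excludes the message it received from the target of its message.
d43 = psi4.sum(axis=1)                   # delta_{4->3}(D) = sum_E psi4(D,E)
d32 = (psi3 * d43[None, :]).sum(axis=1)  # delta_{3->2}(C)
d21 = (psi2 * d32[None, :]).sum(axis=1)  # delta_{2->1}(B)

# Belief at cluster three: psi3 times both incoming messages.
beta3 = psi3 * d23[:, None] * d43[None, :]

# Correctness claim: the belief equals the brute-force marginal of the
# unnormalized joint, summing out A, B, and E.
joint = np.einsum('ab,bc,cd,de->abcde', psi1, psi2, psi3, psi4)
assert np.allclose(beta3, joint.sum(axis=(0, 1, 4)))
```

The final assertion is exactly the correctness property being unpacked next: the belief at cluster three is the product of all four factors with everything outside {C, D} summed out.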
Delta two-three, as we remember, was defined here, and so we're going to rewrite delta two-three as the summation over B of psi two times delta one-two. And similarly, we unpack delta four-three using this expression that we have over here. Now we do one further level of unpacking: we remind ourselves of what delta one-two is, which is as defined over here, and that gives us this expression over there.

Now look at what we have. We have a product of all four initial factors in the network: psi one, psi two, psi three, and psi four. We have a summation over all the variables that are not in cluster three, which are A, B, and E; C and D are in that cluster, so we sum out over everything else. So where do we see a product of factors with irrelevant or unnecessary variables marginalized out? This is really reminiscent of variable elimination. And, if you recall, when we talked about the correctness of variable elimination, we said that variable elimination is correct as long as, when you sum out a variable, you first multiply in all factors that involve that variable in their scope.

So let's convince ourselves that that is in fact what's going on here. Let's look first at the variable E: when we sum out E, we're only involving psi four, but that's okay, because only cluster four has E in its scope. Here we're summing out A, and once again only psi one involves A in its scope. Here we're summing out B, and really there should be parentheses here; they were there implicitly before. And so you can see that when I'm summing out B, I've involved psi two, which has B in its scope, as well as what was left over from summing out psi one, which was the other factor that had B in its scope.
And so you can see that I've marginalized out the variables in a legal order, multiplying in all of the expressions that I needed to multiply before I did the summation. This is a legal order of operations, and therefore it gives me a correct result at the end.

So now let's expand that into the somewhat broader case. We're going to define this notion called a clique tree. A clique tree is an undirected tree, kind of like a cluster graph was an undirected graph; in this case, it's a tree. Once again, the nodes are clusters, and once again, we have edges between the clusters, which are called sepsets. Whereas before the sepset only needed to be a subset of C_i intersect C_j, so it could be equal or it could be smaller, here we require that the sepset is exactly the intersection of the two adjacent clusters.

So now let's go and understand what properties the clique tree that we showed has, as a special case of a cluster graph. For a cluster graph we had the notion of family preservation, and the reason we had family preservation is that, if we have a set of factors Phi, we need to take each phi_k in Phi and assign it to some cluster, so that each piece of information we have is represented somewhere. And that cluster needs to be able to accommodate the factor's scope, so that the scope of the factor is a subset of the set of variables in the cluster. The same requirement holds here, because a clique tree is a special case of a cluster graph: for every factor phi_k, there must exist at least one cluster whose scope is big enough to accommodate that factor.

The second property that we had in cluster graphs was the running intersection property, and let me clarify that the property that we see here is the cluster graph variant.
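The clique-tree structure just defined can be captured in a couple of lines. This is an illustrative sketch: the clusters below are the chain clusters from the earlier example, and the key point is that each edge's sepset is the full intersection of its two endpoint clusters, not merely some subset of it as in a general cluster graph.

```python
# A clique tree as data: nodes are clusters (sets of variables), edges form
# a tree, and the sepset on edge (i, j) is exactly clusters[i] & clusters[j].
clusters = [{'A', 'B'}, {'B', 'C'}, {'C', 'D'}, {'D', 'E'}]
tree_edges = [(0, 1), (1, 2), (2, 3)]

# Sepsets computed as full intersections, per the clique-tree definition.
sepsets = {(i, j): clusters[i] & clusters[j] for i, j in tree_edges}

assert sepsets[(0, 1)] == {'B'}
assert sepsets[(2, 3)] == {'D'}
```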
We're going to make it specific to clique trees in just a minute. The cluster graph variant told me that for each pair of clusters C_i and C_j and a variable X that is present in both of them, there needs to exist a unique path between C_i and C_j for which all clusters and sepsets contain X.

So what does that mean in the context of clique trees? Let's assume that we have C7 containing X and C3 containing X, and the running intersection property says there needs to exist a path between them through which all clusters and sepsets contain X. Well, there's only one path in a tree between C7 and C3: it's this one. And so what that property now tells me is that X has to be here, here, here, and here, because otherwise the running intersection property would be violated. So in the context of trees, the running intersection property can actually be written much more simply: for each pair of clusters C_i and C_j and each variable X in their intersection, every cluster and sepset on the unique path between them must contain X. This is just the instantiation of the running intersection property to the context of trees.

So now let's look at a more complicated clique tree. This is the clique tree that corresponds to the network that we previously used to demonstrate variable elimination; this is the elaboration of our student network. Unlike the previous example, this is actually a clique tree for a Bayesian network as opposed to a Markov network, but as you'll see, there's really no difference. Here we have a set of factors; in this case, the factors are CPDs. And because of the family preservation property, we can be assured that each factor must be assignable to some clique. And so here, we have P(C) and P(D|C) assigned to this clique, P(G|I,D) here, P(I) and P(S|I) here, and so on and so forth.
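Both structural checks just discussed can be sketched on the student-network clique tree: family preservation (every CPD's scope fits inside some clique) and the tree form of the running intersection property (for each variable, the cliques containing it form a connected subtree, which automatically forces it into the sepsets along the path, since sepsets are intersections). The clusters and CPD scopes below follow the lecture's example; the code itself is an illustrative sketch.

```python
# Family preservation and running intersection checks on the student-network
# clique tree: [C,D] - [G,I,D] - [G,S,I] - [G,J,S,L] - [H,G,J].
from collections import defaultdict

clusters = [{'C', 'D'}, {'G', 'I', 'D'}, {'G', 'S', 'I'},
            {'G', 'J', 'S', 'L'}, {'H', 'G', 'J'}]
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]

cpd_scopes = {
    'P(C)': {'C'}, 'P(D|C)': {'C', 'D'}, 'P(I)': {'I'},
    'P(G|I,D)': {'G', 'I', 'D'}, 'P(S|I)': {'S', 'I'}, 'P(L|G)': {'L', 'G'},
    'P(J|L,S)': {'J', 'L', 'S'}, 'P(H|G,J)': {'H', 'G', 'J'},
}

def assign(scope):
    """Family preservation: find some clique that accommodates the scope."""
    for i, c in enumerate(clusters):
        if scope <= c:
            return i
    raise ValueError(f'family preservation violated for {scope}')

assignment = {name: assign(scope) for name, scope in cpd_scopes.items()}

def satisfies_rip():
    """Tree RIP: the cliques containing each variable must be connected."""
    adj = defaultdict(list)
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    for x in set().union(*clusters):
        holding = {i for i, c in enumerate(clusters) if x in c}
        seen, frontier = set(), [next(iter(holding))]
        while frontier:  # traverse only through cliques that contain x
            i = frontier.pop()
            if i in seen:
                continue
            seen.add(i)
            frontier.extend(j for j in adj[i] if j in holding)
        if seen != holding:
            return False
    return True

assert satisfies_rip()
```

Note that the assignment is not unique: any accommodating clique works. This greedy pass assigns each CPD to the first clique that fits it, which happens to match the lecture's assignment for most but not necessarily all of the CPDs.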
So each of the CPDs we have is assigned to one and exactly one of the cliques in this clique tree. And we can convince ourselves that this tree satisfies, for example, the running intersection property. Let's look, for example, at the variable G. The variable G exists here, it exists here, and sure enough it exists here, here, here, here, and here. And you can similarly convince yourselves that the running intersection property holds for any of the other variables in this clique tree.

Let's look at what message passing looks like for this more elaborate clique tree. It's only a slight bit more complicated. Here's delta two-three, for example, which is the message passed between cluster two and cluster three. It takes psi two, which is the product of the factors assigned to this clique (there's actually only one in this case, namely P(G|I,D)), multiplies it with the incoming message delta one-two, and sums out over D. The only slight difference between this and the previous example is that here, for example, when cluster four passes a message to cluster three, we can see that over here there are two variables, J and L, both of which must be summed out in order to produce the message over the requisite sepset.

Now, previously, we defined the running intersection property. One of the really nice aspects of the running intersection property is that it, by itself, together with family preservation, suffices to prove the correctness of message passing in a clique tree. The reason for that is the following. Let's consider a variable X over here. What we want to prove is that if we eliminate X at some point in the tree, at some point in the message passing process,
then we have multiplied in all factors that involve X. So here we have, for example, X here and here, and by the running intersection property, X is here and here as well, and in this sepset. Now, let's imagine that X is eliminated on this edge; that is, the message that's passed from C3 to C5 doesn't involve X, and so X must be summed out at this point. Well, at this point we know that X cannot be in C5. Because if X were in C5, then it would also be in the sepset: if X is in C5, then X is also in S_{3,5}, at which point X wouldn't be eliminated. And the reason we know that if X is in C5 then X is in S_{3,5} is the running intersection property. So the running intersection property relates the structure of the graph to the time at which variables are eliminated.

So now let's convince ourselves of the correctness in this slightly more complicated example, which, as we've seen, satisfies the running intersection property. Here, once again, we have psi one, which is the product of these two factors; psi two, which is this one; psi three, which is the product of these two factors; psi four; and psi five. And once again let's consider, for example, what we get at cluster three. We can see that we have multiplied in all of the potentials: psi one, psi two, psi three, psi four, and psi five. And we have eliminated all of the variables except for G, S, and I. So here we've eliminated C, D, J, L, and H, leaving us only with the variables G, S, and I, which we want to keep. And we can similarly convince ourselves, as we did before, that when we eliminate a variable, we've multiplied in all of the factors that involve that variable.

So to summarize: we can take belief propagation and run it over a cluster graph that happens to be tree-structured. And when we do that, it turns out that the computation that we're running is actually a variant of variable elimination.
And because of that, the resulting beliefs are guaranteed not only to be calibrated, but to actually be the correct marginals of the unnormalized density P tilde Phi.
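The correctness claim on the student-network clique tree can be confirmed numerically. This is a sketch under stated assumptions: all variables are binary, the CPD entries are random placeholders, and the axis orderings of the tables are choices of this sketch, not anything fixed by the lecture. The messages follow the walk-through above, with cluster three being {G, S, I}.

```python
# Message passing toward cluster 3 on the student-network clique tree
# [C,D] - [G,I,D] - [G,S,I] - [G,J,S,L] - [H,G,J], checked against the
# brute-force marginal of the full joint.
import numpy as np

rng = np.random.default_rng(1)

def cpd(*shape):
    """Random conditional table, normalized over the last axis (the child)."""
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

pC    = cpd(2)           # P(C)
pD_C  = cpd(2, 2)        # P(D|C),   indexed [c, d]
pI    = cpd(2)           # P(I)
pG_ID = cpd(2, 2, 2)     # P(G|I,D), indexed [i, d, g]
pS_I  = cpd(2, 2)        # P(S|I),   indexed [i, s]
pL_G  = cpd(2, 2)        # P(L|G),   indexed [g, l]
pJ_LS = cpd(2, 2, 2)     # P(J|L,S), indexed [l, s, j]
pH_GJ = cpd(2, 2, 2)     # P(H|G,J), indexed [g, j, h]

# Messages toward cluster 3, as in the walk-through above.
psi1 = np.einsum('c,cd->cd', pC, pD_C)             # psi1(C,D)
d12  = psi1.sum(axis=0)                            # delta_{1->2}(D): sum out C
d23  = np.einsum('idg,d->gi', pG_ID, d12)          # delta_{2->3}(G,I): sum out D
d54  = np.einsum('gjh->gj', pH_GJ)                 # delta_{5->4}(G,J): sum out H
psi4 = np.einsum('gl,lsj->gjsl', pL_G, pJ_LS)      # psi4(G,J,S,L)
d43  = np.einsum('gjsl,gj->gs', psi4, d54)         # delta_{4->3}(G,S): sum out J, L
psi3 = np.einsum('i,is->si', pI, pS_I)             # psi3(S,I) = P(I) P(S|I)
beta3 = np.einsum('si,gi,gs->gsi', psi3, d23, d43)  # belief over (G,S,I)

# Brute-force check: marginalize the full joint over C, D, L, J, H.
truth = np.einsum('c,cd,i,idg,is,gl,lsj,gjh->gsi',
                  pC, pD_C, pI, pG_ID, pS_I, pL_G, pJ_LS, pH_GJ)
assert np.allclose(beta3, truth)      # exact marginals, as claimed
assert np.isclose(beta3.sum(), 1.0)   # CPDs make the density normalized
```

Because the factors here are the CPDs of a Bayesian network, the unnormalized density is already normalized, so the belief at cluster three is exactly P(G, S, I).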