One of the more interesting stories in the field of probabilistic graphical models relates to the connection between the algorithm of loopy belief propagation and the problem of decoding messages sent over a noisy channel. This is a connection that was discovered fairly recently and has had a tremendous impact on both disciplines: that of probabilistic graphical models and that of message decoding. So let's understand the message decoding problem. Imagine that I want to send k bits across a noisy channel. If I just go ahead and naively send those k bits, then it's impossible for me to know, from the noisy bits that I receive at the other end, what the correct bits ought to have been, because we don't know which ones got corrupted and which ones didn't. And so what people have developed is the notion of coding theory, where those k bits are passed through an encoder, and what we get at the end are n bits, where n is usually greater than k. Those are the bits that are actually transmitted across the noisy channel to give us a set of noisy bits, Y1 up to Yn, which we now hope to be able to decode in a way that gives us bits V1 up to Vk that are close to, and hopefully completely accurate relative to, the bits that were originally sent, U1 up to Uk. That's message coding and decoding in a nutshell. The rate of a code such as this is k over n; that is, you send n bits, of which k are actual information. The bit error rate that one gets, and that's a function of the channel itself, is the probability that a decoded bit, say the first bit, differs from the corresponding original bit; the overall bit error rate is the average of this probability across the k decoded bits. So we can now think about different channels and their properties, and there's this very important notion called the channel capacity.
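To make the rate and bit error rate concrete, here is a minimal simulation sketch of uncoded transmission over a symmetric noisy channel with a 0.1 flip probability; the function names here are my own illustrative choices, not anything from the lecture:

```python
import random

def bsc(bits, flip_prob, rng):
    """Binary symmetric channel: flip each bit independently with probability flip_prob."""
    return [b ^ (rng.random() < flip_prob) for b in bits]

rng = random.Random(0)
k = 10000
sent = [rng.randint(0, 1) for _ in range(k)]
received = bsc(sent, 0.1, rng)

# Uncoded transmission: rate k/n = 1, but with no redundancy to exploit,
# the bit error rate is just the channel's flip probability.
ber = sum(s != r for s, r in zip(sent, received)) / k
print(f"empirical BER: {ber:.3f}")  # close to 0.1
```

The point of coding is to spend some rate (n > k) to push this error rate down.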
So let's first think about different error modes that a channel might exhibit. Here's something called the binary symmetric channel, often abbreviated BSC. The binary symmetric channel says that when I send a zero, then with probability 0.9 I get a zero, but with probability 0.1 I get a one. And conversely, when I send a one, I get a one with probability 0.9 and a zero with probability 0.1. You can see why it's called the binary symmetric channel: the error is symmetric between zero and one. A different error model is an erasure channel, where the bits don't get corrupted, they just get destroyed. The big difference with the binary erasure channel, abbreviated BEC, is that when a bit gets destroyed, I know it. What I get at the end is a question mark, as opposed to a bit where I don't actually know whether it was the original bit that was sent or a corrupted one; here I know the bit was destroyed. And finally, here is what's called a Gaussian channel, where the noise that gets added on top of the signal is actually analog noise, so there is a Gaussian distribution around the signal that was sent. It turns out these different channels have very different notions of what capacity means. We'll define exactly what implications capacity has in a moment, but very smart people have computed the capacity for these different kinds of channels, and it turns out, for example, that the capacity of the binary symmetric channel with this 0.1 flip probability is a little bit over 0.5, whereas the capacity of the binary erasure channel is 0.9, where 0.9 is the probability of getting the bit through uncorrupted. And if you think about it, that makes perfect sense: the capacity of a channel where you know which bits were erased is higher than one where you have to figure out whether bits are right or wrong.
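The capacities just quoted can be checked against the standard closed-form expressions for these three channels; a quick sketch:

```python
import math

def h2(p):
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(flip_prob):
    """Binary symmetric channel: 1 - H(p) bits per channel use."""
    return 1 - h2(flip_prob)

def bec_capacity(erasure_prob):
    """Binary erasure channel: 1 - erasure probability."""
    return 1 - erasure_prob

def awgn_capacity(snr):
    """Gaussian channel: 0.5 * log2(1 + S/N); a wider noise Gaussian means lower SNR, hence lower capacity."""
    return 0.5 * math.log2(1 + snr)

print(bsc_capacity(0.1))   # ~0.531, "a little bit over 0.5"
print(bec_capacity(0.1))   # 0.9
```

This matches the numbers from the lecture: with the same 0.1 corruption probability, the erasure channel's capacity (0.9) is well above the symmetric channel's (~0.531), because knowing *which* bits were lost is much less damaging.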
So this is a more benign type of noise. And what you see over here is the capacity of the Gaussian channel; it's an expression that involves, for example, the width of the Gaussian: the wider it is, the lower the capacity. Now, Shannon, in a very famous result known as Shannon's theorem, related the notion of channel capacity and bit error probability in a way that defines an extremely sharp boundary between codes that are feasible and codes that are infeasible. So let's look at the diagram. The x axis is the rate of the code; remember, the rate was k over n in the example that we gave, and here the rate is measured in multiples of the channel capacity. So this says, once you define the channel capacity, we can look at the rate of a code, and each code could be at a different point of the spectrum in terms of its rate. On the y axis, we have the bit error probability, so obviously lower is better. And what Shannon proved is that this region over here, the region that I'm marking here in blue, is the attainable region: you can construct codes that achieve any point in this space. That is, for any point in this two-dimensional space of rate and bit error probability, you can construct a code that achieves it. He didn't show this with a constructive argument; it was a non-constructive argument, but he proved that such codes exist. Conversely, he showed that anything on the other side of the boundary, the forbidden region, is just not attainable. That is, no matter how clever a coding theorist you are, you could not construct a code with a rate above a certain value and a bit error probability below a certain value, which is what the shape of this boundary that we see over here tells us. And you can see why the channel capacity is called a capacity, because the boundary sits at a rate of one times the capacity.
So this was Shannon's theorem, and it gave, as we said, a non-constructive proof that the blue region was attainable. But the question is, how can you achieve something that's close to the Shannon limit? Up to a certain point in time, the mid-90s, this was sort of a diagram of what was achievable in terms of codes. This is a diagram we'll see several times in a minute. So here, on the x axis, we have the signal-to-noise ratio, measured in dB, and on the y axis, we have the log of the bit error probability. What you see here are codes that were used. First, up here, we see the uncoded version, and you can see that the uncoded version is not very good: it has very high bit error rates, and higher here means worse, so, very bad. And, of course, as the signal-to-noise ratio decreases, moving to the left, the error rate grows. What you see over here in these two other lines are the curves achieved by two NASA space missions, Voyager and Cassini, and you can see that there was a good improvement between Voyager and Cassini, which is 1977 to 2004. And then, in May of '93, a revolution happened in coding theory. This was a paper by Berrou et al., and you can read the title of the paper; it was a shocking title. It says "Near Shannon limit error-correcting codes," and they called these turbo codes. Initially, when they tried to submit this paper, people didn't believe that this was possible, because it was so much better than any of the previous codes that had been proposed, and people said, no, this can't possibly be the case. But they checked it, and it turned out that, in fact, their code worked. But nobody really understood why. Okay, so now you might wonder, why the heck am I telling you all this? What does it have to do with probabilistic graphical models? So, why was this so surprising?
Because if you look at the turbo codes in terms of this diagram that I showed you a minute ago, you can see that over here we have, again, the uncoded version; this is the same kind of diagram, signal-to-noise ratio on the x axis, log bit error probability on the y axis. Here, once again, we have Voyager and Cassini. And now look at the turbo codes. The turbo codes are way to the left of any of these previous codes. And again, remember, left is better: left means that you are achieving the same bit error probability at a lower signal-to-noise ratio. So what is the magic of turbo codes? And this is where we bring ourselves back to probabilistic graphical models, because you might be sitting there thinking, well, you know, fine, coding theory's great, but this isn't a coding theory class; what does this have to do with probabilistic graphical models? To understand that, let's look at what turbo codes do. Turbo codes took the same set of bits that we want to transmit, U1 up to Uk, and passed them through two encoders, encoder one and encoder two. Then those bits were passed through our noisy channel over here, so that what we get at the end is noisy versions of those two sets of bits. Now, decoding is really a probabilistic inference problem when you think about it, because what we're trying to do is compute a probability distribution over the message bits given my noisy evidence Y1 and Y2, which are the noisy bits I received on the channel. But instead of trying to solve the probabilistic inference problem exactly, what Berrou et al. proposed in this turbo coding paper is a sort of iterative approach, where we have a decoder that matches the first encoder. So decoder one matches encoder one, and decoder one takes the noisy bits that you get from encoder one after they pass through the noisy channel.
And decoder two gets a separate set of bits: the bits that were passed through encoder two and then the noisy channel. What happens, roughly speaking, in turbo codes is that the two decoders start to communicate with each other. Decoder one looks at its bits, and it doesn't exactly know what the correct bits are, but it computes a posterior over the message bits. It takes that posterior and passes it to decoder two, which uses it as a prior over the values of these target bits, the U's, combines it with the evidence from its own noisy bits Y, and now computes a more informed posterior over the U's, which it then sends back to decoder one. This process continues to iterate until convergence. So each decoder computes a posterior and passes it to the other decoder, which uses it as a prior. And if you look at the implications of this iterative algorithm, you can see here the different curves for turbo decoding. What you see here, again, is the signal-to-noise ratio on the x axis and the log of the bit error probability on the y axis, and you can see that in the first iteration of turbo decoding, the curve is not that great. But then, as the iterations proceed, the bit error rate goes down. So, for example, for a fixed signal-to-noise ratio, say here, the bit error rate goes down from here to here over the iterations. What you see as the dark line at the bottom is the optimal bit decision if we were to do exact probabilistic inference. And so the surprising part was how close this totally heuristic and, as it seemed at the time, unmotivated algorithm comes to the optimal bit decision.
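The prior-to-posterior handoff between the two decoders can be illustrated with a toy single-bit example. To be clear, this is only a sketch of the message-passing idea, with made-up helper names; a real turbo decoder works over whole codewords and exchanges only "extrinsic" information between iterations:

```python
def bsc_likelihood(y, u, flip_prob):
    """P(received y | sent u) under a binary symmetric channel."""
    return 1 - flip_prob if y == u else flip_prob

def decode_step(y, flip_prob, prior_one):
    """Posterior P(u = 1 | y), treating the incoming message as a prior over u."""
    p1 = prior_one * bsc_likelihood(y, 1, flip_prob)
    p0 = (1 - prior_one) * bsc_likelihood(y, 0, flip_prob)
    return p1 / (p0 + p1)

# Bit u = 1 was sent.  Decoder 1 sees a corrupted 0 on a noisier channel;
# decoder 2 sees a 1 on a cleaner one.
belief = decode_step(y=0, flip_prob=0.2, prior_one=0.5)     # decoder 1's posterior: 0.2
belief = decode_step(y=1, flip_prob=0.1, prior_one=belief)  # becomes decoder 2's prior
print(f"P(u = 1 | y1, y2) = {belief:.3f}")  # ~0.692
```

Because the two observations here are independent, one handoff already recovers the exact joint posterior (0.09 / 0.13 ≈ 0.692); in a real turbo code the dependencies among codeword bits make exact joint inference intractable, which is why the decoders keep iterating.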
The revolution that happened in this field came about when other people, people like McEliece, MacKay, and Frey, realized that what was really going on here is that this algorithm is running a variant of loopy belief propagation. Each of these decoders is actually running exact inference over a network that is tractable on its own, and the beliefs that are being passed between them are what we have as the delta i,j messages between, in this case, these two components. Really, what was going on here were two decoders communicating with each other via a process that is exactly identical to loopy belief propagation. And this caused a revolution in two different fields. It caused a revolution in the field of probabilistic graphical models, because up until that point, people had known that you could pass loopy messages over the graph. In fact, this was an observation made as early as 1988 by Judea Pearl, when he wrote the seminal book that really introduced Bayesian networks to the field of artificial intelligence. He wrote at the time that, sure, you can pass these messages over this loopy graph, but you have no guarantees of convergence and no guarantees that it gets the right answers, and so perhaps it's not a good idea. And from 1988 until the mid-1990s, the algorithm had been largely abandoned, because people said that it had these limitations, so it wasn't a good idea. But when people realized that in the context of coding theory, this algorithm, which is obviously heuristic and has all these potential limitations, nevertheless succeeded in achieving performance that looks like this, people said, wait a minute: if it works so well for decoding, maybe it's a good algorithm to think about after all.
And from the mid-1990s until today, this set off a huge amount of work in the field of probabilistic graphical models on understanding when and why loopy belief propagation works, as well as on constructing improved versions of loopy belief propagation, some of which we briefly mentioned in this course. So that's the first revolution. The second revolution occurred in the field of coding, where people said, well, if all that's going on is that we're running loopy belief propagation over a graph that computes the posterior over the U's, the message bits, given the noisy bits, then we can construct lots of other codes and run loopy belief propagation on the resulting graphs. And sure enough, some of the more successful codes in use today are not necessarily the turbo codes that Berrou et al. originally came up with, but a class of codes called low-density parity-check codes, which are actually a really elegant match to loopy belief propagation. Now, it turns out low-density parity-check codes were actually invented by Robert Gallager in 1962, and from 1962 up until the late '90s, they had been totally abandoned, because they were computationally intractable in terms of exact probabilistic inference. But the realization that one could run loopy belief propagation as a way of decoding these codes reintroduced them into the field, and now they're wildly successful, with all sorts of variations and extensions. So what are these parity-check codes? In a parity-check code, we are actually sending two types of bits. These are the original bits, U1 up to U4, and the first set of sent bits are just copies of the original message bits. I denote them here as x's even though x1 is actually identical to u1; I just wanted to denote specifically that x1 is a sent bit. So x1 is just a copy of u1, and x4 is a copy of u4.
And once again, the Y's are the noisy bits that are received after transmission. On the other side, the x's that you see here on the bottom are parity bits. Parity bits are like checksums on a file: each one checks the parity of a different subset of the bits. So x5 here is the parity, that is, whether the sum is even or odd, zero if even and one if odd, of U1, U2, and U3. x6 checks the parity of a different set of bits, U1, U2, and U4, and x7 does U2, U3, and U4. So here we're sending four original bits and three parity bits, so in total this is a code whose rate is four over seven, because there are seven bits sent, of which four are message bits. And when you think about it, this really lends itself beautifully to a belief propagation algorithm, because you have these factors over here, for example this factor over U1, U2, U3 and the received bit Y5. And so messages are passed over a cluster graph that has these factors associated with it, as well as the factors over the singletons U1 up to U4 separately, each relative to its own evidence Y1 up to Y4. The decoding is now done as loopy belief propagation, where we have, as we said, the U1, U2, U3, and U4 singleton clusters, and then these are the large factors. Some of you will recognize this as a Bethe cluster graph, a Bethe-structured cluster graph.
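The (7, 4) parity scheme just described can be written down directly. Here is a small sketch following the lecture's parity assignments (the function names are mine); it only encodes and evaluates the parity checks, whereas the actual decoding would run belief propagation over the cluster graph:

```python
def encode(u):
    """Encode 4 message bits into a 7-bit codeword, rate 4/7:
    x1..x4 copy the message; x5..x7 are XOR (parity) bits."""
    u1, u2, u3, u4 = u
    return [u1, u2, u3, u4,
            u1 ^ u2 ^ u3,   # x5: parity of U1, U2, U3
            u1 ^ u2 ^ u4,   # x6: parity of U1, U2, U4
            u2 ^ u3 ^ u4]   # x7: parity of U2, U3, U4

def parity_checks(x):
    """Each entry is 0 iff its check is satisfied; a clean codeword gives [0, 0, 0]."""
    return [x[0] ^ x[1] ^ x[2] ^ x[4],
            x[0] ^ x[1] ^ x[3] ^ x[5],
            x[1] ^ x[2] ^ x[3] ^ x[6]]

codeword = encode([1, 0, 1, 1])
print(codeword, parity_checks(codeword))  # all three checks pass

corrupted = codeword[:]
corrupted[0] ^= 1                 # flip one bit in transit
print(parity_checks(corrupted))   # the pattern of violated checks localizes the error
```

Each parity check corresponds to one of the large factors in the cluster graph, and each message bit to a singleton cluster, which is exactly why the structure matches belief propagation so cleanly.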
And so this is just a list of applications. This is probably one of the most ubiquitous applications of probabilistic graphical models in use today, simply because it exists in so many instantiations: many set-top boxes, digital video broadcasting, communications, mobile television, mobile telephones, satellites, NASA missions, wireless networks, and so on. So this is one of the really powerful applications of loopy belief propagation, of probabilistic graphical models, that arose from this unexpected confluence of these two disciplines. Really, you could say that loopy belief propagation, which was originally proposed by Judea Pearl in 1988, was rediscovered by practitioners of coding theory. These were not theoreticians of coding theory; these were people who actually sat down and engineered codes. Understanding those turbo codes in terms of loopy belief propagation, which was done by a bunch of more machine-learning and information-theory-type people, led to the development of many, many new and better codes, and the current codes are actually coming gradually closer and closer to the Shannon limit in a constructive, as opposed to a purely theoretical, way. And at the same time, the resurgence of interest in belief propagation led to much of the work that we see today on approximate inference in graphical models, both on the theoretical-understanding side and in the new algorithms that people are coming up with, derived from that new understanding.