Hi, and welcome to week four of Practical Bayesian Methods. This week we're going to cover Markov chain Monte Carlo, which is kind of a silver bullet of probabilistic programming, because it allows you to train, or to do inference in, almost any model without too much trouble. We'll discuss some extensions of this technique, so how to make it faster in particular problems by exploiting the structure of those problems, and we'll also discuss some limitations, like the speed of the method. So, let's start the discussion of Markov chain Monte Carlo with the Monte Carlo part. Monte Carlo originated in the Manhattan Project, which gave us the atomic bomb, and it was a quick and dirty answer to the question of what to do until the mathematicians arrive. The idea of Monte Carlo simulation is this: if you have a complicated model of some problem and you want to predict something about this model, then instead of thinking carefully, writing down equations and solving them, you may just simulate your model lots and lots of times and then take, for example, the average of the simulations as a smoothed prediction of what will happen. Probably the reason this thing was invented in the Manhattan Project is that it was one of the first really huge scientific projects where there was a real need to approximate integrals and run these kinds of simulations, and, second of all, where computers already existed in some reasonably useful state, so you could actually simulate stuff, not mentally but automatically. And the Monte Carlo algorithm turned out to be pretty efficient, and it made it onto the list of the top 10 algorithms of the 20th century. The name Monte Carlo comes from the city of Monte Carlo, which is famous for its casinos, and probably also because everything in the Manhattan Project had to have a code name. So let's see a small example of what I'm talking about: Monte Carlo applied to a really simple problem.
Say you want to approximate the value of pi, and you know that pi is the area of the unit circle, a circle with radius one. We can use this information to approximate it. The idea is that if you consider the square from zero to one on both axes, then the quarter of the circle that falls inside this square has exactly the area pi divided by four; it's just one fourth of the circle. And its area equals an integral of an indicator function which says, at each point, whether that point is inside the circle or not. So if we integrate this thing over the square, we get the area. How can we compute this integral? Analytically, it's hard. But we can throw lots and lots of points into this unit square, look at the fraction of those points that fall inside the quarter circle, and use that fraction as an approximation to what the integral really is. Now, you can see that this is maybe not the best method to approximate the value of pi: here, for example, we spent 3,000 samples, so we sampled 3,000 points, and pi was approximated as 3.1, which is not the best accuracy you can get with this much computation. But the method is really simple, like two lines of code in almost any programming language. And this is the general theme of Monte Carlo simulation: it usually gives you something which is really easy to program, which is universally applicable to lots and lots of problems, and which is very amenable to parallelization. If you have, say, 100 computers, you can really easily parallelize this thing to be 100 times more efficient, which is not the case for many other algorithms that are much harder to parallelize. On the other hand, Monte Carlo can sometimes be slow compared to other alternatives. Usually, if you sit down for a weekend, write down lots of equations and think carefully, you may come up with a better way to solve your problem than a plain Monte Carlo simulation.
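Here is what those "two lines of code" might look like in Python (the function name and the sample count are just illustrative):

```python
import random

def estimate_pi(n_samples):
    # Throw points uniformly into the unit square [0, 1] x [0, 1].
    # The quarter circle x^2 + y^2 <= 1 has area pi / 4, so the
    # fraction of points that land inside it, times 4, estimates pi.
    inside = sum(
        1 for _ in range(n_samples)
        if random.random() ** 2 + random.random() ** 2 <= 1.0
    )
    return 4.0 * inside / n_samples

print(estimate_pi(100_000))  # typically prints something near 3.14
```

Note how trivially this parallelizes: each of your 100 computers can run `estimate_pi` independently, and you just average their answers.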
Sometimes not, but often it's just a quick and dirty approach which gives you a first approximation. Anyway, it's very useful. To be a little more general about this family of techniques, let's say that we want to approximate the expected value of some function f with respect to some distribution p of x. We will sample some n points from this distribution p of x and use the average of the f of x_i's as the approximation. So here we are using the sample average instead of the expected value. And this estimate has lots of nice theoretical properties: for example, it's unbiased, and if you sample enough points you get closer and closer to the actual true answer, at least with high probability, and you can even quantify how fast you converge to it. So, this is Monte Carlo. And you may ask: why do we care? Why do we need to approximate expected values? Well, in probabilistic programming there are a few reasons to do that; let's consider two of them. First of all, if you want to do full Bayesian inference, as we covered a little bit in week one, then instead of having some probabilistic model and finding the single best value of its parameters, you may want to treat your parameters as latent variables. Then, for a new object x, if you want to predict its label, and you have some training data set X_train and y_train, you may want to integrate out this latent variable, like this. So imagine, for example, that x is an image and you want to predict whether it's an image of a cat or a dog. Then p of y given x and w may be, for example, a neural network with weights w, which takes as input this image x and outputs a distribution over labels y. And then, instead of using just this one neural network with some kind of optimal or best set of weights w, you are considering all possible neural networks with this architecture.
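In code, the general recipe is just "sample, apply f, average". A minimal sketch (the function names here are illustrative), using the fact that E[x^2] = 1 under a standard normal as a sanity check:

```python
import random

def mc_expectation(f, sample, n_samples=100_000):
    # Approximate E_p[f(x)] by the sample average of f over
    # n_samples independent draws from p (provided by `sample`).
    return sum(f(sample()) for _ in range(n_samples)) / n_samples

# Sanity check: under a standard normal, E[x^2] is exactly 1.
estimate = mc_expectation(lambda x: x * x, lambda: random.gauss(0.0, 1.0))
```

The estimate is unbiased, and its error shrinks as you increase `n_samples`.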
So, for each possible value of the weights w, you consider the corresponding neural network, you pass your image through it, you write down its answer p of y, and then you average all these predictions with weights given by the posterior distribution over the weights, p of w given the training data set. It's like an infinitely large ensemble of neural networks, where the weights of the ensemble are given by the posterior distribution p of w given the training data set. And this p of w given the training data set gives you some idea of how well each setting of the weights would behave as weights for your neural network: this value will be higher for reasonable weights and lower for weights that don't make sense. This way you're doing full Bayesian inference, and you get a few benefits of probabilistic modeling; for example, you can predict the uncertainty in your predictions, because instead of just a fixed value of the weights you have a distribution over weights. This integral is, of course, intractable: if you have a complicated function like a neural network here, you can't compute it analytically, so you have to do some approximation. And one way to do it is to notice that this thing is just an expected value of the neural network output with respect to some distribution over the weights. So we can approximate it with Monte Carlo, if we can sample from p of w given the training data set. This is one example where Monte Carlo can be useful to approximate an expected value in probabilistic modeling. And you may ask: how can we find this posterior distribution of the weights, p of w given the training data set? Well, the posterior distribution is proportional to the joint distribution: the likelihood, p of y_train given X_train and w, times the prior on w, divided by some normalization constant.
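As a sketch of this "infinite ensemble" idea, suppose we somehow already have samples w_1, ..., w_M from the posterior p(w | X_train, y_train); how to obtain such samples is exactly the topic of this week. Then the predictive distribution is just an average of the per-sample predictions. The names below (`bayesian_predict`, `predict`) are illustrative, and the one-weight logistic model is a toy stand-in for a real neural network:

```python
import math

def bayesian_predict(x, weight_samples, predict):
    # Monte Carlo estimate of p(y | x, data): average the model's
    # prediction p(y | x, w) over posterior samples of the weights w.
    return sum(predict(x, w) for w in weight_samples) / len(weight_samples)

# Toy stand-in for a neural network: logistic model with one weight,
# returning p(y = 1 | x, w).
def predict(x, w):
    return 1.0 / (1.0 + math.exp(-w * x))

posterior_samples = [-0.5, 0.1, 0.8]  # pretend these came from p(w | data)
p_y = bayesian_predict(2.0, posterior_samples, predict)
```

Averaging the ensemble's disagreement like this is also what gives you an uncertainty estimate for free: if the posterior samples predict very different things, the averaged probability stays close to 0.5.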
The numerator here is easy because you defined your model yourself: you could, for example, define the prior p of w to be normal, and the likelihood is just the output of your neural network, so you can compute this thing for any values of y, x, and w. But the denominator, the normalization constant Z, is much harder: it contains an integral over all possible values of the weights, so you usually can't compute it. So it's kind of a general situation: you know the distribution from which you want to sample, but only up to a normalization constant. Another example of where this kind of expected-value approximation can be useful in probabilistic modeling is the M-step of the expectation maximization algorithm, which we covered in week two. On the E-step of expectation maximization, we find the posterior distribution of the latent variable T given the data set and the parameters. And on the M-step, we want to maximize, with respect to the parameters theta, the expected value of the logarithm of the joint probability, where the expected value is taken with respect to this posterior distribution q. Here again, if the posterior distribution q is too hard to work with and we can't compute this integral analytically, we can do a few things. First, we can restrict q to lie in some simple class, like factorized distributions; that's what we did in the previous week, week three, about variational inference: we approximate the posterior and then compute this integral analytically for the simplified posterior. Or we can use the exact posterior, but sample a set of latent variable values from it, approximate the expected value by the average over these samples, and then maximize this approximation instead of the true expected value. So these are two examples of where you may need expected values in probabilistic modeling, and how Monte Carlo simulation can help you there.
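That second option is sometimes called Monte Carlo EM. A minimal sketch of the M-step objective (the function names and the toy log-joint are illustrative, assuming we already have samples of T from the posterior q):

```python
def mc_m_step_objective(theta, latent_samples, log_joint):
    # Monte Carlo approximation of the M-step objective
    # E_{q(T)}[log p(X, T | theta)]: average log_joint over samples
    # of T drawn from q(T) = p(T | X, theta_old). The M-step then
    # maximizes this approximation over theta.
    return sum(log_joint(t, theta) for t in latent_samples) / len(latent_samples)

# Toy log-joint that prefers theta close to the latent samples.
log_joint = lambda t, theta: -(t - theta) ** 2
samples = [0.0, 1.0, 2.0]
best_theta = max([0.0, 0.5, 1.0, 1.5, 2.0],
                 key=lambda th: mc_m_step_objective(th, samples, log_joint))
```

Here the grid search is just a stand-in; in practice you would maximize the approximate objective with whatever optimizer suits your model.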
In the following videos, we're going to use the following formulation of this problem: we want to learn how to sample from a complicated probability distribution which we know only up to a normalization constant. And we will always write p of x, meaning that x is the thing we want to sample. So here, in the first example of full Bayesian inference, we had p of w given some data, and in the second example we had p of the latent variable given, again, some data. But in all the following videos, we will just write p of x, meaning that it is the distribution we want to sample from.