[MUSIC] Okay, so now let's discuss how to find the gradient of our objective with respect to the parameters phi. So here's the objective, and we want to differentiate it. Again, let's rewrite the definition of the expected value as an integral of the probability times the logarithm of the function. And again, we can move the gradient inside the summation, which changes nothing, and also inside the integral: if the functions are smooth and nice, then we can swap the integration and differentiation signs. However, in contrast to the case in the previous video, we cannot push the differentiation sign all the way forward to the logarithm. First of all, because the gradient of the logarithm of p of x with respect to phi is zero, since log p of x doesn't depend on phi. So the right-hand side of that expression would just be zero, which is obviously not what the left-hand side equals. And the reason we can't do that is that q itself depends on phi. So we have to take the gradient of q with respect to phi. But if we do that, the problem is that we no longer have an expected value. If you look at the first equation on this slide, it is a sum of integrals of the gradient of q times the logarithm of p. And this is not an expected value with respect to any distribution. So you can't approximate it with Monte Carlo: you can't sample from some distribution and then use the samples to approximate it, because there is no distribution here. There is just the gradient of a distribution, which is not itself a distribution, and the logarithm of a distribution, which is also not a distribution. So we can't approximate this thing with Monte Carlo directly. How, then, can we approximate this gradient? Well, one thing we can do is the following. We can artificially introduce a distribution: we multiply and divide by the distribution q, and then we can treat this q as the probabilities.
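In symbols, the derivation so far can be summarized like this (writing q(x) for the variational distribution that depends on phi; this notation is assumed here to match the slide):

```latex
\nabla_\phi \, \mathbb{E}_{q(x)}\!\left[\log p(x)\right]
  = \nabla_\phi \int q(x)\,\log p(x)\,dx
  = \int \nabla_\phi q(x)\,\log p(x)\,dx
  = \int q(x)\,\frac{\nabla_\phi q(x)}{q(x)}\,\log p(x)\,dx .
```

The last step is the "multiply and divide by q" move: nothing has changed numerically, but now a q(x) factor sits out front, ready to be read as a density again.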
And then the gradient of q, times log p, divided by q, is the function whose expected value we are computing. Or, if you simplify this expression a little bit, the gradient of q divided by q is just the gradient of the logarithm of q, by the definition of the derivative of the logarithm. Then we can rewrite the formula as follows: an integral of q, times the gradient of the logarithm of q, times the logarithm of p. It's still an exact formula; we haven't lost anything to any kind of approximation. And now we can say that this last expression is an expected value with respect to q: the expected value of the gradient of the logarithm of q, times the logarithm of p. This is sometimes called the log-derivative trick, and it works for any distribution. It allows you to differentiate an expected value even when the gradient of that expected value is not an expected value itself. So now you have an expected value again, and you can sample from q and approximate the gradient with Monte Carlo. It's a valid approach, and until recently people used it, and it kind of worked. But here's the problem: although this expected value is exact, if you try to approximate it with Monte Carlo, you'll get a really loose approximation, because its variance is high, and you'll have to sample lots and lots and lots of points to get an estimate of the gradient that is even a little bit accurate. The reason is this logarithm of p of x. When we start training, p of x is as low as possible, because p of x is a distribution over natural images and has to assign some probability to every image. So at the start, when the model doesn't know anything about our data, any image is really improbable according to it, and the logarithm of this probability may be something like minus one million.
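To make the trick concrete, here is a minimal NumPy sketch of the log-derivative estimator, under assumptions that are not from the lecture: q is a one-dimensional Gaussian N(phi, 1), and a toy function f(x) = -x^2 stands in for log p(x), so the exact gradient is known in closed form and we can check the estimate against it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (assumed for illustration, not the lecture's model):
# q(x) = N(phi, 1), and f(x) = -x**2 stands in for log p(x).
# Then E_q[f(x)] = -(phi**2 + 1), so the exact gradient w.r.t. phi is -2*phi.
phi = 0.5

def grad_log_q(x, phi):
    # For q = N(phi, 1): d/dphi log q(x) = x - phi.
    return x - phi

def log_derivative_estimate(phi, n_samples):
    # Monte Carlo average of grad_phi(log q(x)) * f(x), with x sampled from q.
    x = rng.normal(loc=phi, scale=1.0, size=n_samples)
    return np.mean(grad_log_q(x, phi) * (-x**2))

est = log_derivative_estimate(phi, 100_000)
print(est)  # close to the exact gradient -2 * phi = -1.0
```

Note that the estimator only needs log q to be differentiable in phi, not f itself, which is exactly why the trick works for any distribution you can sample from.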
So the model at the beginning hasn't yet adapted to the training data, and it thinks these training images are really, really improbable. This means we are finding an expected value of something times minus one million. And because the first term, the gradient of the logarithm of q, can be positive or negative, when we do Monte Carlo and average a few samples, we'll get something like minus one million, plus 900,000, minus 1,100,000, and so on. So we'll get values that are really high in absolute value but of different signs, while on average they will be correct, around, I don't know, 100, which is the exact value of the gradient in this example. But the variance is so high that you will have to use lots and lots of samples to approximate this thing accurately. And note that we didn't have this problem in the previous video, because instead of the logarithm of p, we had the gradient of the logarithm of p. And even if the logarithm of p is something like minus one million, its gradient will probably not be that large. So this is a problem, and in the next video we'll talk about one nice solution to it in this particular case: how can we estimate this gradient with a small-variance estimator? [MUSIC]
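The variance blow-up can be demonstrated with a toy log-derivative estimator (the Gaussian q and the stand-in f below are assumptions for illustration, not the lecture's model): shifting f by a large constant, mimicking log p(x) being around minus one million early in training, leaves the true gradient unchanged but inflates the variance enormously.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy setup: q(x) = N(phi, 1), with f(x) = -x**2 + offset
# standing in for log p(x). The score is d/dphi log q(x) = x - phi.
phi = 0.5

def estimate(offset, n_samples):
    # One Monte Carlo run of the log-derivative estimator.
    x = rng.normal(loc=phi, scale=1.0, size=n_samples)
    return np.mean((x - phi) * (-x**2 + offset))

# The exact gradient is -2*phi in both cases: a constant offset
# integrates out exactly, because E_q[grad_phi log q(x)] = 0.
runs_small = [estimate(0.0, 10_000) for _ in range(50)]
runs_big = [estimate(-1e6, 10_000) for _ in range(50)]
print(np.std(runs_small))  # tiny spread around the true gradient -1.0
print(np.std(runs_big))    # spread larger by several orders of magnitude
```

Subtracting such a constant (a baseline) is one classical variance-reduction fix; the next video discusses another approach for this particular case.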