[MUSIC]

In the previous video, we completely defined our model.

And now all that is left is to understand how to maximize it,

with respect to the weights of both neural networks, w and phi.

So we have to maximize this kind of objective.

And since it has an expected value inside,

we have to approximate it with Monte Carlo somehow, right.

So let's look closer at this objective.

First of all, the KL part is easy.

Because it's just the KL divergence between a Gaussian with

known parameters and the standard Gaussian.

So we can compute this term analytically.

So although it has an integral inside, we can compute it analytically.

And this expression will not cause us any trouble,

both in terms of evaluating it and finding gradients with respect to the parameters.

So we can just not think about it and let TensorFlow take care of the gradients,

if we define the divergence with this kind of analytical formula.
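As a small sketch of this analytic computation (a toy example, assuming a diagonal Gaussian q with mean mu and standard deviation sigma produced by the encoder; the function name is mine, not from the lecture):

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    # Analytic KL( N(mu, diag(sigma^2)) || N(0, I) ),
    # summed over the latent dimensions:
    # 0.5 * sum( sigma^2 + mu^2 - 1 - log sigma^2 ).
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2))

# Sanity check: the KL of the standard Gaussian with itself is zero.
print(kl_to_standard_normal(np.zeros(2), np.ones(2)))  # → 0.0
```

Because this is a closed-form expression in mu and sigma, the same few lines written in TensorFlow would be differentiated automatically with respect to phi.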

So let's look a little closer into the first term of this expression.

Let's call it f, a function of the parameters w and phi.

So this function is a sum, with respect to objects,

of expected values of the logarithm of a probability.

And recall that we decided that each q i, for an individual object,

would be some distribution q on t i, given x i and phi,

which is defined by a convolutional neural network with parameters phi.

So let's rewrite it as follows, and

let's start by looking at the gradient of this function with respect to w.

So the gradient of this function with respect to w

looks as follows: we have the gradient of a sum of expected values.

And we'll write out the expected value by definition.

So the latent variable t i is continuous, and thus

the expected value is just the integral of the probability times the function,

the logarithm of p of x i given t i.

Now, we can move the gradient sign inside the summation.

Because summation and taking the gradient do not interfere with each other,

we can swap these two signs.

And also, for smooth and nice functions,

we usually can swap the integration and gradient signs, like this.

Finally, since the first function, q of t i given x i and phi,

doesn't depend on w, we can easily push the gradient sign even further inside.

Because this q is just a constant with respect to w,

it doesn't affect the value of the gradient;

we just have to multiply the gradient of the logarithm by this value.

And now we can see that what we obtained is just an expected value of

the gradient, right?

A sum, with respect to the objects in the data set,

of expected values of the gradient of the logarithm.
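Written out, the chain of steps just described is (N denotes the number of objects):

```latex
\nabla_w \sum_{i=1}^{N} \mathbb{E}_{q(t_i \mid x_i, \phi)} \log p(x_i \mid t_i, w)
= \sum_{i=1}^{N} \nabla_w \int q(t_i \mid x_i, \phi)\, \log p(x_i \mid t_i, w)\, dt_i
= \sum_{i=1}^{N} \int q(t_i \mid x_i, \phi)\, \nabla_w \log p(x_i \mid t_i, w)\, dt_i
= \sum_{i=1}^{N} \mathbb{E}_{q(t_i \mid x_i, \phi)}\, \nabla_w \log p(x_i \mid t_i, w)
```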

And we can approximate this expected value by sampling.

So we can sample one point, for example,

from the variational distribution q of t i.

And then put it inside the logarithm of p of x i given t i, and

compute its gradient with respect to w.

So basically what we're doing here is just passing our image through our

encoder network, to get the parameters of the variational distribution q of t i.

Then we sample one point from this variational distribution.

And then we put this point as input to the second neural network with parameters w.

And then we just compute the usual gradient of this second neural

network with respect to its parameters.

Its input here is the sample t i hat.

So this is just the usual gradient.

We can use TensorFlow to find it automatically.
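Here is a minimal numeric sketch of this one-sample estimator, using a toy one-dimensional "decoder" p(x | t) = N(x; w*t, 1) so that the gradient can be written by hand; the model and all names are illustrative, not the lecture's actual networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy decoder p(x | t) = N(x; w * t, 1), so
# grad_w log p(x | t) = (x - w * t) * t.
def grad_w_log_p(x, t, w):
    return (x - w * t) * t

# Parameters of the variational distribution q(t | x) = N(mu, sigma^2),
# as an encoder network would produce them for one object x.
mu, sigma = 0.5, 0.2
x, w = 1.0, 0.8

# One-sample Monte Carlo estimate of the expected gradient:
t_hat = rng.normal(mu, sigma)          # t_i hat ~ q(t_i | x_i, phi)
one_sample_estimate = grad_w_log_p(x, t_hat, w)

# Averaging many such samples approaches the true expectation,
# E[(x - w*t) * t] = x*mu - w*(mu^2 + sigma^2) = 0.268 here.
many = grad_w_log_p(x, rng.normal(mu, sigma, size=200_000), w).mean()
print(round(many, 2))  # prints 0.27
```

In practice one would not differentiate by hand: a single sample t hat is fed into the decoder and TensorFlow computes the gradient of the log-likelihood with respect to w automatically.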

And finally, this thing depends on the whole data set,

but we can easily approximate it with a mini-batch, right?

We can write it as some constant, to normalize things, times a sum with respect to

a mini-batch of random objects which we have chosen for this particular iteration.

And this is a standard stochastic gradient for a neural network.

So you don't have to think here too much; you just have to find the gradient,

with TensorFlow, of the second part of your neural network,

with respect to its parameters.
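The mini-batch rescaling can be sketched as follows (a toy example where per-object gradients are stand-in numbers; the normalizing constant is N / M, making the estimate unbiased for the full sum):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for the per-object gradients grad_w log p(x_i | t_i hat)
# over the whole data set (toy numbers, purely illustrative).
N = 1000
per_object_grads = rng.normal(size=N)

full_gradient = per_object_grads.sum()

def minibatch_estimate(M):
    # Pick M random objects and rescale by the constant N / M,
    # so that the estimate is unbiased for the full-data-set sum.
    batch = rng.choice(N, size=M, replace=False)
    return (N / M) * per_object_grads[batch].sum()

# A single noisy but unbiased estimate of the full gradient:
print(minibatch_estimate(50))
```

This is exactly the standard stochastic-gradient setup for a neural network, just with an extra sampling step for the latent variable inside each term.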