0:02

This is the kind of question you probably asked much more

frequently in the final years of your school education.

You have this logarithm of

some arbitrary function f of x and you want to get the gradient,

the derivative of this function.

And I want you to simplify the derivative for me by applying some of

the properties of derivatives you've studied

previously in calculus or finally as a school education.

Yes, if you first apply chain rule and then try and then replace the gradient of,

well the derivative of logarithm with the appropriate derivative from the table,

you will see that this thing is equal to one over f of x,

times the derivative of the f of x itself.

If you multiply both parts by f of x,

if you just move the one over f of x to the left,

you see that this thing is equal to the following equation.

Have the derivative f of x on the right hand side,

which is equal to the function itself,

f of x times the derivative of its logarithm.

This is a very simple trick, it's called the logderivative trick.

It's actually a very powerful one.

Let's see how we can use it to simplify our lives.

So, you have this nabla J from the previous for like three slides, six slides ago.

And we're now going to plug the logderivative trick this equation directly

instead of the nabla of probability in this function of a normal distribution.

This gradient N theta mu sigma squared.

So, if we replaced five,

the right hand side of this formula below,

you'll see that this nabla J becomes equal to this formula which is even longer,

but as it turns out much simpler.

So, this is an integral over the probability from normal distribution of

thetas times the gradient of the logarithm of this probability here,

times the expected trajectory return defined by the second expectation.

Now, can we tune this particular format?

Can we adapt it to estimate the nabla J in a sampled manner to speed up,

to make the computations tractable in any practical environment. What do we do?

I'll guess. The easy part here is

that once we have the integral times of the probability density function,

we can pretend this is an expectation gain.

And this gradient of the logarithm is whatever is expected in

this expectation and be both

simplified for it becomes like this thing here on the top of the slide.

So, if you now replace the expectations with samples,

you'll see this guy here below.

And again it's more of the same thing but you multiplied by

the gradient of the logarithm of the normal distribution probability density function,

which you can explicitly compute via, well,

even do this on a sheet of paper in like 10 minutes,

or you can try to google it,

or work from it to whatever.

If you remember, TensorFlow or Theano or

any other symbolic graph frameworks from the deep learning course.

You also know that you can get the value of

this gradient explicitly by simply using the symbolic gradients.

So, here you go, you have a sampled estimate over

the simple estimate of a derivative of the expected return,

respect to your mu and sigma squared.

So, once you have finally obtained a way

to estimate the nabla J in a sampled based manner like this formula in the slide.

It's time to use this formula to devise

an actual algorithm which is capable of sorting and reinforcement during task at hand,

from kind of from scratch, from zero to the convergence station.

The algorithm's going to be very similar to what it had before.

And you will start with some initial guess of mu and sigma squared.

Let's see that the initial guess provided us with

this sample of green dots at the bottom center of this slide.

And then again, intuitively adjust it using the feedback,

the reward function of a trajectory defined by

this hew with a blue kind of easy clients on the right.

Basically, this is a simple type problem to illustrate how the algorithm performs.

So, you have a mu and sigma and once you compute the gradient of

the expected return which is slightly better on the upper right part of the samples.

If you can predict gradient of this thing with respect to mu and sigma squared,

you'll see that by following this gradient,

you'll move the mu upwards and to the right by

a slight degree using the learning rate you have for the gradient descent.

Still slightly, it will crawl upwards by a small step.

Then, if you repeat the whole process again,

you get even better estimate,

and maybe even better estimate, and so on and so forth.

Each time you draw new samples from the normal distribution,

say while say, 50 samples per iteration,

which would be more or less a key.

Three sampling a blade one or several gains and average their performance.

Now, eventually this thing is going to crawl uphill and find itself in the maximum,

maybe a local one here.

And since it's a probability distribution,

it's not yet the final stage,

because there's one more thing to change,

that you still can adjust the derivative of the sigma squared.

Adjust the sigma squared to minimize the loss given by the variance.

And this uses to this segue of crawling uphill the sampled base manner,

which will eventually find itself in a more or less delta function in the optimal point.

And by eventually, of course, they mean,

if you have an infinite amount of iterations,

and infinite amount of samples and so on,

because in a practical case,

you can only guarantee that it's kind of more likely to be

or in some vicinity of the optimum as you progress.

The formal step by step definition,

the algorithm looks very similar to whatever you had before in the cross-entropy method.

And basically, it shows that we first have to,

yes, the initial values for mu and sigma squared.

These are both, maybe large vectors.

You can try either using some generals,

a mu of zero and sigma of a small number

for every possible parameter in the entire array.

Or you could use a prior node issue, for example,

pre-train your neural network on some, well,

supervise the data or on some similar environment and use it as an initial guess.

So, once the initial mu seen with the taint,

you then get the new sample trajectories.

You sample mu, you sample theta from mu sigma sample trajectories and this theta.

And you use those trajectories to estimate the use

of the expected tretron using the formula we've just derived.

Now, once we get the formula here,

we use it to perform a gradient ascent.

So, we want to maximize the nabla J,

we say that our new mu is the previous mu plus aIpha,

some learning rate, say 0.01 times the derivative of J with respect to this mu.

And the same works for sigma. If you replace

those derivatives with a more ugly fixed stuff,

this actually means that once you've sampled the, well,

thetas and first theta your sample trajectories and returns,

then you can just break down the VGU which is non-analytical manner,

find a formula non-pi or any other, well,

write a formula non-pi or any other favorite library of yours.

And basically you repeat the process until you're satisfied with the result.

I've written a formula for

a one-dimensional mu here but you can easily extend it for anything.

And for sigma squared, that small is the same thing,

you have the derivative of normal distribution.

And since normal distribution is something times the expanse of

the square difference between mu and the particular sample,

the logarithm of this normal distribution is even easier to compute.

As you can see right now, the new definition

of initial strategy algorithm is not that simple.

If it's from a single slide now of the large font size.

As I promised to you, this definition is also quite

decent from our usual notion of evolution and,

well, natural selection yada yada yada.

It's rather a trick from domain of stochastic optimization methods.

And it is true for

any stochastic optimization method or

any reinforcement early learning algorithm for that matter.

We can slightly improve it by applying duct tape,

the usual dose of hacks, heuristics, and tricks.