Well right, so the problem with TensorFlow here is that it can only compute

the derivatives with respect to something that is in the formula.

And if you take a closer look at this particular Monte Carlo estimate of J, you'll notice that it has no place for parameters, for the thetas, here.

The problem is that this sampled estimate does depend on the parameters, on the policy, but it depends on them only through the sessions you sampled.

So you have sampled the sessions from your policy.

But you don't remember the values,

you don't remember the probabilities any longer.

So no, TensorFlow won't be able to compute this thing for us.

And in fact, not only TensorFlow: any mathematician, if you give them only the second formula, will probably call you a jerk if you ask them to differentiate it.

Okay, so we have to do something else.

The answer is that we have to estimate the gradient of J.

Now some simple, so-called duct-tape approaches you could try to employ are, for example, finite differences.

So instead of trying to compute the derivative, you can just pretend that the infinitely small value in the definition of the derivative is equal to, say, 10 to the power of minus five, this epsilon here.

You can compute something that looks like a derivative, but it's not the true derivative, because the value is not infinitely small.
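In symbols, this duct-tape approximation is just the usual difference quotient with a fixed, finite epsilon:

```latex
\nabla_\theta J \;\approx\; \frac{J(\theta + \epsilon) - J(\theta)}{\epsilon},
\qquad \epsilon = 10^{-5}
```

For a vector of parameters you would apply this one coordinate at a time, perturbing a single theta while holding the rest fixed.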

This will technically work.

It will require you to, well, compute J by sampling.

And then change this policy ever so slightly, by this small value of epsilon.

And then find the new value of J under the changed policy by sampling more sessions.
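As a sketch of the procedure above, assuming a hypothetical helper `evaluate_j(theta)` that plays a batch of sessions with the policy parametrized by `theta` and returns the average reward (the name and signature are illustrative, not from the lecture):

```python
import numpy as np

EPS = 1e-5  # the fixed, "not infinitely small" epsilon from the text


def finite_difference_grad(evaluate_j, theta, eps=EPS):
    """Estimate dJ/dtheta one coordinate at a time by finite differences."""
    grad = np.zeros_like(theta)
    j_base = evaluate_j(theta)                 # sample sessions at theta
    for i in range(len(theta)):
        theta_shifted = theta.copy()
        theta_shifted[i] += eps                # nudge a single parameter
        j_shifted = evaluate_j(theta_shifted)  # sample sessions again
        grad[i] = (j_shifted - j_base) / eps   # looks like a derivative, isn't one
    return grad
```

Note that each gradient estimate costs two full rounds of sampling per parameter, which is exactly the inefficiency criticized below.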

Now another way, which you are probably more familiar with by now, is to use the cross-entropy method.

What it does is it tries to somewhat maximize something that looks

like J by sampling a lot of sessions.

And then taking those in which the J from a particular session was

higher than that of other sessions.

So the expected reward was larger.
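A minimal sketch of this idea, assuming the policy is parametrized by a vector `theta` sampled from a Gaussian, and a hypothetical helper `sample_session(theta)` that returns the total reward of one session (both names are assumptions for illustration):

```python
import numpy as np


def crossentropy_step(sample_session, theta_mean, theta_std,
                      n_sessions=100, percentile=70):
    """One iteration of a simple Gaussian cross-entropy method."""
    # sample many candidate parameter vectors and play one session with each
    thetas = np.random.normal(theta_mean, theta_std,
                              size=(n_sessions, len(theta_mean)))
    rewards = np.array([sample_session(t) for t in thetas])

    # keep the "elite" sessions whose J was higher than the others'
    threshold = np.percentile(rewards, percentile)
    elite = thetas[rewards >= threshold]

    # refit the sampling distribution to the elite sessions
    return elite.mean(axis=0), elite.std(axis=0) + 1e-3
```

Repeating this step concentrates the distribution on parameters that produced high-reward sessions; the small additive floor on the standard deviation is one common tweak to keep it from collapsing too early.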

And both of those methods will technically work, although they have some problems.

Now this time I would ask you to criticize those methods.

So while those two methods do work in theory, in practice the quantities they rely on are very hard to estimate efficiently in any realistic situation.

For example,

if you are trying to solve Breakout via this finite-differences method, it would actually take you, say, 100 games to estimate the first value, J at theta plus epsilon.

And then it'll take you another 100 games to estimate the second J.

And even then, the amount of noise you introduce by sampling would still be much larger than the difference between the two policies, especially if epsilon is sufficiently small.

And if you use large values of epsilon, your gradient would be useless for anything more complicated than a linear model, or maybe a table in this case.

Stochastic optimization, like the cross-entropy method, is in this case much more practical in terms of how it uses samples.

But it still has some problems.

For example, remember what happens if you have some innate randomness: say, a slot machine in your environment, or some physically random process.

In this case you'll have to use your cross-entropy method with some tweaks to prevent it from favoring the lucky outcomes, from believing that the elite sessions are elite because they have some way of tricking the slot machine.

The method we are going to study next will mitigate both these problems by the way it

computes the derivative of J.

It won't use any high wizardry.

Instead it will try to find an analytical derivative

of J which is easily approximated with sampling.
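As a preview of what such an analytical derivative looks like, this is the standard log-derivative trick, stated here ahead of the lecture's own derivation:

```latex
\nabla_\theta J
  = \nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \right]
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau)\, \nabla_\theta \log \pi_\theta(\tau) \right]
```

The right-hand side is itself an expectation under the current policy, so it can be approximated by simply averaging over sampled sessions.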
