0:02

This time, deducting going to be of a different blue color

because it's not technically heuristic.

It's rather a better way to

define the reward function adaptively as your agent progresses.

Imagine you are trying to train a bipedal robot to walk forward as fast as possible.

So, you have a robot with a ton of motors in each unit that they can move independently

and you give it a reward for the amount of distance it

managed to cover in say, 10 seconds.

The distance is measured in meters,

so you get a plus say, 10 for 10 meters.

So, in this case,

if you start training from a random guess,

you probably find out that for first several iterations,

the reward therefore is going to be somewhere between say,

zero and one because

the initial session is going to consist of robot falling down his face,

and maybe crawling forward by sporadic,

chaotic movements of his limbs.

So, this is how you're going to begin.

But eventually, as you hopefully converge to some kind of optimal way,

robot looks upright, the theory was of the plus 100,

maybe even more if your robot find a way to run fast.

This basically, it doesn't technically break anything.

But, it does make you adjust the learning parameters of your algorithm.

For example, if you find an optimal learning range for the beginning of your algorithm,

the [inaudible] will probably be an overshoot at the or interference where the rewards are latched.

So, instead of trying to adjust learning based on the fly, you can try a different thing.

You can try to use, redefine your rewards based on this simple equation.

So, if there is something that you subtract from each reward,

this some kind of constant value which doesn't depend on a particular session,

or multiplied by something core, or subtract something,

then maximizing the initial reward is exactly,

it's going to give you exactly the same solution as maximizing this new reward.

If your algorithm converters turn often, of course.

This is the same sequence that happens if we use the,

if we modify the last function for supervise learning.

This time we just normalize the rewards.

And to normalize, we can compute the median and standard deviation

based on either several sessions within one iteration.

So, if sampled, say a hundred sessions,

we get the mean reward,

the standard deviation via maybe your favorite numpile-like package,

and then you subtract one and divide by the second.

But, the session which is the best of

those 100-session batch will still get

the highest award because that's how arithmetics work.

So, this new reward,

which is the reward minus mean divided by standard deviation,

those rewards is usually referred to as the advantage.

How your algorithm performs compared to

the mean performance of other attempts within this iteration,

or well, other algorithms depending on how you define the,

involved the events function here, how you calculate the mean.

And basically, it has the same properties as,

it gives the same benefits as when you train a linear model when

you normalize the feature there as an input.

So, here it goes. You have a modified version.

Just block in this advantage instead of

every occurrence of rewards in the previous slide like this,

and then you'll get an algorithm which hopefully converts slightly faster.

We'll get a much closer look at this advantage function,

and its benefits and drawbacks in

the later weeks for our course when we cover the policy based methods.

So, here it goes.

Now, we'll see how the evolution strategies compares to

other algorithms we've already covered and the ones we are yet to cover.

It's again, yet another black box characterization trick.

So, the huge upside

of the evolution strategies algorithm that it's super easy to implement.

And since a single iteration of the algorithm is very simple to implement,

it's also kind of trivial to parallelize.

Imagine you have, imagine you live in

the modern era and for any research you have several CPUs,

say 1,000 CPUs that can compute things independently,

and you want to use all,

the whole bunch of CPUs to estimate this formulae.

Let's say that we take well,

a hundred samples for theta,

and for each sample we play 10 games with this particular theta to cover for the,

in say like a CCC of this second sum in the formulation.

How can we compute this formula explicitly without losing a lot of

time because of the sequential nature? Probably, right?

You can sample a thousands,

say a hundreds values of theta,

and then send each theta to 10 cores in your

1,000 core cluster so that this core would be able to compute one trajectory.

Then for attempt computing just one trajectory you would get

a number of trajectories equal to your number, of course.

And this gives you a linear improvements of scale of 1,000.

This is super great because more complete algorithms usually don't scale that well

because they have a lot of sequential paths that cannot be easily paralyzed.

So, but here's an upside.

Now, we see some of the results of how

this algorithm works in the practical environments.

This is a report by OpenAI.

There's also a very well written blog post that you can read,

which we recommend you to do right now.

This probably be a pop-up with a text right here.

So, here are the training costs,

the rewards per the number of iterations the algorithm for

each of the three games they have tried the algorithm on.

There are of course much more experimentations in the black box,

they can find that yields slightly different although generally similar results.

And what you're going to find here is that the evolution strategies,

the one that's the orange line here,

is usually almost as efficient as this other method TRPO.

Now, TRPO takes almost an entire lecture to explain,

and in some cases it's even less efficient as we can see in the very first plot here.

But, if we can parallelize our method,

we can scale it to multiple cores,

it'll be much more efficient than TRPO because of the all well,

computation power you haven't had.

And you cannot do the same parallelization trick with TRPO as efficiently.