0:00


Now the only issue with the method that we have left untackled is how to apply it efficiently to problems at a greater scale. For the bicycle-riding problem, with say ten states and maybe four actions, the tabular version is probably sufficient to solve it. But not every reinforcement learning problem is that small.

Let's see how the method applies to something like steering an autonomous self-driving car, or playing a game from raw visual feedback. The problem here is that it's no longer possible to store a table with action probabilities for all the possible states, because you don't have ten states. You have, say, a bajillion states, or even a continuous space of states, which is technically impossible to record.

If you consider direct camera feeds from your robot or from your game, you will find that the number of possible camera images is the color depth, say 256, raised to the power of the number of pixels. So even for a 100 by 100 pixel image, which is super small, you'd have 256 to the power of 10,000 possible states, which is insane. You won't be able to record a table like that on any hard drive, ever.
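To get a feel for how large that number is, here is a quick check (the 100 by 100 image size and 256 intensity levels are taken from the example above):

```python
# Distinct 100x100 images with 256 possible values per pixel:
n_states = 256 ** (100 * 100)

# Python handles arbitrarily large integers, so we can at least count its digits.
print(len(str(n_states)))  # a number with over 24,000 decimal digits
```

For comparison, the number of atoms in the observable universe is usually estimated at around 80 digits.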

Now, not only would storing this table be impractical or even impossible in practice. It would also be rather inefficient, because of how state spaces work in complex environments.

Say you're playing this Doom game, and you have this camera image. It's probably reasonable to expect that if an agent visits this state, say, ten times, it will reliably learn that turning to the right and firing subsequently is a reasonable strategy to execute. The only problem is that if you change this state ever so slightly, say you change the health bar from 59% to 49%, then the agent will have to learn everything from scratch, because it's now a different state that it has not yet seen.

1:53

Of course, this is not the case for humans. Humans can easily generalize, and this is a property you want from your reinforcement learning algorithms as well. This is where we arrive at so-called approximate reinforcement learning.

This time we don't use a tabular policy; we don't record all the probabilities explicitly. Instead, we use some kind of machine learning model, well, any model you want, to model this probability distribution over actions given the state. For example, it could be a neural network. Basically a network of the kind previously used for classification: it takes a state, an image, applies some convolutional layers if you're familiar with image recognition with neural networks, and then it outputs through a softmax layer with as many units as you have actions. You could also use any other method. It could be logistic regression, any other linear model, or a random forest, anything that can predict action probabilities for you.
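As a minimal illustration of such a model, here is a sketch in plain NumPy: a single linear layer followed by a softmax, standing in for the neural network (the class and all its names are hypothetical, for illustration only):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class LinearPolicy:
    """A stand-in for the network: one linear layer plus softmax over actions."""
    def __init__(self, n_features, n_actions, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.01, size=(n_features, n_actions))
        self.b = np.zeros(n_actions)

    def action_probs(self, state):
        """Probability distribution over actions, given a state vector."""
        return softmax(state @ self.W + self.b)

policy = LinearPolicy(n_features=4, n_actions=3)
p = policy.action_probs(np.ones(4))  # three probabilities that sum to 1
```

Any model with the same interface, a state in, a probability per action out, could take its place.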

And since you can no longer set or update the probabilities explicitly after each iteration, you'll have to replace that phase with training your model. Again, you play a batch of games in this Doom environment. Then you select the M best of them; you call those the elite games. And then, instead of recomputing the whole table, you simply perform several iterations of gradient descent, or build several new trees.
Well, let's consider the neural network option for now. In the case of a neural network, what you do is initialize the network with random weights. You probably remember some clever ways to do so from the deep learning course. You take this network and use it to pick actions, to actually play, say, 100 games in the Doom environment we had on the previous slide. Then you again take the several best sessions, and denote them as the elite sessions. And you train your neural network to increase the probability of the actions taken in the elite sessions.

You do so the way you usually train neural networks to classify. The states are the inputs of your neural network, and the actions in the elite sessions serve the purpose of the correct answers, the target y in supervised notation. Now you take some kind of stochastic [INAUDIBLE] algorithm, say [INAUDIBLE], or stochastic gradient descent, or [INAUDIBLE], anything proper. You perform one or several iterations of network updates given those M best sessions, and then you repeat the process.
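The elite-selection step described above can be sketched as follows; the function name and the percentile-based threshold are illustrative assumptions, not the lecture's exact procedure:

```python
import numpy as np

def select_elites(states_batch, actions_batch, rewards_batch, percentile=70):
    """Keep the (state, action) pairs from sessions whose total reward
    is at or above the given percentile of the batch."""
    threshold = np.percentile(rewards_batch, percentile)
    elite_states, elite_actions = [], []
    for states, actions, reward in zip(states_batch, actions_batch, rewards_batch):
        if reward >= threshold:
            elite_states.extend(states)
            elite_actions.extend(actions)
    return elite_states, elite_actions

# Toy check: three sessions with total rewards 1, 5 and 10.
elite_s, elite_a = select_elites(
    states_batch=[[0], [1], [2]],
    actions_batch=[["left"], ["right"], ["right"]],
    rewards_batch=[1, 5, 10],
    percentile=70,
)
# Only the best session (reward 10) clears the 70th-percentile threshold.
```

The pairs this returns are exactly the (input, target) examples the network is then trained on.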

4:07

Now, you can of course implement this in high-level frameworks, and it looks remarkably simple. You have your neural network, and whenever you want to update the policy it provides, you just call whatever the framework's training method requires you to call, say fit in scikit-learn. You provide your elite states as the inputs to your network, and the elite actions as the reference answers.
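For instance, in scikit-learn the update step might look like the sketch below. The data here is random stand-in data, and using MLPClassifier with partial_fit is one possible choice of model and training call, not necessarily the one from the assignment:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical stand-in for elite states/actions collected from the best sessions.
rng = np.random.default_rng(0)
elite_states = rng.normal(size=(50, 8))      # 50 elite states, 8 features each
elite_actions = rng.integers(0, 3, size=50)  # actions taken in them (3 possible actions)

agent = MLPClassifier(hidden_layer_sizes=(32,))
# partial_fit performs a single pass over the data, so each call is
# "one or several iterations of network updates", repeated whenever
# a new batch of elite sessions arrives.
agent.partial_fit(elite_states, elite_actions, classes=[0, 1, 2])

# The trained classifier now plays the role of the policy:
probs = agent.predict_proba(elite_states[:1])  # action probabilities for one state
```

On the next iteration you would play new games with this policy, select new elites, and call partial_fit again.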

Or you can get something more sophisticated up and running if you utilize the particular features of your neural network architecture. But this is the simplest, bare-bones version of the algorithm, and it will still work, and work rather efficiently.

You'll probably see this guy in action in the practical assignment.