0:00

So, once we're done with those DQN-specific architectures,

I want to direct your attention to the elephant in the room.

This elephant that we have promptly ignored

since the second week of our course and right until now.

The elephant is the fact that in almost any practical reinforcement learning problem,

the environment doesn't strictly abide by the Markov Decision Process rules.

The main issue for now is the fact that,

in almost no case,

your agent will have direct access to

the environment's state as the MDP assumes.

The environment state in an MDP is basically everything there is to know about the environment.

Strictly speaking, if you're navigating a robotic car through city streets,

then you not only need the camera image of this robot.

You need to know the exact positions of all your surroundings, and quite frankly,

you have to know all the properties of

all the quantum particles in the known universe because,

technically, that's the only possible scenario

for which the Markov property holds exactly.

So, it's an issue that you cannot solve exactly.

It's an issue that you cannot simply ignore

because there's loads of stuff that prevents you from learning the optimal policy.

So, we have to somehow mitigate the fact that our agent's observations are imperfect.

For robotic car, we would very much like to know what's happening right

behind us even though our camera might only be facing forwards.

For example, we want to take into account the fact

that if there is someone beeping behind us,

then there's probably a car there even though we might not see it directly.

The same is true for any other practical situation.

For example, if you're trying to trade your stocks or anything,

then you might benefit from knowing the history of how this asset traded over the last month,

not just its current price.

Finally, within Atari games,

you don't know a lot of variables, like the velocity of objects on your game field.

This brings a lot of complications even for a usual DQN,

complications that we have mitigated with duct tape so far.

Now, to mitigate this issue,

the first thing we have to do is we have to redefine the way our decision process works.

The usual Markov Decision Process looks like this.

We have an agent and our environment,

and environments sends state to an agent,

which in turn uses his policy to pick his action,

and the action gets fed back into the environment to get the next state.

We assume that there is some probability of getting to the next state ST plus 1,

and obtaining a reward r, given

the particular state and action we were at on the previous step.

We typically assume that in any practical case,

we don't know the distribution explicitly,

but we can estimate it from samples if we wish.
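A minimal sketch of this sample-based estimate, with made-up state and action names and a hypothetical `p_hat` helper, just counting how often each (state, action) pair led to each next state:

```python
from collections import Counter, defaultdict

# Toy list of observed (state, action, next_state) transitions (illustrative names).
transitions = [
    ("s0", "a0", "s1"),
    ("s0", "a0", "s1"),
    ("s0", "a0", "s2"),
    ("s0", "a1", "s0"),
]
counts = defaultdict(Counter)
for s, a, s_next in transitions:
    counts[(s, a)][s_next] += 1

def p_hat(s, a, s_next):
    # Empirical estimate of P(s' | s, a): fraction of times (s, a) led to s_next.
    c = counts[(s, a)]
    return c[s_next] / sum(c.values())
```

With enough samples this converges to the true transition distribution, even though we never see it explicitly.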

[inaudible] algorithm is mitigated

But nevertheless, even though

we don't know the probability distribution,

we do have direct access to those states.

We know the state and can devise our policy based on it.

However, the situation will look much closer to this scheme here.

Don't freak out, it's always a bit more complicated.

The difference here is that while in the previous situation,

the agent received the environment state directly,

now we have this observation function,

the O of S on the left here.

The observation function is basically some function which limits what the agent can see,

and what he cannot see.

So, there is a hidden state in the environment,

this S here in a circle,

and technically it exists.

There even is a next state included in the transition distribution.

But you never get to see, not just the probability distribution,

but the entire state itself as well.

So, you don't get to see the state; you only see a sequence of observations,

and you may somehow judge

what happens inside the environment based on those observations,

but that's the best thing you can count on.

So, to actually solve this new process, called a

Partially Observable Markov Decision Process because of this observation function,

we need to introduce something else into our agent,

which helps him operate in this situation.

What would help you to for example,

not forget about the person that you are not seeing right now directly?

If you've just looked away from a person,

what would help you to still keep him in mind?

Well yes. What we want to introduce is,

there should be some kind of persistent agent memory.

A memory cell where he can store some information between iterations.

So, there is some kind of this hidden variable h,

which is yet another vector or any other set of numbers, and on every iteration,

the agent can update his memory,

computing his new h given his observation and the previous memory.
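A minimal numpy sketch of this update rule, with a hypothetical `update_memory` function and arbitrary small random weights (all sizes and values here are purely illustrative):

```python
import numpy as np

def update_memory(h_prev, observation, W_h, W_o):
    # The new memory is a nonlinear function of the previous memory
    # and the current observation: h_t = f(h_{t-1}, o_t).
    return np.tanh(W_h @ h_prev + W_o @ observation)

rng = np.random.default_rng(0)
W_h = rng.normal(size=(4, 4)) * 0.1   # memory -> memory weights (illustrative)
W_o = rng.normal(size=(4, 3)) * 0.1   # observation -> memory weights (illustrative)

h = np.zeros(4)                        # start with no prior information
for obs in rng.normal(size=(10, 3)):   # feed ten observations one by one
    h = update_memory(h, obs, W_h, W_o)
```

The key point is only the signature: each iteration maps (previous memory, current observation) to a new memory vector of the same size, so the update can be chained indefinitely.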

There is of course a bunch of different ways you can actually implement this memory cell,

and we'll discuss them in just a few minutes,

but so far let's find out why this thing is useful.

Now, one way you can think about it is that this memory state,

the h here is a tool for an agent to

approximate what's happening in the hidden environment state.

This S which is now hidden under the blue cloud at the bottom.

So, there is some hidden variable, and you could use your h to

learn how to reverse-engineer this hidden variable from a sequence of observations,

and then use this reconstructed variable to alter your policy.

For example, if you have a robotic car example where currently,

you're approaching say, a traffic light,

and in this case,

your current observation is just the current state of this traffic light.

It's either green or red or yellow, in most cases,

and let's say it doesn't have a timer or any other additional information sources.

Now, for your usual MDP agent,

this information is insufficient for many situations.

For example, if you know that the traffic light is green,

but it's going to turn red in just, say,

three seconds, then you want to speed up to get

past the traffic light before it switches, to get to your destination faster.

That's where our memory comes into play.

Basically, your memory is updated based on the observations.

So, your agent can learn to, for example,

count the number of seconds since the traffic light last switched, and this way,

it can basically infer

information about what's going to happen next, because

it understands how the hidden variable,

the hidden traffic light timer operates and basically reverse-engineers it.
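The timer-counting memory above can be sketched in a few lines; the `update_timer` helper here is hypothetical, and the light sequence is made up for illustration:

```python
def update_timer(steps_since_switch, prev_light, light):
    # Reset the counter whenever the light changes; otherwise keep counting.
    # This is exactly the kind of hidden-variable bookkeeping a learned
    # memory could pick up on its own.
    return 0 if light != prev_light else steps_since_switch + 1

lights = ["red", "red", "green", "green", "green", "red"]
timer, prev = 0, lights[0]
history = []
for light in lights[1:]:
    timer = update_timer(timer, prev, light)
    history.append(timer)
    prev = light
```

After this loop, `history` tracks how long the current color has been on at each step, which is precisely the hidden timer the single-frame observation never shows.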

The same is true for a lot of other cases, and of course,

you won't get a drastic increase in performance just

given this one ability to recognize traffic light properties.

But if you add up all the influences for the situations

that you can kind of reverse-engineer with this memory,

you'll get a huge boon.

But of course, this memory is only useful if an agent can effectively operate with it.

In fact, there's one thing which technically qualifies as

memory although it's not learned or anything, it's not that complicated,

which we use for a usual Deep Q-Network to work with Atari efficiently,

to be able to get some information about object velocity here.

What kind of memory was this one? Well, yes.

Basically for the Atari games,

we just use the frame buffer heuristic.

Basically, we said that we cannot get the state variables exactly,

but we can get almost everything we need if we just stack, say,

the last four observations, or any other number of observations that proves useful.
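A minimal sketch of this frame-stacking heuristic, using a fixed-size FIFO queue; the `FrameBuffer` class and the 84x84 frame size are illustrative, not the course's exact implementation:

```python
from collections import deque
import numpy as np

class FrameBuffer:
    """Keeps the last n observations; the oldest is dropped on overflow (FIFO)."""
    def __init__(self, n, frame_shape):
        self.frames = deque([np.zeros(frame_shape)] * n, maxlen=n)

    def push(self, frame):
        self.frames.append(frame)      # deque drops the oldest frame automatically

    def state(self):
        return np.stack(self.frames)   # shape: (n, *frame_shape), one frame per channel

buf = FrameBuffer(n=4, frame_shape=(84, 84))
for t in range(6):                      # push six frames into a 4-slot buffer
    buf.push(np.full((84, 84), float(t)))
stacked = buf.state()                   # now holds frames 2, 3, 4, 5
```

The stacked array is what gets fed into the convolutions as separate channels, so the network can estimate motion across consecutive frames.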

Now technically, this gives us all the information we need for Atari,

but it has a number of flaws.

For instance, this way,

we cannot remember anything which happened

more than four turns before this particular step.

If you want to monitor something more complex,

then four turns is not going to be enough.

In Atari, the effect of this heuristic is so

great only because most of

the hidden information is just velocity and maybe acceleration of objects,

which are traceable from two and three frames respectively.

So, the architecture we used for the Deep Q-Network

with the frame buffer was basically this neat scheme here.

The difference between the one-frame DQN and this one is that we have the frame buffer,

which contains four images: the image for the current time frame, the previous one,

the one before the previous one, and so on,

and together they are used to estimate the kind of motion,

the dynamics of things, via

all those convolutions that they are fed into as different channels.

Now, this kind of stack here,

the FIFO queue here,

the first-in-first-out structure, is in a way a simplified agent memory.

Of course, it's, again, not that complicated,

but it's something which persists between the iterations.

Technically, it solves our problem to some limited extent.

However, there's a much more powerful approach here.

Well, the overall idea is that we're trying to train some architecture,

which assumes that there is some hidden process there, basically,

there is a hidden process of the environment state,

and that you can only see some observation,

some visible part which is not the entire process.

You want to infer this hidden state.

There's actually one architecture in

deep learning which works with these exact assumptions,

and uses them rather well. What am I talking about?

Yeah. Recurrent Neural Networks.

Of course, there's a bunch of those guys,

but generally, the idea is that you have a vector of numbers,

and you learn a transformation which transforms the previous vector of numbers,

your previous memory state,

and your current observation, your current

time frame in Atari or anything, into a new vector of the same amount

of numbers, so that you can then apply

this transformation iteratively, to itself, for as many steps as you want.

So, now I'm just going to repeat to you all the information you've already been

taught at the very first course of advanced machine learning specialization.

Namely, The Introduction to Deep Learning.

There, in the final week by Ekaterina Lobacheva,

we've been taught about how recurrent neural networks operate and how they're trained.

You basically initialize those weights here,

the blue squares with random values,

and then you simply apply this transformation to basically itself,

as many steps as you want.

For example, you can take your observations from 10 steps ago.

So, 10 Markov decision process steps before you got this observation,

and you feed it into your recurrent neural network from bottom,

from this yellow triangle.

You initialize the initial hidden states,

the h0 state here on the picture, at, say, zero.

Some fixed value that defines that the network has no prior information.

Then you simply make the first application of

those weights, and then the second application, feeding in the next frame.

So the first one was 10 steps ago,

this one is nine steps ago.

Then third frame and fourth frame and so on until you

get the current frame, where now your hidden state,

your h, depends on all the previous frames starting from minus 10 frames ago,

or potentially for as many frames as you want.

You may start from say a million frames ago,

it would only take you say a few years to train.

Now, after this whole process,

your final hidden state is used to evaluate the Q function.

Just as usual, you could use

either usual Q-learning or maybe double Q-learning or any other hack if you want,

to train your network to predict Q values

with temporal difference error just like before.
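The temporal-difference target built from the Q-values of the final hidden state can be sketched like this; the `td_target` helper and the toy Q-values are illustrative, not the course's exact code:

```python
import numpy as np

def td_target(reward, q_next, gamma=0.99, done=False):
    # One-step Q-learning target: r + gamma * max_a' Q(s', a'),
    # with the bootstrap term cut off at episode end.
    return reward + (0.0 if done else gamma * float(np.max(q_next)))

# Toy Q-values produced from the final hidden state, one per action.
q_next = np.array([1.0, 3.0, 2.0])
target = td_target(reward=1.0, q_next=q_next)   # 1.0 + 0.99 * 3.0
```

The squared difference between this target and the predicted Q-value is then backpropagated through the whole unrolled recurrence.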

Now, a closer look at those formulas would reveal the fact that the only thing that

changed since our usual DQN is that we no longer depend on states directly.

Wherever we had the state, it is replaced with this O of S,

the observation of the state.

Instead of taking just one state, the current ST,

we instead consider all the states from, say, ST

minus 10, or some state in the past, till the current state.

So, it might be a huge sequence if you're going to train it for enough time.

The way it does so is by learning this recurrent formula.

So basically, the Q function is as usual

just a dense layer with one unit per action and a linear activation.

Q depends on HT,

which depends on HT minus 1, HT minus 2,

HT minus 3, yada yada yada until HT minus some fixed amount of time,

which you've decided to stop at.

And each HT in turn depends on its observation.

So, HT depends on observation of ST,

HT minus 1 depends on the observation of ST minus 1.

This is how it works, it's just one huge differentiable formula.

Now, this formula has some parameters, namely those weights.

The weights, the blue matrices here.

The weights from the previous hidden state to the new hidden state, and from the input to the hidden state.

How to train those weights? How do you actually tune

them to make your Q function as accurate as possible?

How do you do that?

Yes, you back propagate.

While this formula might scare all the courage out of you,

it will most definitely be a much easier job for

TensorFlow or Theano or PyTorch or any other automatic differentiation framework,

which will just take the formula and then just

take gradients of it to get the necessary gradients.

Then you can use Adam or RMSProp

or any method you prefer to tune the weights, just as you did for the convolutional network.

So here's how it happens.

Unless you're going to use some huge time frames or train it very extensively,

it will to some extent learn how to use the previous states.

It will learn to remember some useful information and forget the useless one.

But recurrent neural networks have a lot of nasty properties.

For example, if you train recurrent neural networks over long time spans,

if not 10 but, say, 100 previous states,

which might make sense in a lot of situations,

there are two problems: gradient vanishing and gradient explosion.

Gradient vanishing is when by multiplying those gradients,

you run the risk of getting something very close to zero.

Because if just one of the factors, those dHT by dHT minus 1, gets near zero,

then the entire product is going to be close to zero as well.
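The effect of multiplying many such factors is easy to see numerically; the 0.9 and 1.1 factor values here are arbitrary stand-ins for Jacobian norms:

```python
import numpy as np

# Multiplying 100 Jacobian-like factors: values below 1 shrink the product
# toward zero (vanishing), values above 1 blow it up (explosion).
vanishing = float(np.prod(np.full(100, 0.9)))   # 0.9 ** 100, tiny
exploding = float(np.prod(np.full(100, 1.1)))   # 1.1 ** 100, huge
```

Even factors only slightly away from 1 produce gradients that are many orders of magnitude too small or too large after 100 steps.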

The opposite problem is the gradient explosion,

which is when you multiply a lot of large gradients and you get

some cosmic shift in your weights which basically throws them out of float32 range.

Those problems impair the training quite a bit,

so to fight them there are a lot of tricks and a lot of

architectures that have some kind of workarounds built in.

For gradient vanishing, there are LSTMs and gated recurrent units.

For gradient explosion, there's clipping.
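Clipping by global norm can be sketched in a few lines of numpy; the `clip_by_global_norm` helper here is a hand-rolled illustration (frameworks provide their own versions), and the threshold is arbitrary:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    # Rescale all gradient arrays together so their combined L2 norm
    # does not exceed max_norm; leave them untouched otherwise.
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads

grads = [np.array([3.0, 4.0])]                  # global norm is 5
clipped = clip_by_global_norm(grads, max_norm=1.0)
```

Scaling all gradients by the same factor preserves their direction while bounding the size of the update step.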

You probably know more than one way you can do so.

Actually if you're more into theoretical deep learning,

you also know that there are

unitary recurrent neural networks

that have neither of those problems by construction.

So here's how it goes.

You simply introduce a new architecture which has to be trained not just on (s, a, r, s') tuples.

So it has to be trained on observation,

action, reward and next observation.

But this time it has access to all the observations from,

say, 100 or 10 steps ago till now, and runs the recurrent neural network over them.

Here's how it's going to work. Now, there's

one very popular implementation of DQN with a recurrent neural network,

which differs from our previous picture in just a few ways.

First, the recurrent neural network it uses is an LSTM, because of course it is.

Since LSTM is basically the version of RNN which

doesn't suffer from vanishing gradients and has

all those nice, almost interpretable properties with forget gates,

update gates and so on,

and as usual, it just takes the output of the LSTM,

the kind of public recurrent state,

the non-cell part, the h of the LSTM,

and it computes Q-values with a dense layer based on those guys.
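A minimal numpy sketch of this recurrent Q-network, with a plain tanh cell standing in for the LSTM (all sizes, weights, and the `drqn_step` function are illustrative assumptions, not the DRQN paper's exact architecture):

```python
import numpy as np

rng = np.random.default_rng(1)
OBS, HID, N_ACTIONS = 8, 16, 4                 # illustrative sizes

W_o = rng.normal(size=(HID, OBS)) * 0.1        # observation -> hidden weights
W_h = rng.normal(size=(HID, HID)) * 0.1        # hidden -> hidden weights
W_q = rng.normal(size=(N_ACTIONS, HID)) * 0.1  # hidden -> Q-values (dense, linear)

def drqn_step(h, obs):
    # Update the recurrent state, then read Q-values off it with a dense layer.
    h = np.tanh(W_o @ obs + W_h @ h)
    return h, W_q @ h

h = np.zeros(HID)                        # no prior information
for obs in rng.normal(size=(10, OBS)):   # run over a 10-step observation sequence
    h, q = drqn_step(h, obs)
action = int(np.argmax(q))               # greedy action from the last Q-values
```

The only structural difference from a feedforward DQN is that the Q head reads the recurrent state, which summarizes the whole observation sequence, instead of a single frame.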

Okay, like I've just mentioned,

you have to train this network in a special way.

Namely, you have to sample trajectories.

You sample not just a single (s, a, r, s') tuple but subsequences

of those tuples that come one after another.
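Sampling such a subsequence from a replay buffer can be sketched like this; the `sample_subsequence` helper and the toy string observations are illustrative:

```python
import random

def sample_subsequence(replay, seq_len):
    # Pick a random start index so that seq_len consecutive transitions fit,
    # then return that contiguous slice of the buffer.
    start = random.randrange(len(replay) - seq_len + 1)
    return replay[start:start + seq_len]

# Toy replay buffer of (obs, action, reward, next_obs) tuples.
replay = [(f"o{t}", t % 3, 1.0, f"o{t + 1}") for t in range(100)]
chunk = sample_subsequence(replay, seq_len=10)
```

Each sampled chunk is fed through the recurrent network in order, so the hidden state at every step is conditioned on the preceding transitions of that same trajectory.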

Here's when one problem occurs.

The problem is that if you sample trajectories this way,

then you no longer get independent and identically distributed data.

So, technically your sampling is biased and your

optimization is going to be slightly less efficient in this case.

Sometimes those DRQNs are even known to diverge

because there's so much that can go wrong, and something eventually does.

So, basically if you compare the DRQN

vs the usual DQN on the known benchmarks, you'll get something like this.

Sometimes it's better, sometimes it's way better.

Like here. But in some cases you can also see

that the DRQN is not just no better, but actually worse than the original DQN.

This is because it is much harder to actually train,

much more complicated to get to convergence.

We'll also study some tricks to improve this performance

during the next week, when we use policy-based methods.

Because there is a method which is specific and very convenient to them, which

also solves this non-i.i.d. problem just as a side quest.

For now, you can still train DRQN with experience replay with some efficiency.

So here's how you mitigate the problem of POMDP,

the partially observable Markov decision processes.

Of course there's much more to it.

There are special architectures like

the deep neural network equivalent

of a planning model, that allows your agent to think proactively.

There is a lot of cool stuff when you have a model based planning.

You'll find all the links about it in the reading section, so that the

curious among you can have their curiosity satisfied. Until next week.