0:02

Well, now let's proceed with the loss.

Once we have defined the loss,

we might want to minimize it.

One of the most simple and most widespread method of

minimizing the loss is by a means of gradient descent.

That is where differentiate our loss with respect to parameters w

and change our parameters in the direction of minus gradient with some stepsize of alpha.

Why minus gradient?

Well, because the gradient is defined as direction of the fastest function increase.

The opposite direction that is minus

a gradient shows the fastest decrease of the function.

And this allows us to change the parameters w so

that our loss function decreases in the fastest way possible.

However, to update parameters w with gradients descent,

again we should differentiate the whole loss.

And we know that we actually cannot even compute the loss.

We have only is sample-based estimate.

Well, in fact, what we can do we can

approximate a true gradient with it's Stochastic estimate.

That is we approximate the full sum with one component of that sum.

And this leads us to Stochastic gradient descent.

In SGD, Stochastic gradient descent,

we approximate the full gradient with it's

estimate in particular state and action as a day.

This state in action just in the same way

as we have discussed previously are sampled from

rho pi or rho behavior policy depending on whether we learn on or off policy.

To reiterate, sampling state and action from rho,

is in practice done with picking state and action

from Asian's experience of direction with an environment.

In practice, however, one place is one sample estimate

of the gradient with more stable estimate using a couple of samples,

not only the single one.

For now, we have talked about a gradient but have not showed how to compute it.

In fact, for application of SGD,

we should only know how to compute gradient of L supp s

and a which is a square root difference between the goal and our current estimate of it.

Why are we talking about it?

It isn't the square root difference easy to differentiate? Well, it is.

What is more tricky here is the dependence of goal on parameters w. These goals,

as I have mentioned, are simple numbers in case of Monte Carlo targets.

But there are a function of parameters w in case of temporal difference targets.

And we should differentiate them with respect to parameters w. This is

mathematically correct way to the learning but

it interferes with natural understanding of the task.

If it differentiate goals with respect to parameters,

we'll kind of change the natural flow of time.

That is, we will not only make value estimate of current state and

action look more similar to our target as a cumulative for war.

But we also will make the subsequent estimate of

return be dependent on the previous rewards.

That is not what we want in general.

Thus, we introduce a so-called semi grading methods which treat

goal as fixed and this for any particular type of goal,

that is the gradient of goal with respect to the parameters is equal to zero.

This semi-gradient approach is very similar to what is going on in usual

supervised learning where the goals are almost always fixed numbers.

Bootstrapping methods such as

temporal difference learning are not in fact instances of true gradient descent.

They include only part of the gradient.

Accordingly, we call them semi-gradient methods.

But on the other hand,

they simplify math alot and are shown to work well in many practical tasks.

Let me now summarize what are the properties of semi-gradient methods.

The essence of the semi-gradient update is that it treats goals g(s,

a ) as fixed numbers.

Like gradient update, this kind of

update change parameters in a way that moves estimates closer to targets.

But unlike gradient update,

it completely ignores effect of update on the target.

And because of this semi-gradient is a proper gradient.

It doesn't possess the convergent properties of Stochastic gradient descent,

but it convergences reliably in most cases.

It is also more computationally efficient and faster than true SGD.

Having said all this,

semi-gradient base are meaningful thing to do

because there are a type of parameter of update correspond

to symmetric structure of the task that is the time always goes only forward.

Let now return to the target definitions.

In fact, targets are deeply connected with the algorithm names.

More to say, these targets define what estimate do we learn and thus,

are very important to understand.

In SARSA, our target is current rewards,

that is immediate R(s,

a ) and gamma times our estimate

of actual value function in the next state and the next action.

This next action is basically sample from our policy pi.

5:26

In expected SARSA, we see almost the same update.

That is the same error of s and a.

But now the second term is an expected actual value function in the next state.

Whereas expectation is taken with respect to our policy probabilities.

In Q-learning, we see that our goal is a little bit different.

We have the first term is the same as in SARSA and Expected SARSA but the second term

is a maximum over all actions of actual value estimate of the next state an action.

Note that g(s, a ) in each of

these cases is a a random variable because it depends on the next state S_t+1.

And additionally, on a_t+1 in case of SARSA.

These doubles Stochastity of SARSA target is not a good thing, obviously.

And this is the main reason for why I suggest

using Expected SARSA instead of SARSA when it is possible.

Now, let's look a bit closer to each of

these targets and elaborate a bit on their origins.

You have already seen that tabular version of SARSA.

SARSA algorithm is an example of application of Bellman expectation equation.

However, in Bellman expectation equation,

we do what is called full-width backup.

That is we account for all the possible next states that we could find

ourselves in after committing action a instead of s. And

we also account for all possible actions our policy could take in each

of these states by taking an expectation with respect to our policy probabilities.

In the real world, we could not take an expectation with respect to

model dynamics because we don't usually know the transition probabilities.

However, we could estimate this expectation with Monte Carlo by taking samples.

That is by observing in what state do we end up after making action

a instead of s. We also could estimate the lower level expectation with samples.

That is the expectation or Stochasticity of our policy in the next state.

These are two self-sources of randomness we haven't garnered on previous slide.

Now, look at the approximates or sub date rule.

Well, these approximate SARSA looks pretty much the same as it's

tabular counterpart with the only difference is a loss multiplicative gradient term.

So, cool. Now, you know what is approximate SARSA

and that this approximate SARSA is very much similar to the standard tabular updates.

Well, as I have already mentioned,

the sources of randomness in this SARSA goal is usually not a good idea.

Can we really move some Stochasticity out of the goal estimate?

Well, yes.

And this leads us to the SARSA's closest cousin called Expected SARSA.

If we abandon the approximation on the bottom layer of the backup tree,

we will obtain the Expected SARSA algorithm as a machine

or as a backup is precisely the same as for SARSA algorithm.

The only thing that changes is form of update in the on the bottom level.

Now, it includes the expectation of

actions coming from both pi in the next state S prime.

Contrasted with SARSA where there were no expectation at

the bottom level but only a sample-based estimate of this expectation.

Expected SARSA obviously improves upon SARSA

because it gets raised off additional approximation.

And because of this fact,

I usually recommend using Expected SARSA instead of SARSA whenever possible.

The update of an Expected SARSA is almost the same as for SARSA and

algorithm differs only in the form of goals that current estimates are aggressed on.

Now, what will happen is the policy with respect to which we take an expectation on

the lower level of full-width backup is really one.

That is assigns all the probabilities as

an action which has a maximum Q value of the next state.

In fact, that is precisely the way Q-learning can be derived from.

Let us now revisit our Q-learning a little bit.

Please note that expected SARSA is based on

the Bellman expectation equation while

the Q-learning is based on the Bellman optimality equation.

Again, we witness that two of these concepts are

deeply interconnected and we can seamlessly interplay between

them by making an arbitrary policy pi to

become greedy with respect to current estimates of actual value function.

On the slide, you can see the form of

full-width backup and sample-based estimate of this backup.

Again, what do we see here is that we replay

the full-width backup of Bellman optimality equation with it's sample-based analog.

We do so to obtain an approximate semi-gradient Q-learning.

Also as a form of

approximate Q-learning update doesn't differ much from the tabular setting.

Well, the schema for replacing the full-width backup with

a sample-based analog is almost the same for all methods which we have discussed for now.

And I hope you got the idea.

What if we have discussed for now is a pure theory of different algorithms.

But we have not talked about application for approximate learning

in practice and practice is also very important.

In the next videos, you will get familiar to some of

the most important problems mostly coming from

practical application of reinforcement learning methods.

So, stay tuned.