Let's now focus on how to use these ideas to actually learn the Q-function.

For now, you only have access to trajectories,

and not the whole set of possible states and action transitions.

Now, the most obvious, but not the only, approach you can try

is to sample trajectories,

and then average over all the trajectories that pass through a particular state.

So, sampling Breakout sessions.

In this case,

let's say you've managed to sample a million sessions.

And of these million sessions you get 10 sessions where you end up in

a state where your paddle is on the right and the ball comes from the left,

and say a certain part of the brick wall is destroyed, and all the other features of the state match.

In one of those sessions you might be lucky and bounce the ball so that

it hits two bricks before it comes back.

Now, in another session you might be less lucky and miss the ball.

Sometimes you do, because your agent might not be perfect.

You've lost a life, and your average return for that session was lower.

You might even get a session where you are super lucky and

you're able to bounce the ball

behind the brick wall, which gets you a lot of points in Breakout.

So if you're not familiar with the rules of Breakout,

just believe that there are many possible outcomes.

The idea is that for this particular state you average over those outcomes to get

an estimate of the expectation over

possible outcomes that appears in the dynamic programming formula.
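As a minimal sketch, this Monte Carlo idea could look like the following (the trajectory format and helper names here are illustrative assumptions, not code from the course):

```python
from collections import defaultdict

def mc_q_estimate(trajectories, gamma=0.99):
    """Monte Carlo estimate: average the discounted return observed
    after each (state, action) pair over all sampled trajectories."""
    returns = defaultdict(list)
    for traj in trajectories:  # traj is a list of (state, action, reward)
        g = 0.0
        # walk backwards so each step's return includes everything after it
        for state, action, reward in reversed(traj):
            g = reward + gamma * g
            returns[(state, action)].append(g)
    # Q(s, a) is approximated by the mean of the sampled returns
    return {sa: sum(gs) / len(gs) for sa, gs in returns.items()}
```

The catch, as discussed next, is that each exact (state, action) pair needs enough visits for its average to be reliable.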

Now, the nuance here is that

it does work, technically.

Statistically it may not be the best estimate, but

it will converge provided you give it a lot of samples;

in practice, though, that amount of samples might be unrealistic.

As I said in this example,

you take a million samples,

and out of a million samples there might be just 10 where you've ended up in

this particular state, with this particular arrangement of

bricks and this paddle position and ball position;

all the other samples differ just a little bit, but they are different states nevertheless.

So again, if you have a simple problem with not

that many states, or some special exotic case you can exploit,

and if you more or less,

basically, if you know what you're doing,

you can apply this idea:

you simply sample a lot of times and average.

But we're actually going to find out that there is a better way,

a better way in terms of how it converges, and for more practical problems.

So here it goes: the second method is called temporal difference.

The idea is that it's going to exploit the recurrent formula for Q-values.

You have the Q-value, and the left-hand side is what you currently know about the Q-value,

while the right-hand side, with the expectation over all the possible next states,

is the kind of refined version of the Q-value that you

would have used in your value iteration,

in your dynamic programming based methods.

What you can do is use this idea:

use your current estimate of the Q-values on the right-hand side,

taking the maximum over those Q-values,

to refine the Q-value at a different state.

And to do so, you only need

a state, an action, a reward, and the next state. That's how this goes.

If the idea is clear, let's

actually highlight some of the details that are hidden in the formula.

You have this maximization term here,

and you've probably noticed the notation Q-star,

the action value of the optimal policy.

Since the optimal policy takes the action with the highest Q-value,

you can replace the value

of the next state with a maximization over all possible actions.

And then you obtain the recurrent formula for Q-star.

That's more or less copy-pasted from the previous week.
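In symbols, the recurrent formula being discussed (writing $r$ for the immediate reward and $\gamma$ for the discount factor) is:

```latex
Q^*(s, a) \;=\; \mathbb{E}_{s' \sim P(s' \mid s, a)}
\left[\, r(s, a, s') + \gamma \max_{a'} Q^*(s', a') \,\right]
```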

Now, another problem with this formula is the expectation you may have noticed here.

Basically, it requires you to average over the possible outcomes, and so on.

In this case, again,

we cannot compute it explicitly.

So, we have to use some other trick instead.

Which one? As usual,

the only way we can so far get away with something we cannot compute explicitly,

if it's an expectation, is to approximate it.

We approximate it by sampling.

In this case we cannot take many samples per visited state;

along a trajectory, the only option we may

have is to take one sample, because there's no going back.

And we take this one sample from

our environment by playing the action and seeing what happens,

to somehow estimate this expectation.

But we cannot use it as the actual Q-value, because it's far too noisy.

The other important trick we can use here

is the so-called exponentially weighted moving average.

The basic idea is this:

instead of replacing the Q-function with this noisy but unbiased value entirely,

that is, assigning the noisy value to the Q-function

and forgetting everything there was before the update,

you can use a smoother version of the update.

So take your new, noisier estimate,

this is the left part of the right-hand side,

sorry for the clumsy description, and you

multiply it by, say, 0.1; this is the value of alpha.

So you take 0.1 of your new noisy estimate,

plus 0.9, that is one minus 0.1, times the previous estimate you got.

Initially it will be 0.1 times your new estimate

plus 0.9 times zero, because you start with zeros.

Once you've got a new estimate of the Q-function,

this moving average will keep it from being too noisy.

It will allow you to gradually converge to

the average, because every time you take one sample,

you trust it only a little bit, and you kind of average over time.

So, this is how you basically solve the problem of having only one sample, only one trajectory.
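The exponentially weighted moving average update above could be sketched like this (alpha = 0.1 is just the example value from the lecture):

```python
def ema_update(old_estimate, new_sample, alpha=0.1):
    """Move the running estimate a small step toward the new noisy sample.
    With alpha = 0.1 the update keeps 90% of the old value."""
    return (1 - alpha) * old_estimate + alpha * new_sample

# Starting from zero, repeated updates gradually converge to the sampled value.
q = 0.0
for _ in range(100):
    q = ema_update(q, 10.0, alpha=0.1)
```

A single noisy sample only nudges the estimate, so the noise averages out over many updates.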

So now let's go over this algorithm step by

step, to give you a better idea of how to implement it.

And spoiler alert: the implementation is part of your homework,

so please direct your attention there.

What you require here, as usual, is a trajectory of state, action, reward,

next state, next action, and so on.

And to train via Q-learning, to perform

one elementary update, you only need a small section of the trajectory.

You need a state, an action,

the immediate reward for taking this action in this state,

and the next state sampled from the decision process.

These are the s, a, r, s prime,

and to utilize them you simply plug them into the formula we've just obtained.

So, first we need to obtain the value of the next state,

the Q-value-of-the-next-state part of the formula.

To do so, we have this next state,

and we know the set of all possible actions available, because

the action set is usually a constant that we know in advance.

What you do is take all the known actions,

and compute the Q-values of those actions in the next state.

Since we are trying to compute Q-star,

what we do is take the maximum here.

So basically we take Q of s prime and a1,

s prime and a2, and so on over all the available actions,

and then we maximize.

This is basically the V-star of s prime,

if you wish, the mathematical way of saying it.

Now you multiply this by gamma,

because that's how discounted rewards work,

and you add the immediate reward.

This way you obtain one sample from

the expectation that was part of our original formula.

Now you can update your Q-function by moving it a little nearer to this new sample,

via the exponentially weighted moving average,

plugging into the formula as usual.

So you take the reward plus gamma times the maximum over Q-values at the next state,

and you multiply it by, say, 0.1;

after that you take your previous guess and multiply it by one minus 0.1,

that is 0.9; add those things up, and that's your new estimate of your Q-function.
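Putting those steps together, one elementary Q-learning update on a single (s, a, r, s') tuple might look like this minimal tabular sketch (the dictionary representation and argument names are assumptions for illustration):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step:
    Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))."""
    # value of the next state: maximize the current Q over all known actions
    v_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    # one sample of the right-hand side of the recurrent formula
    target = r + gamma * v_next
    # exponentially weighted moving average toward that sample
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target
    return Q
```

Unseen (state, action) pairs default to zero here, matching the "start with zeros" initialization mentioned above.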

This algorithm has a number of super interesting consequences.

First, since it learns from partial experience, from just this tuple of s,

a, r, s prime, you can train your algorithm even before you've ended the first session.

Say you're training your robot to walk forward.

What you would do is reward it for simply

moving forward, even before it reaches the five-minute threshold of a session.

What's going to happen then is that your robot,

starting randomly, regardless of what it does, is probably going to

end up in some kind of cycle, because that's how walking usually happens.

So, even before it finishes walking forward,

it's probably going to find itself in the same state at

least a few times, maybe once per loop.

So once per step,

it's going to find itself in a position where, say,

its right leg is up and it's about to make the step, or something similar.

So even before you finish your very first trajectory,

you're probably going to end up with Q-values that are better than random.

And this is super neat, because sometimes you have infinite processes:

sometimes you want to walk forward indefinitely, or you want to

optimize some quantity with no apparent session boundary.

And you can just feed this into Q-learning,

and Q-learning is going to train to converge to the optimal policy

before the session finishes, or even without it finishing at all. How cool is that?
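This within-episode learning amounts to running the elementary update at every step of the loop, not after the episode ends. A hypothetical sketch (the environment interface and greedy action choice are assumptions, reusing the tabular update from above):

```python
# Hypothetical online loop: the Q-table is updated after every single step,
# so learning happens inside the episode, even if it never terminates.
def run_online(env, Q, actions, n_steps, alpha=0.1, gamma=0.99):
    s = env.reset()
    for _ in range(n_steps):
        # greedy action under the current Q estimates (exploration omitted)
        a = max(actions, key=lambda a2: Q.get((s, a2), 0.0))
        s_next, r = env.step(a)
        # elementary update from the single (s, a, r, s') tuple
        v_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
        Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * v_next)
        s = s_next
    return Q
```

Because each update uses only one transition, the loop never needs the episode to end before improving the policy.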