
Now after we talked about stochastic approximation,

it's time to talk about the celebrated Q-Learning.

One of the most famous algorithms of Reinforcement Learning.

This algorithm was suggested in 1989 in a PhD thesis of Watkins.

And since then, this work has been cited more than a thousand times.

Q-Learning and its extensions are used in

many interesting applications of reinforcement learning.

In particular, Google's DeepMind

became very famous, before they became a part of Google,

when they published a paper showing how to use Q-Learning

at scale to teach a reinforcement learning agent to play Atari video games.

Now, in its original form as suggested by Watkins,

Q-Learning works only in a setting called discrete states and discrete actions.

In this case, instead of being a continuous-valued function,

the Q-function is represented as a discrete table,

with one value of the Q-function for each combination of state and action.

In this case, we can say that the Q-function is given in the tabulated form.

Now, the main statement of Watkins' Q-Learning

is that given enough data,

the algorithm, which takes data points one by one,

sequentially, as in one-draw Robbins-Monro stochastic approximation,

converges to the true Q-function asymptotically, when there is a lot of data.

More specifically, the convergence proof

assumes that each possible state-action combination

is encountered in the data an infinite number of times.

Of course, this is never true in practice,

where all data are finite.

Therefore in practice the question of

numerical convergence is always something that needs to be checked.

So, what does Q-Learning do?

It simply takes the next observed transition and updates the value of the state-action pair

that was observed in this transition.

Let me illustrate how it works on

some popular examples used for testing and teaching Q-Learning.

One such example could be an application of Q-Learning to solve a maze problem.

Here you can see an example of

a simple maze with an obstacle, shown here as a black square,

and an exit in the top-right corner.

The agent's problem is to learn an optimal policy that would

prescribe the direction of the move given the cell location of the agent.

In each cell the agent can move up,

down, left and right.

So, we have four degrees of freedom,

but some moves cannot be made.

For example, a left move is forbidden if there is a wall to the left and so on.

So, in this problem we have 11 possible cell locations,

and up to four possible actions that can be taken in each cell.

And this gives us 44 possible combinations of a cell and action,

which is still a quite manageable number to keep as

an index column in the lookup table that stores

the value of the Q-function for all such possible combinations.

Learning in such simple environments can be done using simulations.

Each time a particular data point is observed, as produced by such a simulation,

it is used to update the value of the Q-function at the node that

corresponds to the particular combination of state and

action that was observed in this round of simulation.
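As a concrete illustration, here is a minimal sketch of tabular Q-learning on a maze like the one described: an 11-cell grid (a 3x4 grid minus one obstacle) with an exit in the top-right corner. The exact grid layout, rewards, and hyperparameters are my own illustrative assumptions, not the ones from the lecture slides.

```python
import numpy as np

# A minimal sketch of tabular Q-learning on a small maze (assumed layout:
# 3x4 grid, one obstacle, exit in the top-right corner, so 11 usable cells).
rows, cols = 3, 4
obstacle = (1, 1)            # the black square
exit_cell = (0, 3)           # exit in the top-right corner
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

# Lookup table: one Q-value per (cell, action) combination;
# the obstacle cell is simply never visited.
Q = np.zeros((rows, cols, len(actions)))

def step(state, a):
    """Move if the target cell is inside the grid and not the obstacle."""
    r, c = state
    nr, nc = r + actions[a][0], c + actions[a][1]
    if not (0 <= nr < rows and 0 <= nc < cols) or (nr, nc) == obstacle:
        nr, nc = r, c        # forbidden move: stay in place
    reward = 1.0 if (nr, nc) == exit_cell else -0.04
    return (nr, nc), reward

alpha, gamma, eps = 0.1, 0.9, 0.2   # assumed learning rate, discount, exploration
rng = np.random.default_rng(0)

for episode in range(500):
    state = (2, 0)           # start in the bottom-left corner
    while state != exit_cell:
        # epsilon-greedy exploration, so all (state, action) pairs keep being visited
        if rng.random() < eps:
            a = int(rng.integers(len(actions)))
        else:
            a = int(np.argmax(Q[state]))
        next_state, reward = step(state, a)
        # the Q-update: old estimate scaled by (1 - alpha), plus alpha times
        # the newly observed sample of the Bellman right-hand side
        target = reward + gamma * np.max(Q[next_state])
        Q[state][a] = (1 - alpha) * Q[state][a] + alpha * target
        state = next_state
```

Each simulated transition updates exactly one (cell, action) entry of the lookup table, which is all that tabular Q-learning does.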

Now that we have spoken about what Q-Learning does,

let's talk about how it does it.

In essence, Q-Learning is just an application of Robbins-Monro stochastic approximation,

But this time to estimate the unknown expectation that

arises in the right hand side of the Bellman optimality equation.

Let's compare the Bellman optimality equation with the Robbins-Monro update.

In the Bellman equation,

the optimal Q-function is given by the expectation of the right hand side.

On the other hand, the Robbins-Monro formula,

just shows how to update a running estimate of the mean.

Therefore, we can take the sum of the currently observed reward R_t,

and the discounted max of the next-step optimal Q-function,

as the current observation X_t in the Robbins-Monro formula.

The role of the current estimate of

the mean will be played by our current estimate of the optimal Q-function.

If we specify some scheduling scheme for the learning rate alpha_k,

these substitutions will produce

the famous Q-update rule of Q-Learning that is shown here at the bottom of the slide.

In words, it says that the new value of the Q-function at the node (X_t, a_t)

is given by the previous estimate of its value, scaled by the factor one minus alpha_k,

plus another term equal to alpha_k times the newly observed value of

the quantity that appears on the right-hand side of the Bellman optimality equation.
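Since the slide itself is not visible in the transcript, here is a reconstruction of the standard Watkins Q-update rule being described, in the notation used in the narration (state X_t, action a_t, reward R_t, learning rate alpha_k, discount factor gamma):

```latex
Q_{k+1}(X_t, a_t) \;=\; (1 - \alpha_k)\, Q_k(X_t, a_t)
\;+\; \alpha_k \left[\, R_t + \gamma \max_{a} Q_k(X_{t+1}, a) \,\right]
```

The first term is the previous estimate scaled by one minus alpha_k; the bracketed term is the newly observed sample of the right-hand side of the Bellman optimality equation.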

Now the Q-iteration rule shows two very important things about Q-Learning,

namely, that it's both a model-free and an off-policy algorithm.

This means that it does not make any assumptions about the

data-generating process that produces the current observation.

It simply takes it as given,

and updates the action-value function for the given state-action node (X_t, a_t).

It's also an off-policy algorithm because

the only thing that matters for Q-Learning to work is that

all state-action pairs are visited many times.

But why they were visited does not matter.

In the limit when all state-action pairs are

encountered an infinite number of times in the data,

Q-iteration is guaranteed to asymptotically converge

to the true optimal value function for all such pairs.

You can read more on the convergence properties of

Q-Learning algorithms in the weekly reading for this week.

Now, given what we discussed in

the previous lesson on the dynamic programming solution to the model,

it should be evident that even though

the classical online Q-Learning algorithm is guaranteed to asymptotically converge,

it might simply take too long for practical purposes.

And the reason for this is that

optimal hedges are obtained using

cross-sectional information across all Monte Carlo paths.

But such cross-sectional information would be missed by

any online method of updating Q-values and optimal hedge ratios.

However, the way out in this case is also clear.

What we have to do is simply update the Q-values and

the values of a-star on all Monte Carlo paths simultaneously,

as we did when we computed optimal hedges in the previous lesson.

Because we are now in the setting of

batch-mode reinforcement learning, the data are already there.

And there is no need actually to take data points one

by one as is done in the classical Q-Learning.
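To make the contrast with online updating concrete, here is a minimal NumPy sketch of a single Q-update applied to all Monte Carlo paths at once. The array names, shapes, and hyperparameters are illustrative assumptions; the actual batch formulation is developed in the next video.

```python
import numpy as np

# Illustrative stand-ins for quantities defined on N Monte Carlo paths at one
# time step t (these arrays and hyperparameters are assumptions, not the
# course's exact formulation).
n_paths, n_actions = 10_000, 3
rng = np.random.default_rng(1)

Q_next = rng.normal(size=(n_paths, n_actions))  # Q(X_{t+1}, a) on every path
rewards = rng.normal(size=n_paths)              # observed reward R_t on every path
Q_cur = np.zeros(n_paths)                       # current Q(X_t, a_t) on every path

alpha, gamma = 0.1, 0.99

# One Q-update applied simultaneously to all paths, instead of looping over
# transitions one by one as in classical online Q-learning:
target = rewards + gamma * Q_next.max(axis=1)
Q_cur = (1 - alpha) * Q_cur + alpha * target
```

A single vectorized line replaces ten thousand sequential one-draw updates, which is what makes the cross-sectional information across paths usable.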

In the next video,

we will see how the classical Q-Learning can be adjusted for such batch-mode learning.