0:02

In the previous videos, we become familiar with

the policy evaluation and policy iteration algorithms.

And now, we are going to fit all the puzzles together,

to get an algorithm for finding optimal policy in a

model-based setting for variant market decision process.

In fact, we already know all we might need to define such algorithm,

which is called generalized policy iteration.

Generalized policy iteration is one of

the most fundamental ideas in reinforcement learning.

This idea can be recognized in almost any reinforcement learning algorithm,

and thus, it is very important to understand.

In its simplest form, it tell us that

inter-leaving policy evaluation with policy improvement,

will eventually make our policy an optimal one.

In fact, when we think a little about these procedures,

we realize that they are both competing and cooperating. Why do they compete?

Is self-evident. That is when we evaluate a policy,

we make it no longer to be greedy with respect to it's value function.

And where we make a policy greedy with respect to its value function,

we change it's value function.

This competing behavior is all the more interesting in that,

it eventually drives a policy to become an optimal one.

That is to satisfy the Bellman optimality equation.

And that is why one could say that

these two procedures are also cooperating with each other.

But note that this generalized policy iteration idea,

doesn't tell us how many times should we call

policy evaluation procedure in-between of policy improvement goals,

in what order should we update state values, and so on.

In theory, all the details are not important as long as

policy evaluation and policy improvement continue to interact with each other.

In that sense, the generalized policy iteration is a very robust idea.

What do I mean by that robustness?

Well, generalized policy iteration result policy,

does not depend on initialization.

It is not susceptible to local optimus.

It does not need complete policy evaluation.

That is evaluation is needed not to be complete,

neither in the sense of updating all stages during

the policy relation procedure nor in

the sense of complete convergence of policy evaluation.

It also does not need to improve policy in all states at any particular step,

as long as it guarantee that policy will be

updated in each and every state once in a while.

So, just for an example of what does generalized policy iteration allow one to do?

Think of updating only one state at each policy evaluation step.

And in this case, generalized policy iteration will converge to globally optimal policy.

To make things even more interesting,

think about updating it in a random direction,

which is correct only in the expectation,

and generalized policy iteration again,

will converge to globally optimal policy.

Isn't that impressive? That even under

these conditions generalized policy iteration is guaranteed to work well?

Surely in practice, it will converge the faster,

the less obstacles it counters.

To further get you a bit of intuition of how different

could generalized policy iteration versions be,

we are going to talk about two extreme instances of generalized policy iteration.

First one is called policy iteration and the second is called value iteration.

These two approaches differ in how policy evaluation stage is performed.

In brief, policy iteration requires precise evaluation of policy,

before each policy improvement step.

That is why I need to evaluate a policy,

until a numerical convergence of it's

state values and before doing a single policy improvement.

At the other extreme, consider an algorithm which

performs only one iteration of policy evaluation,

between any two successive calls of policy improvement procedure.

This slider algorithm is called value iteration and is also very widespread in use.

On this slide, you can see a pseudocode of policy iteration algorithm,

consisting of three stages.

It's first stage is an initialization of state values.

At this stage, it is very important to initialize

value of alternative states, equal to zero.

The second stage is basically

a policy evaluation procedure which we have already encountered before.

The third stage is already familiar to us,

policy improvement procedure, with only one extension.

In particular, after policy improvement procedure,

we check whether a new policy is equal to the old one.

If it is equal, then the state values of these policies are also equal.

And thus, we have found an optimal policy,

which satisfies the Bellman optimality equation.

If this is now the case,

then we need one more policy evaluation,

and one more subsequent policy improvement.

Another very important instance of generalized policy iteration is value iteration.

In fact, keeping assignal book keeping such as

state bill of value initialization and termination criteria,

value iteration is essentially a one line assignment.

This assignment effectively combines

one truncated step of policy evaluation and full policy improvement step.

One intuition about this assignment is that it is precisely equal to

the Bellman optimality equation for v star turn into assignment operation.

A distinctive feature of this algorithm is that it

does not explicitly stores the probabilities of actions for each state.

That is it does not require an explicit policy to perform iterations.

The policy however, could be recovered from

value function using that arg marks of q function,

and this q function in turn is recovered

from value estimates and transient probabilities.

This policy recovering step is a very final step in the value iteration algorithm.

Please know that in any intermediate step of this algorithm,

the computed value of function,

may not correspond to any possible particular policy,

that is there may be no policy at all that has such value function.

No policy by which with v pi equal to v from value iteration step.

And this is so by design because,

any intermediate v function is just an approximation of

optimal v star but not v pi for some policy.

Well, we have introduced and discussed

two approaches for finding optimal policy with dynamic programming.

One is policy iteration and another one is value iteration.

Do these approaches differ in the amount of time spent

in policy evaluation phase between two successive policy improvements?

Policy iteration performs iteration till the convergence,

whereas, value iteration performs only single step of policy evaluation.

Do these approaches along with

the generalized policy iteration are fundamental to the field of reinforcement learning?

In future years, you will encounter algorithm that will

look like policy and value iteration with some minor changes.

For example, you'll definitely encounter the q learning which is

the value iteration algorithm applied to the q function instead of v function.

But how to compare it to these approaches which one is better?

Well, this is a question without a clear answer.

One cycle of value iteration is faster than one cycle of policy iteration.

But value iteration, it requires much more cycles than policy iteration.

One might think that the truth is somewhere in between all the two extremes.

Well indeed, it might be useful to experiment with

the number of steps spent in policy evaluation.

Somewhere between one and infinite number of steps,

there might be a best algorithm for your task at hand.

This concept of intermediate number of steps, say,

k steps spent in policy evaluation,

has its own name, it is called modified policy iteration.

And as I have said previously,

this may be the way to go for your particular application.