Now, let's discuss the Bellman Equation in more details.

After we understand how we can work with it,

it will make it easier to understand what exactly Reinforcement Learning does.

So, let's start with the point where we left in the last video.

We introduced the notion of the value function V(s) which also depends on policy pi.

The value function V(s) satisfies the Bellman equation which is again shown here.

It involves the value function both from the left and on the right of the equality sign.

The value in the left hand side is the value function in the current state s.

The values in the right hand side are

the value functions that will be obtained in these other states.

As by definition, the value function measures the quality of a given policy.

We can define their optimal value function as

marks of all value functions over all possible policies.

We will denote this optimal value function as V*.

If we, now, apply a max separation to both sides of the Bellman equation

within a Bellman equation for

the optimal value function which is shown in the last line here.

Please note that this equation involves only the value function.

But actually, our goal is in the Reinforcement Learning is to find the optimal policy,

not the optimal value function.

The optimal policy can in fact be read from the second term in the Bellman equation.

The optimal policy pi star is simply the policy that maximizes this term.

So, we get the definition of the optimal policy as an arg max of this expression.

Finally, I would like to comment on the meaning of the optimal policy.

The optimal policy is a function prior first

that gives the best possible action in all states of the world.

That is for all possible values of s. Therefore,

it means that V star should be greater or equal than

the value function for any other policy for all states of the world.

Now, let's look about how we can

solve the Bellman equation for the optimal value function.

One of the simplest methods that works well as long

as the state action space is discrete and small is called a value Iteration.

Here is how it works.

We start with the initialization of

the value function at some initial values for all states.

Usually, initialization at zero for all states works fine.

Then we keep iterating

the calculation of the value function using the Bellman equation itself,

where for each iteration,

we use the result of the previous iteration to compute

the right hand side of the equation and reassign it to the left hand side.

Now, there exist several ways to update the value function in such value iteration.

One way in computing the value function for all states in

our state space and then update the value function in all states at once.

This is called Synchronous Updates.

The other way is to update the value function on the fly,

as it's been recomputed in the current iteration.

This is called Asynchronous Updates.

For either way of updating,

it can be proven that the algorithm converges to

the optimal value function that we will again call V star.

Now, after V* is found,

the optimal policy can be found using the same formula as before.

As you can see, the basic algorithm is very simple and works like a charm,

well until your actual space is discrete,

and has a small number of states.

If this is not the case,

some other approximate methods are needed to find the optimal policy.

Or when you are to solve the Bellman equation for the optimal value function.

But before moving to this,

I want to briefly discuss another classical method called the Policy Iteration.

Here's how it works.

We start with initialization of the policy pi.

Let's call this initial policy pi note.

Usually pi note is initialized randomly.

After this we repeat the following loop

with the two calculations at each step until convergence.

The first calculation is called Policy Evaluation.

Here we compute the value function for a given policy for this iteration.

We can do this by using the Bellman equation for V,

not the Bellman equation for the optimal value function V*.

As the Bellman equation for V is just a linear equation,

this step is straightforward.

The second calculation is called policy update.

This is done again by applying

the arg max operator to the right hand side of the Bellman equation.

Please note that these two calculations which we have to make

repeatedly within policy iteration methods have quite different computational burden.

The policy updates step is straightforward

and can be done using standard optimization software,

but the policy evaluation step involves solving the Bellman equation.

And if the dimensionality of the state space is large,

doing it repeatedly in the process of policy evaluation can be quite costly.

Yet many practical problems of optimal control in

robotics supply chains utilities and so on involve

large discrete state actions spaces or even continuous state action spaces.

In these settings, good old methods for optimal control based on

dynamic programming of Bellman introduced in 53,

and algorithms like value iteration or policy iteration do not work anymore.

Reinforcement Learning grew out of dynamic programming as a way

to solve practical industrial problems of optimal control,

where dynamic programming was inadequate.

Let's talk about Reinforcement Learning in

the next page after our traditional quick Q and A minute.