[MUSIC] Now, our main protagonist for this and a couple of the next sections will be the so-called policy gradient. The idea behind it is that instead of resorting to cunning heavy wizardry methods, like picking top sessions, calling them elites for some reason, and just training on those sessions, maybe you'd be better off if you just tried to write down what you want to optimize, compute the gradient of this thing, and then follow through with gradient descent or backpropagation in neural networks.

Let's try to see how it works one step at a time. Let's begin by introducing the notion of the so-called expected reward. You've probably used this thing on more than one occasion already, but this time we're going to have to explicitly write it down and make some mathematical derivations with it. So the idea of expected reward is that it is, well, what it seems to be: it is an expectation over the rewards that you're going to get if you play a lot of sessions with your agent.

Now, during our first week, we had an expected reward without gammas, without discounts, and this was just the expected sum of rewards over a session. There's also the expected discounted reward. It follows the notion of discounted reward from weeks two and three, and Q-learning, value iteration and so on: it's simply a sum of your rewards, each multiplied by gamma to the power of how much time has passed up to that moment. The idea here is that you can try to maximize the average discounted reward over states, or you can try to maximize the discounted reward you get starting from the first state.

So let's start with the undiscounted version of our rewards, because it produces simpler math. Even further, let's simplify our problem by pretending that there is only one step per session. So your agent gets born in some random state, it takes one action, it observes the reward, it may train on this reward, and then a new session begins. So there is no continuation within one session.

If we write down what's actually hiding behind this neat expectation, you'll see the nasty double integral here. There is an integral over all possible states, weighted by the probability of visiting each state. Then there is an integral over actions, weighted by the probability of taking each action, and then you multiply by the reward. This is how a mathematical expectation works in the general case where you have an infinite, continuous amount of options. Of course, if you have, say, five or six actions, then it's okay to replace the integral over actions with a sum.

If you're still feeling lost, let's walk through this formula step by step. The first thing here is the state visitation frequency, the probability of visiting state s. In this one-step case, it's not dependent on the agent. However, if you use the more complicated notion of a multi-step process, then visiting a state is something the agent can influence, so this term would also depend on the agent's policy. The second thing is the pi factor, the probability of the agent taking action a in state s. This is what your agent actually influences here: it's your neural network's prediction, or maybe a row of the table of all action probabilities. The final term is the reward; since there's still just one time step, you can use whichever notion of reward you like here. Now remember, this is the thing we're trying to maximize. But so far, it's hard to even compute, let alone optimize.
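In case it helps to see that double integral written out, here is one way to put the one-step expected reward on paper. The symbols are my own shorthand for what the narration describes rather than a quote from the slides: p(s) is the distribution over starting states, pi_theta(a|s) is the policy with parameters theta, and R(s, a) is the reward for taking action a in state s.

```latex
\documentclass{article}
\usepackage{amsmath, amssymb}
\begin{document}
% One-step expected reward: average over the states the agent may start in,
% then over the actions the policy pi_theta may take, of the immediate reward.
\[
J(\theta)
  = \mathbb{E}_{s \sim p(s)}\, \mathbb{E}_{a \sim \pi_\theta(a \mid s)} \bigl[ R(s, a) \bigr]
  = \int_{s} p(s) \int_{a} \pi_\theta(a \mid s)\, R(s, a)\, da\, ds
\]
% With a handful of discrete actions, the inner integral becomes a plain sum:
\[
J(\theta) = \int_{s} p(s) \sum_{a} \pi_\theta(a \mid s)\, R(s, a)\, ds
\]
\end{document}
```

The policy pi_theta is the only place where the parameters theta enter, which is exactly the term the agent controls.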
So if you try to explicitly compute the value of this J, you'll have to sum over all possible Breakout frames and sum over all possible actions; if there's a continuous amount of actions, you're screwed. And then you'd have to compute the rewards for a whole session starting from there. So surely there is a simpler way you can estimate this thing, at least approximately. How would you do that?

Well, remember, this is still a mathematical expectation. And instead of trying to compute the expectation exactly, you can simply approximate it with sampling. So you play, let's say, 1,000 games, and you average over all the states and actions you have met in those games. So finally, some good news has arrived: you can at least approximately estimate the value of this J by using Monte Carlo sampling like this.

Now let's go to the second stage. We want to optimize J, so we want to compute the gradient of J with respect to the parameters of your policy. And to do so, we want to compute the gradient of the sampled, approximate version, because the original version is too hard to compute. The question is, remember we had all those cool frameworks for neural networks, like TensorFlow. Can we use them now? Can we maybe use TensorFlow to explicitly compute the gradient of this formulation of J with respect to theta, the parameters of the policy? Or is there something that prevents us from doing so?

Well, the problem with TensorFlow here is that it can only compute derivatives with respect to something that is in the formula. And if you take a closer look at this particular Monte Carlo sampled version of J, you'll notice that it has no place for the parameters, for the thetas, in it. The problem is that this sampled definition does depend on the parameters, on the policy, but it depends on them only through the sampling. You have sampled the sessions from your policy, but you no longer remember the probabilities with which the actions were taken. So no, TensorFlow won't be able to compute this thing for us. And in fact, not only TensorFlow: any mathematician, if you give him the second formula only, will probably call you a jerk if you ask him to differentiate it with respect to theta.

Okay, so we have to do something else; we still have to estimate the gradient of J somehow. Some simple, so-called duct tape approaches you could try to employ are, for example, the so-called finite differences. Instead of computing the derivative exactly, you just pretend that the infinitely small value in the definition of the derivative is equal to, say, 10 to the power of minus 5, this epsilon here. You compute something that looks like a derivative, but it's not the derivative, because the value is not infinitely small. This will technically work. It will require you to, well, estimate J for the current policy by sampling, then change this policy ever so slightly by this small value epsilon, and then estimate J for the perturbed policy by sampling more sessions (there's a short code sketch of this just below).

Another way that you are probably more familiar with by now is the cross-entropy method. What it does is it tries to implicitly maximize something that looks like J by sampling a lot of sessions, and then taking those sessions in which the reward was higher than that of the other sessions, so the expected reward gets larger. Both of those methods will technically work, although they have some problems. Now, this time I would ask you to criticize those methods.
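Here is a minimal sketch in Python of the two ideas above: estimating J by Monte Carlo sampling and the "duct tape" finite-differences gradient. The helper `play_session(theta)` is a hypothetical placeholder for whatever plays one game with policy parameters `theta` and returns its total reward; this is an illustration of the idea under those assumptions, not the course's reference implementation.

```python
import numpy as np

def estimate_J(theta, play_session, n_sessions=100):
    """Monte Carlo estimate of the expected reward J(theta):
    play n_sessions games with policy parameters theta and
    average their total rewards."""
    return np.mean([play_session(theta) for _ in range(n_sessions)])

def finite_difference_gradient(theta, play_session, eps=1e-5, n_sessions=100):
    """'Duct tape' gradient estimate: nudge each parameter by eps and
    measure how the *sampled* J changes. Every coordinate needs its own
    fresh batch of sessions, which is why this is expensive and noisy."""
    base = estimate_J(theta, play_session, n_sessions)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        shifted = theta.copy()
        shifted.flat[i] += eps          # perturb one parameter
        grad.flat[i] = (estimate_J(shifted, play_session, n_sessions) - base) / eps
    return grad

# Toy usage with a fake one-step environment: the reward is higher when the
# (single, hypothetical) parameter is close to 2, plus some sampling noise.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_play = lambda theta: -(theta[0] - 2.0) ** 2 + rng.normal(scale=0.1)
    theta = np.array([0.0])
    print(finite_difference_gradient(theta, fake_play, eps=1e-2))
```

Even in this toy case you can see the trade-off the lecture is pointing at: the sampling noise in the two J estimates gets divided by epsilon, so shrinking epsilon amplifies the noise.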
So while those two methods do work in theory, in practice they have the issue of being very hard to estimate efficiently in any practical situation. For example, if you were trying to solve Breakout via this finite differences method, it would actually take you something like 100 games to estimate the first J, the one for the policy shifted by epsilon, and then another 100 games to estimate the second J. And even then, the amount of noise you introduce by sampling would still be much larger than the difference between the two policies, especially if epsilon is sufficiently small. And if you use large values of epsilon, your gradient would be useless for anything more complicated than a linear model, or maybe a table in this case.

Stochastic optimization like the cross-entropy method is, in this case, much more practical in terms of how it uses samples, but it still has some problems. For example, remember what happens if there's some innate randomness: say, you have a slot machine in your environment, or some physically random process. In this case you'll have to use your cross-entropy method with some tweaks to prevent it from favoring the lucky outcomes, from believing that the elite sessions are elite because they have some way of tricking the slot machine.

The method we are going to study next will mitigate both of these problems by the way it computes the derivative of J. It won't use any high wizardry. Instead, it will try to find an analytical derivative of J which is easily approximated with sampling. [SOUND]