Now, let's recap. We need to learn an optimal policy. To do so, we need to define our policy, initialize it — usually at random — and then we need some kind of algorithm that improves this policy. Let's explain the policy part first. We have to define what kind of behavior we want: we have to define how the algorithm takes actions. There are two general approaches to how you do this. The first, the simplest one, is to learn an algorithm that takes your state and predicts one action. So basically, it learns nothing but the index of the action, or maybe the value of the action if it's continuous. The second kind is the idea that you can learn a probability distribution. So you can learn to predict the probability of taking each possible action. These two approaches differ in which algorithms they work with; the method we'll present, for example, can only work with a stochastic policy. Now, let's try to compare those two approaches. Let's say that you have two algorithms — the first one learns a deterministic policy and the second learns a stochastic policy, some distribution — and you try to compare them across different games. Is there maybe some case in which the stochastic policy will learn an optimal policy and the deterministic policy will fail? Any ideas? For example, let's say you have some kind of game where you have an adversary. You have an opponent which tries to get you to lose. Say you play rock-paper-scissors. The idea here is that the optimal policy in rock-paper-scissors, if you're playing against a reasonable opponent, is to pick all possible actions at random. If you only ever take one action — if you're using a deterministic policy — the opponent is going to adapt and always show the item that beats your current policy. And this way, you won't be able to learn anything. Now, the stochastic policy will be able to converge to a probability of one over three for each action, one-third. And this way, it will fare much better than a deterministic policy.
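The rock-paper-scissors argument can be sketched in a few lines of Python. This is a toy simulation, assuming an opponent that always counters our previous move; all function names here are made up for illustration:

```python
import random

ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}  # key beats value
COUNTER = {v: k for k, v in BEATS.items()}  # e.g. "rock" -> "paper"

def deterministic_policy(state):
    # a fixed deterministic policy: always plays the same action
    return "rock"

def stochastic_policy(state):
    # the optimal mixed strategy: each action with probability 1/3
    return random.choice(ACTIONS)

def play(policy, rounds=3000):
    # the opponent adapts by countering whatever we played last round
    wins, opponent = 0, random.choice(ACTIONS)
    for _ in range(rounds):
        action = policy(None)
        wins += BEATS[action] == opponent
        opponent = COUNTER[action]
    return wins / rounds

random.seed(0)
print(play(deterministic_policy))  # the adapting opponent wins almost every round
print(play(stochastic_policy))     # roughly one third of the rounds are won
```

Against this adaptive opponent, the deterministic policy can win at most the very first round, while the uniform stochastic policy keeps winning about a third of the time.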
Another feature of a stochastic policy is that it kind of takes care of exploration for you. Remember, in Q-learning you had to pick the optimal action, but you also had to flip a coin and, with some probability epsilon, pick a random action instead of the optimal one — to explore the space of possible strategies, the space of actions. This time, you won't have to do this, because you already have a stochastic policy which samples actions at random. A deterministic policy, of course, does require a separate exploration strategy. This cannot be seen as a pure boon of stochastic policies, because sometimes you do want to explicitly choose what kind of exploration you use, and stochastic policy methods like the [inaudible] method don't allow you to do so explicitly. Instead, they rely on some kind of penalties and regularizations. There's one thing we have not discussed about stochastic policies. The idea is that if you have, say, five actions — you're solving Atari and the actions are buttons — it's kind of simple to define the probability distribution. You simply memorize the probability of each action and make sure they sum to one. Now, there's a different case where you have a continuous-valued action. Say you are controlling a robot, and your action is what voltage you want to apply to the joint, to the motor there. In this case, you cannot simply memorize all possible outcomes and their probabilities, because there is a continuous amount of them. How would you define a probabilistic policy in the case of continuous actions? Any ideas? In this kind of situation, you could try some kind of parametric distribution. For an unbounded continuous variable, a normal distribution would do. Or maybe, if your action is bounded to an interval, you could try some kind of beta distribution or something similar that fits your particular problem. The methods we're going to study right now are so-called policy-based methods.
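For the continuous case, here is a minimal sketch of a Gaussian policy. It assumes, hypothetically, that some policy network has already produced a mean and standard deviation for the current state; the numbers below are made up:

```python
import math
import random

def sample_action(mu, sigma):
    # draw a continuous action (e.g. a motor voltage) from N(mu, sigma^2)
    return random.gauss(mu, sigma)

def log_prob(action, mu, sigma):
    # log-density of the action under the policy; policy-gradient
    # methods need this quantity to adjust mu and sigma
    return (-0.5 * math.log(2 * math.pi * sigma ** 2)
            - (action - mu) ** 2 / (2 * sigma ** 2))

# hypothetical outputs of a policy network for some state
mu, sigma = 1.5, 0.3
a = sample_action(mu, sigma)
print(a, log_prob(a, mu, sigma))
```

So instead of memorizing a probability per action, the policy memorizes (or predicts) just the parameters of a distribution, and actions are sampled from it.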
There are two main families of methods in reinforcement learning: the value-based methods and the policy-based ones. The value-based methods rely on, first, learning some kind of value function — V, or Q, or whatever — and then they infer a policy given that value function. Remember, if you have perfect Q-values, then you can simply find the optimal action — the one with the maximum Q-value in this particular state — and this would be your optimal action. However, if you have some error in your Q-values, your policy will be sub-optimal. Policy-based methods don't rely on this. They try to explicitly learn a probabilistic or deterministic policy, and they adjust it to implicitly maximize the expected reward or some other kind of objective.
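The distinction between the two families can be made concrete with a small sketch. The Q-values and probabilities below are toy numbers, made up for illustration:

```python
import random

def value_based_action(q_values):
    # value-based: the policy is implicit -- just take argmax_a Q(s, a)
    return max(range(len(q_values)), key=lambda a: q_values[a])

def policy_based_action(probs):
    # policy-based: the policy is explicit -- sample a ~ pi(a|s)
    return random.choices(range(len(probs)), weights=probs)[0]

q = [0.2, 1.1, 0.7]    # hypothetical Q(s, a) estimates for one state
pi = [0.1, 0.6, 0.3]   # hypothetical learned distribution pi(a|s)
print(value_based_action(q))    # deterministic: always action 1
print(policy_based_action(pi))  # an action index sampled from pi
```

Note how an error in the Q estimates can flip the argmax and change the greedy action entirely, whereas the policy-based method adjusts the probabilities themselves.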