Remember our doctor running the medical trial? What would happen if this doctor already knew the long-term outcome of each treatment? Choosing the appropriate treatment would be trivial. Unfortunately, this is often not the case. Usually, the doctor will run many trials to learn about each treatment. Each day, the doctor could use all the previously collected data to estimate which treatment they believed to be the best. Let's learn how that might work.

Today, we will discuss a method for estimating the action values called the sample-average method. We will use this method to compute the value of each treatment in our medical trial example. Then, we will describe greedy action selection. Finally, we will introduce the exploration-exploitation dilemma in reinforcement learning.

Before we get started, let's recall the definition of an action value. The value of selecting an action, Q star, is the expected reward received after that action has been taken. Q star is not known to the agent, just like the doctor doesn't know the effectiveness of each treatment. Instead, we will need to find a way to estimate it.

One way to estimate Q star is to compute a sample average. We simply record the total reward for each action and divide it by the number of times that action has been selected. To understand the sample-average estimate intuitively, let's look at one action, action A. The estimated value for action A is the sum of rewards observed when taking action A, divided by the total number of times action A has been taken. We sum up to time t minus 1 because the value at time t is based on actions taken prior to time t. Also, if action A has not yet been taken, we set the value to some default, like zero.

Let's go back to our medical trial example. A doctor must decide which of the three possible treatments to prescribe. If the patient gets better, the doctor records a reward of one. Otherwise, the doctor records a reward of zero. Let's say we know Q star but our doctor does not. The doctor gives the first patient treatment P on time step one, and the patient reports feeling better. The doctor records a reward of one for that treatment and updates the estimate of its value. So far there's only one data point, so the estimated value for treatment P is one. A second patient arrives. The doctor randomly prescribes treatment P again. It fails, so the doctor records a reward of zero and updates the value estimate for treatment P to 0.5. The estimated values for the other actions remain zero, since we defined the initial estimates to be zero. Let's fast-forward time a little bit. After each treatment has been tried a few times, we can calculate the estimated values from the observed data. As the doctor observes more patients, the estimates approach the true action values.

In reality, our doctor would not randomly assign treatments to their patients. Instead, they would probably assign the treatment that they currently think is the best. We call this method of choosing actions greedy. The greedy action is the action that currently has the largest estimated value. Selecting the greedy action means the agent is exploiting its current knowledge. It is trying to get the most reward it can right now. We can compute the greedy action by taking the argmax of our estimated values. Alternatively, the agent may choose to explore by choosing a non-greedy action. The agent would sacrifice immediate reward, hoping to gain more information about the other actions. The agent cannot choose to both explore and exploit at the same time. This is one of the fundamental problems in reinforcement learning: the exploration-exploitation dilemma. We will discuss it more in an upcoming video.
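To make this concrete, here is a minimal Python sketch of the sample-average estimate with greedy action selection for the three-treatment example. The true success probabilities below are assumptions made up for illustration, not values from the video; the agent never sees them, just as the doctor does not know the effectiveness of each treatment.

```python
import numpy as np

# A minimal sketch of the sample-average method with greedy action selection.
# The success probabilities are hypothetical stand-ins for the unknown q*(a).
rng = np.random.default_rng(seed=0)
true_success_prob = [0.25, 0.75, 0.5]      # assumed q*(a) for three treatments
num_actions = len(true_success_prob)

total_reward = np.zeros(num_actions)       # sum of rewards observed per action
count = np.zeros(num_actions)              # times each action has been selected
Q = np.zeros(num_actions)                  # estimates default to zero before any data

for t in range(1000):
    # Greedy action selection: exploit by taking the argmax of the estimates.
    action = int(np.argmax(Q))

    # Reward is one if the patient gets better, zero otherwise.
    reward = 1 if rng.random() < true_success_prob[action] else 0

    # Sample-average estimate: total reward divided by the number of selections.
    total_reward[action] += reward
    count[action] += 1
    Q[action] = total_reward[action] / count[action]

print("Estimated values:", Q)
print("Selection counts:", count)
```

Note that because this sketch always selects the greedy action, it can settle on whichever treatment happens to pay off first and never gather data about the others. That is exactly the tension between exploring and exploiting described above.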
That's it. In this video, we introduced the sample-average method for estimating action values, and we defined the greedy action as the action with the largest value estimate. See you next time.