0:00

So you already know by this time there are some basic strategies of exploration.

For example, the Q-learning,

and DQN and Sarsa,

and all those value based algorithms do exhibit.

They try to take

other best action probability one minus epsilon or random action probability epsilon.

And this basically, formally solves

the exploration problem because eventually they're going to

find the epsilon policy, or will we?

Another strategy which looked a bit more advanced,

although it's just as heuristical,

is the so-called Boltzmann exploration policy,

exploration strategy, again, I'm sorry.

In this case, you want to take your Q values and

make them into something that looks more like a probability.

In Deep Learning, you know that if you have some arbitrary values,

you can make them into probability by applying soft marks,

and multiplying or dividing them by some coefficient to account for

how ruthlessly do you want your agent to explore.

Let's look at some methods we've built in exploration.

For example, cross-entropy methods and the policy

gradient method family like reinforce or actor-critic,

they all learn the probability of taking action explicitly.

And you could, of course,

affect them with the regularization of entropy or something similar.

But in general, the algorithm decides this exploration policy for itself.

So now to see how all those exploration strategies

fare against our newly invented notion of regret,

let's begin with the simple one.

This time, we're going to explore the epsilon-greedy policy,

with epsilon precisely equal to 0.1.

This means that at any moment of time you have

10 percent probability of behaving randomly,

and 90 percent of picking whichever action is

maximizing the Q function you currently have, the imperfect one.

Why don't you answer me this question,

I have this notion of regret which can be treated as a function

of time up to which you add up those coefficients.

I want you tell me what's going to happen with regret for this epsilon-greedy strategy.

So, what's going to be the curve which

represents the stack difference between epsilon-greedy policy,

learning over time, and the optimal policy which

picks the action with highest return all the time.

Well, yes, unfortunately it's the worst possible case,

it's going to grow linearly.

And the reason here is that if you explore the constant epsilon of 0.1,

means that even if you have found true Q values,

you can still not get

the optimal policy out of it because the remainder of time you're going to explore,

and some of this exploration is going to be bad for you.

So, this basically means that your graph is going to grow linearly

because there's going to be some constant offset between

0.1 greedy strategy and the greedy one,

which is the optimal, which is going to stack and grow and

grow again until it reaches infinity, well, nearly.

So, this is a bad scenario,

and surely we're going to fix it with some of

the tactic we've been learning over the previous weeks.

How do we treat this Epsilon greedy policy better so that it converts

to a constant regret?

So, of course there's more than one way you can do that.

But, the general pattern is that you have to start

with suffixed epsilon and then decrease it over time.

And there is a notion of so-called greedy in the limit strategy,

which means that your exploration eventually converges

to an optimal greedy policy after infinite number of time steps.

For example, you can start with Epsilon equal one or whatever number you prefer,

and at every time step you can just multiply it by 0.99.

This constant is of course dependent on a particular setting,

but this means that at no fixed point of

time will your algorithm terminate the exploration get exploiting.

But formally, in the infinity,

it will asymptotically converge to a greedy policy.

This is theoretically good because it allows you

to babble about how your algorithm's regret is converging somewhere,

of course, provided that it is able to learn optimal policy in this region.

We've also seen some other mechanisms.

For example, the classical Deep Q Network, DQN,

uses not this multiplication by some constant which is near one,

but the linear trajectory where it subtracts

some constant value from espilon which is zero or some small constant value.

Theoretically, it's less than perfect because if you're using Epsilon,

then you should have reach zero Epsilon before you've gone to the optimal strategy.

Your grid is going to grow because you have not learned the optimal one so far.

But again in reality doesn't matter that much.

This is basically how regret works

in terms of the strategies you've already been using from which to.

You could also take say Boltzmann Exploration Policy,

you find out that if you use the same strategy with tau,

if you start with constant tau and then decrease it over time,

then you'll devise strategy which is

greedy in the limit in terms of this Boltzmann Exploration Policy.

We need better or worse than Epsilon greedy

depending on a particular environment and particular regime you're using.

So, this is basically duct tape challenge where the winner is not

that much of theoretically better algorithm but the one

that's just slightly better fits this particular problem.

Instead, I want you to focus on something, well, qualitatively different.

I want you to, and I'm going to get you through some of

the algorithms that explicitly pick actions that are most interesting to our agents.

This is precisely the topic of our next section.