That brings us to the end of this course on

the Fundamentals of Reinforcement Learning.

Congratulations.

You now have a solid foundation to dive into

the concepts and algorithms we'll

cover in the rest of the specialization.

We started off with

an introduction to the idea of choosing

actions to maximize reward in bandits.

In bandits, we have a fixed set of

actions or arms to choose from.

Each action gives us a reward

according to some unknown distribution.

We would like to always pull the arm

that provides the highest reward on average.

Since we don't know the reward distributions initially,

we have to try each arm many times

to get an idea of each arm's average reward.
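
As a quick reminder of how those estimates are formed, a sample-average estimate can be maintained incrementally. In a standard notation (Q_n for the estimate of an arm's value before its n-th pull, R_i for the i-th reward observed from that arm), it looks roughly like this:

```latex
Q_n \doteq \frac{R_1 + R_2 + \cdots + R_{n-1}}{n-1},
\qquad
Q_{n+1} = Q_n + \frac{1}{n}\left(R_n - Q_n\right)
```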

This brought us to

the exploration-exploitation trade-off.

Pull the arm that looks best now too often,

and you might miss out on another better arm

that only appeared worse due to insufficient information.

Spend too long exploring all the possibilities,

and you might sacrifice exploiting an arm

that you have good reason to believe

has much higher value.

We talked about various strategies

to handle this trade-off.
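
One such strategy is epsilon-greedy action selection. Here is a minimal sketch, not the course's reference code, combining epsilon-greedy with the incremental sample-average update above; the arm distributions in the usage example are made up for illustration:

```python
import random

def epsilon_greedy_bandit(arms, num_steps=1000, epsilon=0.1):
    """Run a simple epsilon-greedy agent on a list of arm-sampling functions.

    `arms` is a list of zero-argument callables; calling arms[a]() returns
    one sampled reward for arm a (the reward distributions are unknown
    to the agent).
    """
    k = len(arms)
    q = [0.0] * k   # sample-average value estimates
    n = [0] * k     # pull counts per arm

    for _ in range(num_steps):
        # Explore with probability epsilon, otherwise exploit the greedy arm.
        if random.random() < epsilon:
            a = random.randrange(k)
        else:
            a = max(range(k), key=lambda i: q[i])

        r = arms[a]()                 # observe a reward from the chosen arm
        n[a] += 1
        q[a] += (r - q[a]) / n[a]     # incremental sample-average update

    return q


# Example usage with three Gaussian arms (means 0.0, 0.5, 1.0):
if __name__ == "__main__":
    arms = [lambda m=m: random.gauss(m, 1.0) for m in (0.0, 0.5, 1.0)]
    print(epsilon_greedy_bandit(arms))
```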

Dealing with bandits introduces

many interesting questions such as

how to handle the exploration-exploitation trade-off.

However, bandits don't capture everything about the full reinforcement learning problem.

The k-armed bandit problem presents

the agent with the same situation at each time-step.

There is a single best action and no need to

associate different actions with different situations.

The impact of the agent's action selection is

immediate and the reward is not delayed.

To better model the complexity of real-world problems,

we introduced Markov Decision Processes, or MDPs.

In MDPs, the action chosen by

the agent impacts not only the

immediate reward but also the next state.

In turn, this affects the potential for future reward.

So actions can have long-term consequences.

We introduced the idea of return,

which is a potentially discounted sum of future rewards.
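
Written out, for a discount factor γ with 0 ≤ γ ≤ 1 (γ < 1 for continuing tasks), the return from time t is:

```latex
G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
     = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}
```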

The MDP formalism can be used to

model many interesting real-world problems.

The solution methods we explore in

this specialization will be

applicable to a broad range of problems.

The first step is always to frame your problem as an MDP.

After introducing MDPs, we started to

describe some of the basic concepts

of reinforcement learning.

The policy tells the agent how to act in each state.

The value function estimates

the expected future return for

each state or each state-action pair

under a certain policy.
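
In symbols, for a policy π, the state-value function and action-value function are the expected returns:

```latex
v_\pi(s) \doteq \mathbb{E}_\pi\left[ G_t \mid S_t = s \right],
\qquad
q_\pi(s, a) \doteq \mathbb{E}_\pi\left[ G_t \mid S_t = s,\, A_t = a \right]
```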

Bellman equations link the value of each state or

each state-action pair to

the value of its possible successors.
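
For example, the Bellman equation for the state-value function relates v_π(s) to the values of the states that can follow s:

```latex
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)
           \left[ r + \gamma\, v_\pi(s') \right]
```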

Finally, we introduced dynamic programming algorithms.

These algorithms provide methods for solving

the two tasks of prediction and control,

as long as we have

direct access to the environment dynamics.
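
As a refresher on the prediction task, here is a minimal sketch of iterative policy evaluation. It assumes a small, hypothetical tabular MDP whose dynamics p(s', r | s, a) are handed to the function explicitly, which is exactly that "direct access to the environment dynamics" requirement:

```python
def iterative_policy_evaluation(states, actions, policy, dynamics,
                                gamma=0.9, theta=1e-6):
    """Estimate v_pi for a tabular MDP with known dynamics.

    policy[s][a]     -- probability of taking action a in state s
    dynamics[(s, a)] -- list of (prob, next_state, reward, done) tuples
    """
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = 0.0
            for a in actions:
                for prob, s_next, reward, done in dynamics[(s, a)]:
                    # Expected one-step reward plus discounted value of successor.
                    target = reward + (0.0 if done else gamma * v[s_next])
                    new_v += policy[s][a] * prob * target
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:   # values changed very little this sweep: stop
            return v
```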

In the reinforcement learning problem,

we will not assume we know the dynamics.

After all, in the real world,

we can't always expect to know the effect

of each of our actions until we try them.

Dynamic programming algorithms provide

an essential foundation for

the reinforcement learning algorithms

we will cover in the rest of the specialization.

Give yourself a pat on the back

for getting through all this material.

You now have all the background to

understand the reinforcement learning setting.

In the next course, we will

discuss algorithms for estimating

value functions and policies directly from experience.

These sample-based learning algorithms do not require

or even estimate the transition dynamics.

Hope to see you there.