Okay, now, we're ready to put together what we discussed in the previous video

about reinforcement learning and what we

discussed in the previous lesson, about sequence modelling.

Let's come back to the diagram of a reinforcement learning agent,

interacting with the environment and look at it one more time.

The thing that we did not yet discuss here,

but only implied so far is the fact that this process of an agent

perceiving the environment and taking actions is extended over a certain period of time.

During this period, the state of environment changes.

These changes might be determined by

the previous history of the environment and in addition,

can be partially driven by some random factors,

as well as the agent's own actions.

If we leave the agent's actions aside for a second,

we already have a good modeling framework for the evolution of the environment.

Indeed as we saw in a previous lesson,

we can use Dynamic Latent Variable models or their non-parametric extensions such as

the RNN or the LSTM neural networks to describe dynamic systems.

So, let's recall what we had in the last lesson.

We talked about two different ways of modeling dynamic systems.

The first situation arises when we observe

all variables that constitute a state of a dynamic system.

In our case, by a dynamic system we mean the environment in our first diagram.

In this case, the simplest dynamic system with

randomness would be a first-order Markov process, as we discussed earlier.

If Y is an observed state of the environment,

then Y will follow a first-order Markov process, shown by these blue circles.

In this approach, we assume that the environment is fully

observable and there are no hidden factors driving the dynamics,

in addition to the observed components of the vector Y.
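
As a concrete illustration, a fully observed first-order Markov process can be simulated in a few lines of Python; the two-state transition matrix below is purely illustrative and not from the lecture:

```python
import numpy as np

# A minimal sketch of a fully observed first-order Markov process:
# the next state depends only on the current state, through a fixed
# transition matrix P (all numbers here are illustrative).
P = np.array([[0.9, 0.1],   # P[i, j] = Prob(Y_{t+1} = j | Y_t = i)
              [0.3, 0.7]])

rng = np.random.default_rng(0)
y = 0                           # initial observed state
path = [y]
for _ in range(10):
    y = rng.choice(2, p=P[y])   # sample the next state from row P[y]
    path.append(y)
print(path)                     # a sampled trajectory of observed states
```

Because the state is fully observed, the whole history of the chain is summarized by the current value of y alone.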

The second way we discussed,

is to assume that there is some hidden random process X that drives the dynamics.

This is shown on the diagram on the bottom,

where purple circles denote the hidden states and blue circles show observed states.

This describes the workings of Dynamic Latent Variable models.

If we want to describe a partially observed environment,

where some variables are observed and some are not observed or are simply non-observable,

Dynamic Latent Variable models are ideally suited for this goal.
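
To make this concrete, here is a minimal Python sketch of such a model, using an assumed linear-Gaussian form in which a hidden process x_t drives noisy observations y_t; all coefficients are illustrative assumptions:

```python
import numpy as np

# A minimal sketch of a Dynamic Latent Variable model: a hidden
# linear-Gaussian process x_t drives noisy observations y_t.
# The coefficients below are illustrative, not from the lecture.
rng = np.random.default_rng(1)
a, c = 0.95, 1.0      # hidden-state persistence, observation loading
q, r = 0.1, 0.5       # process noise and observation noise std devs

x = 0.0
xs, ys = [], []
for _ in range(100):
    x = a * x + q * rng.standard_normal()   # hidden dynamics (purple circles)
    y = c * x + r * rng.standard_normal()   # observation (blue circles)
    xs.append(x)
    ys.append(y)
print(len(xs), len(ys))
```

Note that the observed sequence y_t alone is not Markov here: its distribution at each step depends on the hidden x_t, which is exactly the partially observed situation described above.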

It turns out that both Markov models and Dynamic Latent Variable models

can be used to formalize the dynamics for reinforcement learning problems.

Before we go to details here,

I just want to make one remark.

Most of the very impressive progress reported recently in reinforcement learning by researchers and

companies such as Google's DeepMind or OpenAI covers playing video games,

walking robots, self-driving cars, and the game of Go, with stunning results such as AlphaGo and AlphaGo Zero by DeepMind.

All these groundbreaking achievements were done in the setting of

a completely observable system using Markov model dynamics.

The progress with partially observed environments,

which might be modeled using Dynamic Latent Variable models, has been more modest so far.

Now, it is a million dollar question what kind of description is better for finance,

a fully observable environment or a partially observable environment.

Well, if we want to be pedantic,

we would say that the assumption of

a partially observed environment is more adequate for most financial problems.

But if we want to be more pragmatic,

we should start with something simpler and not

try to climb the highest tree in the forest.

Reinforcement learning with a fully observable environment is far

simpler than with a partially observed environment and, at the same time,

it's quite complex on its own to keep us busy for quite a while.

In the worst case scenario,

it will not be wasted time for real world financial applications anyway.

This is because we can always just choose not to use any hidden variables and only model

financial observables as a description of

our financial environment, and

such a framework would serve as some approximation to reality.

In our first week, we outlined the applications of reinforcement learning in

finance such as optimal trading strategies for quantitative trading,

optimal execution for brokerage trading,

and multi-period portfolio optimization.

Quantitative models that describe the market environment for these tasks outside

of reinforcement learning use both assumptions of

a fully observable and a partially observable environment.

Therefore, we will start with reinforcement learning in

a fully observable environment, and in this case,

the problem that reinforcement learning solves is called the Markov decision process.

Here is how it works. For a Markov process,

we had first-order or K-th order Markov dynamics of

transitions between states of the system, as shown in this diagram.

Now, to bring a reinforcement learning agent into the game,

we extend this framework by adding the actions of the agent to the dynamics.

More specifically, we add an action

variable a(t) here to describe the actions of the agent.

So, to take into account the possible impact of the agent's actions on the environment,

we make the transition probability of the Markov process dependent on the action taken.

And this produces a Markov decision process.
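
As a sketch of this idea, the Python snippet below samples a tiny action-dependent Markov chain; the transition tensor P[a, s, s'] and all its numbers are hypothetical, chosen only to illustrate the structure:

```python
import numpy as np

# Sketch of an MDP transition model: the transition probability
# now depends on the action a_t as well as the current state.
# P[a, s, s'] = Prob(next state s' | state s, action a).
# The numbers are illustrative, not from any real market model.
P = np.array([
    [[0.8, 0.2],        # action 0
     [0.4, 0.6]],
    [[0.5, 0.5],        # action 1
     [0.1, 0.9]],
])

rng = np.random.default_rng(2)
s = 0
for t in range(5):
    a = rng.integers(2)              # agent picks an action (here: at random)
    s = rng.choice(2, p=P[a, s])     # environment transitions given (s, a)
    print(t, a, s)
```

Compare this with the plain Markov chain earlier: the only change is the extra action index on the transition probabilities, which is exactly what turns a Markov process into a Markov decision process.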

A Markov decision process or MDP for short,

is the simplest modeling framework

that allows us to formalize the problem of reinforcement learning.

The task solved by Markov decision processes is the problem of optimal control,

which is the problem of choosing action variables

over some period of time in order to maximize

some objective function that depends both on the future states and actions taken.

Markov decision processes have been known since the 1950s from the work of Richard Bellman,

who invented a recursive method for

solving the problem of optimal control, known as the Bellman equation.
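
To illustrate, here is a minimal Python sketch of value iteration, one standard way of solving the Bellman equation by repeated recursive updates; the tiny MDP below (transition tensor P, rewards R, discount factor gamma) is entirely illustrative:

```python
import numpy as np

# Sketch of solving a tiny MDP via value iteration, i.e. by
# iterating the Bellman equation until the values stop changing.
# P, R, and gamma are illustrative assumptions.
P = np.array([
    [[0.8, 0.2], [0.4, 0.6]],   # P[a, s, s'] for action 0
    [[0.5, 0.5], [0.1, 0.9]],   # ... and for action 1
])
R = np.array([[1.0, 0.0],       # R[a, s]: reward for action a in state s
              [0.5, 2.0]])
gamma = 0.9                     # discount factor

V = np.zeros(2)
for _ in range(500):
    # Bellman update: V(s) <- max_a [ R(a,s) + gamma * sum_s' P(a,s,s') V(s') ]
    V = np.max(R + gamma * P @ V, axis=0)
print(V)                        # approximate optimal state values
```

Because the Bellman update is a contraction for gamma < 1, iterating it converges to the optimal value function for this toy problem.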

He suggested the dynamic programming approach,

what I describe is dynamic programming,

as a way to solve the Bellman equation.

However, it turns out that dynamic programming is only feasible

if the dimensionality of the state space is low.

In practice it does not work well beyond four dimensions.

Reinforcement learning grew out of dynamic programming as

a way to solve Markov decision processes

for real-world settings where the dimensionality of

the state space is in the tens or even hundreds.
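
A quick back-of-the-envelope calculation shows why tabular dynamic programming breaks down: if each state dimension is discretized into n bins (n = 100 below is an assumed example), a tabular method needs n to the power d entries:

```python
# Why tabular dynamic programming breaks down in high dimensions:
# discretizing each of d state dimensions into n bins gives n**d
# table entries, which grows exponentially in d.
n = 100                    # illustrative number of bins per dimension
for d in [1, 2, 4, 10]:
    print(d, n ** d)       # table size explodes as d grows
```

Already at d = 4 the table has a hundred million entries, which matches the remark above that dynamic programming rarely works well beyond four dimensions.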

In the next video,

we will talk more about Markov decision processes and reinforcement learning,

but for this lecture, let's conclude as usual with questions for you.