So, in the last video,

we talked about what the Black-Scholes model does.

It finds a perfect replicating portfolio whose price is always equal to

the option price no matter where the stock price goes in the future.

By the law of one price, the fair option price now

should be equal to the price of this replicating portfolio now.

So, if you know how many stocks you should buy right now after you just sold an option,

you know how much you should pay for the option.

So far so good, but if you just say that the option price

today should be equal to the price of the replicating portfolio today,

you have not solved the problem yet.

Why is that? This is because we don't know

how many stocks we should buy for our replicating portfolio.

What the Black-Scholes-Merton model essentially does is answer exactly this question.

The Black-Scholes model does it by mimicking

the option through very frequent shuffling of money

between your stock investment and your bank cash account.

The model gives you a way to determine how many stocks you

should have in your replicating portfolio in each scenario for the future,

such that the total portfolio value, which is the value of the option you sold plus

the replicating portfolio, will always be

zero in the future no matter what happens with the stock price.

It turns out that this amazing result is

possible but only if your time steps are infinitesimal.

For this very special setting, and picking

a very specific model for the stock price evolution in the future,

the Black-Scholes model finds a unique option price and

a unique number of shares that you have to have in your replicating portfolio now.

So, in a nutshell, the Black-Scholes model established the fact that

even though the option price can and will change in the future

because it depends on the future stock price, which is also unknown,

a unique fair option price can be found by

using the principle of one price along with

the method of pricing by replication for

a very special choice of dynamics of the stock price.

Another name for this procedure is pricing by hedging,

because your replicating portfolio is a hedge for your option.

In the Black-Scholes model, the stock dynamics are chosen

to follow the law of geometric Brownian motion with a drift.
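
To make the geometric Brownian motion assumption concrete, here is a minimal sketch that simulates price paths using the exact log-normal step of that model. The function name simulate_gbm_path and all the numbers (starting price 100, drift 5%, volatility 20%, twelve monthly steps) are illustrative assumptions, not values from the lecture.

```python
import math
import random


def simulate_gbm_path(s0, mu, sigma, horizon, n_steps, rng):
    """Return one path [S_0, ..., S_n] of geometric Brownian motion."""
    dt = horizon / n_steps
    path = [s0]
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)  # standard normal shock
        # Exact one-step solution: log-return is normal with
        # mean (mu - sigma^2 / 2) * dt and standard deviation sigma * sqrt(dt).
        growth = math.exp((mu - 0.5 * sigma**2) * dt + sigma * math.sqrt(dt) * z)
        path.append(path[-1] * growth)
    return path


rng = random.Random(0)  # fixed seed so the run is reproducible
paths = [simulate_gbm_path(100.0, 0.05, 0.2, 1.0, 12, rng) for _ in range(20_000)]

# Sanity check on the drift: the mean terminal price should be near s0 * exp(mu * T).
mean_terminal = sum(p[-1] for p in paths) / len(paths)
print(round(mean_terminal, 2), round(100.0 * math.exp(0.05), 2))
```

The two printed numbers should be close to each other, and every simulated price stays positive, which is one reason this model is a convenient choice for stock prices.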

It turns out that this choice leads to

a closed-form expression for a unique option price for

a European call or put option on the stock

which is given by the celebrated Black-Scholes formula.
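
As a reference point, here is a minimal sketch of the Black-Scholes formula in its standard textbook form for a European call or put on a stock that pays no dividends, together with the delta, which is the number of shares held in the replicating portfolio. The function names and the example inputs are assumptions chosen for illustration. One thing worth noticing is that the drift of the stock is not an input at all; only the risk-free rate is.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def _d1_d2(spot, strike, rate, sigma, maturity):
    vol_sqrt_t = sigma * math.sqrt(maturity)
    d1 = (math.log(spot / strike) + (rate + 0.5 * sigma**2) * maturity) / vol_sqrt_t
    return d1, d1 - vol_sqrt_t


def black_scholes_price(spot, strike, rate, sigma, maturity, kind="call"):
    """Price of a European option on a non-dividend-paying stock."""
    d1, d2 = _d1_d2(spot, strike, rate, sigma, maturity)
    discounted_strike = strike * math.exp(-rate * maturity)
    if kind == "call":
        return spot * _STD_NORMAL.cdf(d1) - discounted_strike * _STD_NORMAL.cdf(d2)
    if kind == "put":
        return discounted_strike * _STD_NORMAL.cdf(-d2) - spot * _STD_NORMAL.cdf(-d1)
    raise ValueError("kind must be 'call' or 'put'")


def black_scholes_delta(spot, strike, rate, sigma, maturity, kind="call"):
    """Number of shares in the replicating portfolio for one option."""
    d1, _ = _d1_d2(spot, strike, rate, sigma, maturity)
    call_delta = _STD_NORMAL.cdf(d1)
    if kind == "call":
        return call_delta
    if kind == "put":
        return call_delta - 1.0
    raise ValueError("kind must be 'call' or 'put'")


# Illustrative inputs: at-the-money option, 5% rate, 20% volatility, one year.
call = black_scholes_price(100.0, 100.0, 0.05, 0.2, 1.0, "call")
put = black_scholes_price(100.0, 100.0, 0.05, 0.2, 1.0, "put")
delta = black_scholes_delta(100.0, 100.0, 0.05, 0.2, 1.0, "call")
print(round(call, 4), round(put, 4), round(delta, 4))
```

A quick consistency check is put-call parity: the call price minus the put price should equal the spot price minus the discounted strike.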

But simultaneously, the classical Black-Scholes model leads to

a somewhat paradoxical conclusion that options are completely

redundant as they can always be perfectly replicated by

a simple portfolio made of a stock and a bond or a cash account.

Now, if this were indeed the case in real life that these options were totally redundant,

nobody would ever trade them except for possibly very bored traders.

Yet option trading is a multibillion business where people make and lose money daily.

Traders use options and other financial derivatives

both as investment vehicles and as hedging instruments.

And this means that options are not redundant.

The reason that options are not redundant is that they carry

a substantial risk notwithstanding the proposition of the classical Black-Scholes model.

Nobody in the market trades options at their Black-Scholes prices.

And differences between Black-Scholes prices and trader prices reflect

a dealer's perception of actual risk embedded in options.

Financial professionals are well aware of the fact that

the classical Black-Scholes model completely eliminates any risk in

options for the sake of tractability by

making two strong assumptions that do not hold in practice.

The assumptions of the classical Black-Scholes model that make it

totally miss risk in options are continuous hedging and zero transaction costs.

These assumptions do not hold for real markets where the hedging is

always done with a finite frequency, for example daily.

The reason you can't hedge continuously is that

trading stocks involves transaction costs.

If you rebalanced your replicating hedge portfolio

continuously, total transaction costs would be infinite.
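
A rough simulation sketch can show where this comes from. It delta-hedges a sold call at a given number of rebalancing dates and adds up a proportional trading cost on every rebalance. Everything here is an assumption made for illustration: the cost rate of 0.1% of traded value, the zero interest rate, the market parameters, and the function names.

```python
import math
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def call_delta(spot, strike, rate, sigma, time_left):
    """Black-Scholes delta of a European call with time_left > 0 to maturity."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * sigma**2) * time_left) / (
        sigma * math.sqrt(time_left)
    )
    return _STD_NORMAL.cdf(d1)


def average_hedging_cost(n_steps, n_paths=200, cost_rate=0.001, seed=1,
                         s0=100.0, strike=100.0, rate=0.0, mu=0.05,
                         sigma=0.2, maturity=1.0):
    """Average total transaction cost when rehedging n_steps times."""
    rng = random.Random(seed)
    dt = maturity / n_steps
    total_cost = 0.0
    for _ in range(n_paths):
        spot, shares, path_cost = s0, 0.0, 0.0
        for step in range(n_steps):
            time_left = maturity - step * dt  # always at least dt
            target = call_delta(spot, strike, rate, sigma, time_left)
            # Pay a fraction of the value of the shares bought or sold.
            path_cost += cost_rate * abs(target - shares) * spot
            shares = target
            z = rng.gauss(0.0, 1.0)
            spot *= math.exp((mu - 0.5 * sigma**2) * dt + sigma * math.sqrt(dt) * z)
        total_cost += path_cost
    return total_cost / n_paths


costs = {n: average_hedging_cost(n) for n in (4, 16, 64, 256)}
for n, cost in costs.items():
    print(n, round(cost, 4))
```

With these made-up parameters the average cost should come out larger each time the number of rebalances goes up, which is the discrete-time shadow of the infinite cost in the continuous limit.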

So in reality, rebalancing is never done too frequently,

not to speak of continuous rebalancing.

This is one of the reasons why all practical uses of the Black-Scholes model

involve some modifications of either the dynamics or the hedging and pricing methods.

Now, here is our plan for what we're going to do next.

We will use a discrete time version of the Black-Scholes model as

a simple laboratory to study financial reinforcement learning models.

On the one hand, this is a well understood extension of

the Black-Scholes model that brings back some realism of actual option trading

by considering rehedging at discrete times as opposed to

a continuous rehedging obtained in the Black-Scholes limit of infinitesimal time steps.

On the other hand, keeping the rehedging frequency finite, which is similar

to how rehedging is done in real life, will allow us

to focus on the key objective of option pricing, hedging and

trading, which is risk minimization by hedging in a sequential decision making process.

Now, there are various extensions of

the classical Black-Scholes model to a discrete time

setting which are well studied in the literature.

We will just take one such formulation and use it to set

up an environment for reinforcement learning where we model

an option seller as a reinforcement learning agent that

hedges its risk in an option by trading in the underlying stock at discrete times.

To simplify things even further,

we can discretize the state space in order to map the problem

onto the setting of a finite state Markov decision process.
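
One simple way to do such a discretization, shown here only as a sketch, is to place a grid of price levels that are evenly spaced in log-price and map every simulated price to the nearest level. The grid bounds (50 to 200), the number of states (31), and the function names are assumptions for illustration; the lecture does not specify how the grid is built.

```python
import math


def make_log_price_grid(s_min, s_max, n_states):
    """Return n_states price levels, evenly spaced in log-price."""
    log_min, log_max = math.log(s_min), math.log(s_max)
    step = (log_max - log_min) / (n_states - 1)
    return [math.exp(log_min + i * step) for i in range(n_states)]


def price_to_state(price, s_min, s_max, n_states):
    """Map a price to the index of the nearest grid level, clipping at the edges."""
    log_min, log_max = math.log(s_min), math.log(s_max)
    step = (log_max - log_min) / (n_states - 1)
    index = round((math.log(price) - log_min) / step)
    return min(max(index, 0), n_states - 1)


grid = make_log_price_grid(50.0, 200.0, 31)
for price in (40.0, 100.0, 137.5, 250.0):
    state = price_to_state(price, 50.0, 200.0, 31)
    print(price, "->", state, round(grid[state], 2))
```

A state of the finite Markov decision process could then be the pair of the time step and this price index, so that a Q-table has one row per pair.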

And this will let us try some simple reinforcement learning algorithms

for a discrete state space such as the famous Q-learning and not

worry about additional complexity due to the need for

function approximation that would be necessary within a continuous state space formulation.

However, by simply reversing the last step of state space discretization in our scheme, we could

use the same framework to test

continuous-action and continuous-state reinforcement learning problems in finance.

So, in a nutshell this is what we do.

We generate data by simulating stock price histories along with actions, that is,

rehedges that implement a risk minimization strategy, and rewards from taking these actions.

Then we will give this data to Q-learning algorithms and task them with finding

the best hedging strategy, which means the risk minimization strategy, directly from this data,

without knowing anything about the dynamics and

hedge strategy that generated the data.

But because we already know the best strategy, we can continuously monitor

the progress of the Q-learning algorithms towards this goal.

We can also randomize the actions in the data, for example,

by intentionally doing sub-optimal hedges from time to time,

and again ask Q-learning to find

the best hedging strategy by looking at data collected under such a sub-optimal strategy.

This would be a simple prototype of how reinforcement learning could be

used, if it works, in a real trading environment.

It goes like this.

Take the history of the market and your own trading strategy,

then give it to a reinforcement learning

agent and ask it to improve the strategy while keeping the same goals.

On the other hand, such a problem setting is quite

standard for Q-learning, which is an off-policy algorithm that

is able to learn an optimal policy even when

the data used for training is produced using a sub-optimal policy.
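
The off-policy property can be seen in a very small sketch. It is a toy problem, not the hedging environment of this course: there are five price bins, the ideal hedge ratio for each bin is fixed in advance, the agent picks one of three hedge ratios, and the reward is minus the squared hedging error. All of these choices, including the names and the learning parameters, are assumptions for illustration. The data is logged under a purely random policy, and tabular Q-learning is then run once over the logged transitions.

```python
import random

# Hypothetical toy setup (not the lecture's environment).
TARGET_HEDGE = [0.0, 0.0, 0.5, 1.0, 1.0]  # ideal hedge ratio in each price bin
ACTIONS = [0.0, 0.5, 1.0]                 # hedge ratios the agent may choose
N_STATES = len(TARGET_HEDGE)


def collect_random_policy_data(n_transitions, seed=0):
    """Log (state, action_index, reward, next_state) under a uniformly random policy."""
    rng = random.Random(seed)
    state = N_STATES // 2
    data = []
    for _ in range(n_transitions):
        action_index = rng.randrange(len(ACTIONS))   # sub-optimal on purpose
        move = rng.choice([-1, 1])                   # price moves down or up one bin
        hedge_error = (ACTIONS[action_index] - TARGET_HEDGE[state]) * move
        reward = -(hedge_error**2)                   # penalize squared hedging error
        next_state = min(max(state + move, 0), N_STATES - 1)
        data.append((state, action_index, reward, next_state))
        state = next_state
    return data


def q_learning_from_batch(data, alpha=0.1, gamma=0.9):
    """One pass of the tabular Q-learning update over logged transitions."""
    q = [[0.0] * len(ACTIONS) for _ in range(N_STATES)]
    for state, action_index, reward, next_state in data:
        # The max over next actions is what makes the update off-policy:
        # it does not depend on which action the logging policy took next.
        td_target = reward + gamma * max(q[next_state])
        q[state][action_index] += alpha * (td_target - q[state][action_index])
    return q


data = collect_random_policy_data(5000)
q_table = q_learning_from_batch(data)
learned_hedges = [ACTIONS[row.index(max(row))] for row in q_table]
print(learned_hedges)
```

Even though no logged action was chosen with the target in mind, the greedy policy read off the Q-table should match the ideal hedge ratios, because the update only needs transitions that cover the state-action pairs.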

All such questions of direct relevance for

financial applications become answerable in such a setting.

Now, let's follow up with the next video to see how it works.