Now we will develop

a simple stock portfolio model that we will use in our Reinforcement Learning approach.

We assume a universe of N stocks or possibly other assets

such as ETFs, and denote the vector of prices at

time t as P sub t. So the vector P sub t has size N, with all elements being positive.

An investor can invest in stocks and in addition,

can keep wealth in a risk-free bank

cash account that earns a risk-free interest rate rf.

We will denote this amount in cash as b sub t. Now,

a vector x sub t will describe dollar positions in all assets,

so it also has size N. The number of shares of each stock can be obtained by dividing

the components of vector xt by market prices Pt and rounding to the nearest integer value.

Negative components of vector xt describe short positions in stocks,

and positive components are long positions.
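
As a minimal sketch of the conversion from dollar positions to share counts described above: the prices and positions here are made-up illustration values, and `shares_from_positions` is a hypothetical helper name, not something from the lecture.

```python
def shares_from_positions(x_t, p_t):
    """Convert dollar positions into whole-share counts.

    Each dollar position is divided by the asset's price and rounded to the
    nearest integer. Negative entries stay negative (short positions).
    Note: Python's built-in round() sends exact .5 ties to the even integer.
    """
    return [round(x / p) for x, p in zip(x_t, p_t)]


p_t = [50.0, 20.0, 125.0]       # hypothetical prices, all positive
x_t = [1000.0, -425.0, 300.0]   # hypothetical dollar positions (second is short)
n_t = shares_from_positions(x_t, p_t)
print(n_t)  # [20, -21, 2]
```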

Now, a vector u sub t will describe trades made at the beginning of time step t.

Positions xt change instantaneously and deterministically right after trades,

so that the new value xt with the superscript

plus right after the trade is simply given by the sum of xt and ut.

Trades in this model have costs and market impact, which we will specify a bit later.

Now, the total portfolio value is just the sum of

all components of vector xt plus the amount of cash.

We will write the sum of all components as a dot product of

a vector of ones and vector xt, as shown in equation two.

If we replace here the values of xt and bt by their post-trade values xt plus and bt plus,

we can compute the value of the portfolio

right after the trade as shown in equation three.

Now we assume that all changes in

stock positions can only be financed from a cash position.

This is known as a self-financing portfolio.

Because trading has costs,

additional terms for trading costs will be added later.

But for now, we use the self-financing condition as shown in equation four.

The meaning of this condition is that the portfolio value cannot

instantaneously change simply by reshuffling the wealth between cash and stocks.

It only changes as a result of random returns on stocks in the portfolio.
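
The self-financing condition can be checked numerically with a small sketch. It assumes the condition takes the form "sum of trades plus change in cash equals zero" with no trading costs yet, which is how the description above reads; the function names and numbers are illustrative only.

```python
def trade(x_t, b_t, u_t):
    """Apply trades u_t and finance them entirely from cash (no costs yet).

    Returns the post-trade positions x_t^+ = x_t + u_t and the post-trade
    cash b_t^+ = b_t - sum(u_t).
    """
    x_plus = [x + u for x, u in zip(x_t, u_t)]
    b_plus = b_t - sum(u_t)
    return x_plus, b_plus


def portfolio_value(x, b):
    """Total value: dot product of a vector of ones with x, plus cash."""
    return sum(x) + b


x_t = [1000.0, -425.0, 300.0]   # hypothetical dollar positions
b_t = 500.0                     # hypothetical cash
u_t = [200.0, 125.0, -300.0]    # hypothetical trades (buy, buy, sell)

x_plus, b_plus = trade(x_t, b_t, u_t)
v_t = portfolio_value(x_t, b_t)
v_plus = portfolio_value(x_plus, b_plus)
print(v_t, v_plus)  # both 1375.0: reshuffling wealth leaves the value unchanged
```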

Portfolio changes are described in this model as follows:

a re-balance of the portfolio at the beginning of time interval t

is followed by an investment period until the end of interval t. We define

the returns of assets over this period in the usual way, as shown in equation six.

The vector Xt plus one for the beginning of the next period,

which is the same as the end of the current period,

is then given by equation seven.

Please note that we use element-wise multiplication here, also known as

the Hadamard product, which is denoted by an encircled dot in this equation.
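
A minimal sketch of this position update follows. It assumes that "the usual way" means simple returns, that is, the price change divided by the starting price; the prices and positions are made-up values.

```python
def asset_returns(p_t, p_next):
    """Simple returns over one period: (P_{t+1} - P_t) / P_t, per asset."""
    return [(pn - p) / p for p, pn in zip(p_t, p_next)]


def next_positions(x_plus, r_t):
    """x_{t+1} = x_t^+ + r_t (element-wise product) x_t^+."""
    return [xp + r * xp for xp, r in zip(x_plus, r_t)]


p_t = [50.0, 20.0, 125.0]       # hypothetical prices at time t
p_next = [55.0, 19.0, 125.0]    # hypothetical prices at time t + 1
x_plus = [1200.0, -300.0, 0.0]  # hypothetical post-trade dollar positions

r_t = asset_returns(p_t, p_next)       # approximately [0.1, -0.05, 0.0]
x_next = next_positions(x_plus, r_t)   # approximately [1320.0, -285.0, 0.0]
print(r_t, x_next)
```

Note that the short position shrinks in absolute value when its price falls, which is the gain one would expect on a short.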

Next, we need to specify an asset return model for vector rt.

We will use a very simple parameterization shown in equation eight on this slide.

What this equation says is that the excess return,

that is, the difference of rt and rf,

is equal to a linear combination of some vector of predictors that we call zt

and the vector ut of trades at time t, with certain weights.

Wt stands for a matrix of weights for predictors zt,

and matrix Mt is a matrix of market impacts for trades.

And finally, Epsilon t will be a vector of residuals with mean 0 and variance Sigma

t. Now we can compute the new portfolio value at time t plus one.

This calculation is shown in equation 10.

We just multiply the vector xt plus one by a vector of ones,

and using the expression for xt plus one,

this gives the final expression in equation 10.
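
The return model and the new portfolio value can be sketched together as follows. Several things here are assumptions, since the slide equations are not part of this transcript: the market impact term is subtracted, the impact matrix is diagonal (so transposing it would make no difference), cash grows at the rate rf over the period, and the residual is set to zero to keep the example deterministic. All names and numbers are illustrative.

```python
def matvec(a, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]


def model_returns(r_f, w_t, z_t, m_t, u_t, eps_t):
    """r_t = r_f + W_t z_t - M_t u_t + eps_t (sign of impact term assumed)."""
    signal = matvec(w_t, z_t)
    impact = matvec(m_t, u_t)
    return [r_f + s - i + e for s, i, e in zip(signal, impact, eps_t)]


def next_portfolio_value(x_plus, b_plus, r_t, r_f):
    """V_{t+1} = sum_i (1 + r_i) x_i^+ + (1 + r_f) b^+ (cash growth assumed)."""
    stocks = sum((1.0 + r) * xp for r, xp in zip(r_t, x_plus))
    return stocks + (1.0 + r_f) * b_plus


r_f = 0.01                           # hypothetical risk-free rate per period
w_t = [[0.5], [0.2]]                 # 2 assets, 1 predictor
z_t = [0.04]                         # hypothetical predictor value
m_t = [[1e-5, 0.0], [0.0, 2e-5]]     # hypothetical diagonal impact matrix
u_t = [200.0, -100.0]                # trades in dollars
eps_t = [0.0, 0.0]                   # residuals switched off for illustration

r_t = model_returns(r_f, w_t, z_t, m_t, u_t, eps_t)   # about [0.028, 0.02]
v_next = next_portfolio_value([1200.0, -300.0], 400.0, r_t, r_f)
print(r_t, v_next)                                    # v_next is about 1331.6
```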

Now we can compute the change of the portfolio in excess of a risk-free growth,

and this is computed in equation 11.

Here, we subtract the term one plus rf times Vt from Vt plus 1,

and using the self-financing condition,

this produces the final expression in equation 11.
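
As a numerical check of this step, the sketch below compares the direct difference with the shortcut that follows from the self-financing condition, namely the excess returns times the post-trade positions. It assumes cash grows at rf over the period and that the pre-trade value equals the post-trade value; the numbers are made up.

```python
def excess_value_change(x_plus, r_t, r_f):
    """Shortcut: sum_i (r_i - r_f) * x_i^+, the change in excess of risk-free growth."""
    return sum((r - r_f) * xp for r, xp in zip(r_t, x_plus))


r_f = 0.01                  # hypothetical risk-free rate per period
r_t = [0.028, 0.02]         # hypothetical realized asset returns
x_plus = [1200.0, -300.0]   # hypothetical post-trade dollar positions
b_plus = 400.0              # hypothetical post-trade cash

# By self-financing, the value before the trade equals the value after it.
v_t = sum(x_plus) + b_plus
v_next = sum((1.0 + r) * xp for r, xp in zip(r_t, x_plus)) + (1.0 + r_f) * b_plus

direct = v_next - (1.0 + r_f) * v_t
shortcut = excess_value_change(x_plus, r_t, r_f)
print(direct, shortcut)     # both approximately 18.6
```

The cash term cancels in the subtraction, which is why only the stock positions and the excess returns remain.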

This equation can be applied at any time step except the very last one.

For the last time step,

we have to impose terminal conditions that would

be appropriate for the particular problem we want to address.

For example, if we deal with index tracking or a similar benchmark-tracking problem,

then the terminal values xT, with capital T,

should match the given values x sub B of a benchmark portfolio,

for example, the S&P 500 portfolio.

These can be used to fix the last action uT,

capital T, as shown in equation 12.

So that the last action is deterministic and therefore

drops out from the problem of action optimization,

which should be solved for the remaining actions U zero to U sub T minus one.

For other settings, we can formulate different terminal conditions.

For example, for an optimal investment portfolio,

we can impose zero terminal conditions for x sub capital T,

meaning that all stock positions will be traded for cash at time T.

This does not necessarily mean that we really have to close all positions at time T.

Rather, this condition mainly serves as

a way to express the value in stocks in terms of a cash value.

For the optimal portfolio liquidation problem,

we also impose zero terminal conditions on X sub capital T. Finally,

we can discuss initial conditions for the problem.

These will be specific to the particular problem that we

consider within the general portfolio optimization setting.

For example, if we deal with optimal stock execution,

the initial condition will be the initial dollar value

of the number of shares that we need to sell.