[MUSIC]

Before we move on to latent Dirichlet allocation,

let's see what the Dirichlet distribution is.

Its probability density function is given as follows.

It is a distribution over the vector theta.

Its components should sum to 1 and be non-negative.

This is called a simplex.

A really convenient way to interpret this is a triangle.

So I have a triangle with three nodes, which correspond to

the coordinates (1,0,0), (0,0,1), and (0,1,0).

And the vector theta will correspond to the barycentric coordinates of the point.

For example, this red point would have the coordinates 0.3, 0.1, and 0.5.

Since the first and the third coordinates are large,

this point is near the left and the upper nodes.

So the distribution is parameterized by a parameter alpha, which is also a vector.

We assume that its components are positive.

The probability density function is a normalization constant, one over

the beta function of alpha, times a product over the coordinates of the vector

of the corresponding coordinate theta k to the power alpha k minus 1.
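Written out, the density just described looks as follows. This is a sketch in standard notation: the symbols B(alpha) for the normalization constant and Gamma for the gamma function are assumed here, since the slide itself is not part of the transcript.

```latex
p(\theta \mid \alpha) = \frac{1}{B(\alpha)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1},
\qquad
B(\alpha) = \frac{\prod_{k=1}^{K} \Gamma(\alpha_k)}{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)},
\qquad
\theta_k \ge 0, \quad \sum_{k=1}^{K} \theta_k = 1 .
```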

So by varying the parameter alpha, we can get different shapes of the distribution.

For example, if all alphas are less than 1,

we will have a bowl-shaped distribution.

And the most probable positions would be sparse vectors,

that have one coordinate that is large, and the others are small.

If, on the other hand, all alphas are greater than 1,

we will have a unimodal distribution.

We can also have different values of coordinates of alpha.

For example, if alpha is 5, 2, 2, as shown on the left, the distribution would be

concentrated around the first node, that has coordinates 1, 0, 0.

If for example we have alpha equal to 5, 5, 2,

then the distribution will be concentrated around the bottom edge of the triangle.
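To get a feel for how alpha changes the shape, one can draw samples and look at how large the biggest coordinate tends to be. The following is a minimal sketch using only the Python standard library. The helper names `sample_dirichlet` and `average_largest_coordinate` and the particular alpha values are illustrative assumptions, not something from the lecture. It relies on the standard fact that normalizing independent Gamma(alpha_k, 1) draws gives a Dirichlet sample.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def average_largest_coordinate(alpha, n_samples=5000, seed=0):
    """Average of the largest coordinate over many samples (hypothetical helper)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += max(sample_dirichlet(alpha, rng))
    return total / n_samples

# All alphas below 1: samples tend to be sparse, one coordinate dominates.
sparse = average_largest_coordinate([0.1, 0.1, 0.1])
# All alphas above 1: samples cluster around the center of the triangle.
peaked = average_largest_coordinate([10.0, 10.0, 10.0])

print("average largest coordinate, alpha = 0.1:", round(sparse, 3))
print("average largest coordinate, alpha = 10: ", round(peaked, 3))
```

With small alphas the largest coordinate is usually close to 1, and with large alphas it stays close to one third, which matches the two shapes described above.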

The statistics of this distribution are given as follows.

For example, the mean, or the expected value, of the coordinate i,

the expected value of theta i, can be obtained as the ratio of

the corresponding alpha i to the sum of all values of alpha.

We denote this sum as alpha 0.

And the covariance is given as follows.
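Since the slide is not part of the transcript, here are the standard expressions for these statistics, written with alpha 0 denoting the sum of all components of alpha.

```latex
\alpha_0 = \sum_{k=1}^{K} \alpha_k,
\qquad
\mathbb{E}[\theta_i] = \frac{\alpha_i}{\alpha_0},
\qquad
\operatorname{Var}[\theta_i] = \frac{\alpha_i (\alpha_0 - \alpha_i)}{\alpha_0^2 (\alpha_0 + 1)},
\qquad
\operatorname{Cov}[\theta_i, \theta_j] = \frac{-\alpha_i \alpha_j}{\alpha_0^2 (\alpha_0 + 1)} \quad (i \ne j).
```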

Also note that when K is equal to 2, we get a beta distribution.

So the beta distribution is a special case of the Dirichlet distribution,

when we have only two dimensions.
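This special case can be checked numerically. The sketch below, using only the standard library, evaluates the Dirichlet density with two components and the beta density at the same point. The function names `dirichlet_pdf` and `beta_pdf` and the test point are illustrative assumptions.

```python
import math

def dirichlet_pdf(theta, alpha):
    """Dirichlet density at theta (components positive and summing to 1)."""
    # log of 1 / B(alpha) = log Gamma(sum alpha) - sum log Gamma(alpha_k)
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    log_kernel = sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))
    return math.exp(log_norm + log_kernel)

def beta_pdf(x, a, b):
    """Beta(a, b) density at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x))

x = 0.3
via_dirichlet = dirichlet_pdf([x, 1.0 - x], [2.0, 5.0])
via_beta = beta_pdf(x, 2.0, 5.0)
print(via_dirichlet, via_beta)  # the two values agree up to rounding
```

With K equal to 2 the second coordinate is just one minus the first, so the Dirichlet density over the pair reduces to the beta density over the first coordinate.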

All right, as always,

let's see how we can apply this distribution in a real world example.

Imagine that you develop an online game, and

the players can select the strength,

the stamina, and the speed of their characters.

So the player has some credit that equals 1, and

he can distribute it among these three criteria.

For example, here, player 1 assigns more credit to

strength, and so his character would be stronger.

However, he compromises on stamina and speed.

The second player, however, assigned equal amounts of credit to

stamina and speed, and he compromised on strength.

The third player assigned equal credit to all three criteria.

And so if we collect the statistics over all our players,

we could have something like this.

It is really intuitive to model this using the Dirichlet distribution.

We could estimate the parameter alpha from our dataset,

from the statistics that we gather from the players.

And we can estimate it to be, for example, 3, 1, 1.

This would mean that in your game,

most of the players prefer strength over stamina and speed.
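The lecture does not say how alpha is estimated. One simple option, shown here purely as an assumption, is the method of moments: match the sample means and the variance of the first coordinate to the Dirichlet formulas. The sketch below uses only the standard library. The player data is synthetic, generated from alpha = (3, 1, 1) for illustration, and the helper names are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def estimate_alpha_moments(samples):
    """Method-of-moments estimate of alpha (an assumed approach, not the lecture's)."""
    n = len(samples)
    k = len(samples[0])
    means = [sum(s[i] for s in samples) / n for i in range(k)]
    # Var[theta_1] = m_1 (1 - m_1) / (alpha_0 + 1), so solve for alpha_0.
    var_first = sum((s[0] - means[0]) ** 2 for s in samples) / n
    alpha0 = means[0] * (1.0 - means[0]) / var_first - 1.0
    # E[theta_k] = alpha_k / alpha_0, so alpha_k = m_k * alpha_0.
    return [m * alpha0 for m in means]

# Synthetic (strength, stamina, speed) allocations standing in for real player data.
rng = random.Random(42)
true_alpha = [3.0, 1.0, 1.0]
allocations = [sample_dirichlet(true_alpha, rng) for _ in range(20000)]

alpha_hat = estimate_alpha_moments(allocations)
print([round(a, 2) for a in alpha_hat])  # should land near 3, 1, 1
```

In practice a maximum likelihood fit is usually preferred, but the moment estimate is easy to compute and is often used as a starting point.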

So actually, there is one more property that I want to tell you about.

The Dirichlet prior is actually conjugate to the multinomial likelihood.

If you don't remember, let me remind you what a conjugate prior is.

So the prior P of theta is conjugate to the likelihood P of X given theta

if the posterior lies in the same family of distributions as the prior.

So here's our multinomial likelihood.

It equals some normalization constant

times the product over all coordinates of theta i to the power x i.

So this is a distribution over counts.

You will have, for example, a die that has six sides.

The K here would be equal to 6.

We will have six possible outcomes.

And x, for example x1, would be equal to the number of times we had number 1.

And also n equals the number of times that we conducted our experiment.

So actually all x k should sum up to n.
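For reference, the multinomial likelihood described here can be written as follows. The normalization constant is the standard multinomial coefficient.

```latex
P(x \mid \theta) = \frac{n!}{x_1! \, x_2! \cdots x_K!} \prod_{k=1}^{K} \theta_k^{x_k},
\qquad
\sum_{k=1}^{K} x_k = n .
```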

And here is our prior, Dirichlet prior.

We will try to compute the posterior.

We would multiply the likelihood and the prior.

And so the posterior would be proportional to the following function.

The product of theta k to the power alpha k plus x k minus 1.

We obtain this formula by rearranging the terms after multiplication.

And also notice that it is actually a Dirichlet distribution up to a normalization

constant.

All right, so

we can compute the posterior by multiplying the likelihood and the prior.

If we rearrange the terms, we'll get the following formula.

This will be the product over all dimensions of

the probability of the corresponding dimension, theta k, to the power alpha k plus x k minus 1.

Now, notice that it is actually a Dirichlet distribution,

up to a normalization constant.

And so the posterior is actually a Dirichlet distribution over theta.
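The steps of this derivation can be written out as follows, dropping every factor that does not depend on theta.

```latex
p(\theta \mid x)
\;\propto\; P(x \mid \theta) \, p(\theta \mid \alpha)
\;\propto\; \prod_{k=1}^{K} \theta_k^{x_k} \cdot \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}
\;=\; \prod_{k=1}^{K} \theta_k^{\alpha_k + x_k - 1},
\qquad\text{so}\qquad
p(\theta \mid x) = \operatorname{Dir}(\theta \mid \alpha + x).
```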

And the vector of parameters would be obtained as alpha k plus xk,

at each position.

So we just sum up the two vectors.
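In code, this update is just an elementwise sum. Here is a minimal sketch for the six-sided die example. The uniform prior and the counts are made-up numbers for illustration, and the function names are hypothetical.

```python
def dirichlet_posterior(alpha_prior, counts):
    """Posterior Dirichlet parameters: prior alpha plus observed counts, per position."""
    if len(alpha_prior) != len(counts):
        raise ValueError("alpha_prior and counts must have the same length")
    return [a + x for a, x in zip(alpha_prior, counts)]

def dirichlet_mean(alpha):
    """Expected value of each coordinate: alpha_k divided by the sum of alpha."""
    alpha0 = sum(alpha)
    return [a / alpha0 for a in alpha]

# Assumed prior: alpha_k = 1 for every face (uniform over the simplex).
alpha_prior = [1.0] * 6
# Made-up counts from n = 10 rolls: how many times each face came up.
counts = [3, 1, 0, 2, 4, 0]

alpha_post = dirichlet_posterior(alpha_prior, counts)
posterior_mean = dirichlet_mean(alpha_post)

print(alpha_post)      # [4.0, 2.0, 1.0, 3.0, 5.0, 1.0]
print(posterior_mean)  # first entry is 4 / 16 = 0.25
```

Because the posterior is again a Dirichlet, the same function can be applied repeatedly as new rolls come in, with the previous posterior used as the new prior.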

[MUSIC]