For the final example,

we'll need a beta distribution.

Its probability density function is given as follows.

It has two parameters, a and b, that are assumed to be positive, and

x is assumed to be in the range from 0 to 1.

So this distribution is useful for modeling variables that have finite support.

The plots of the probability density are given as follows.

You can either get a bell-shaped distribution or a U-shaped distribution.

You can also get a uniform one.

The functional form looks as follows.

You have x to the power a minus 1, times (1 minus x) to the power b minus 1.

And it also has a normalization constant.

And the normalization constant can be expressed through the gamma function as

the ratio of the gamma of the sum of

the parameters to the product of the gamma functions of each parameter.

The mean, mode and the variance are defined as follows.

For example the mean would be A over A plus B.
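As a sketch of the formulas so far (the function names are mine, and only the standard library's gamma function is used), the density and its summary statistics can be written as:

```python
import math

def beta_pdf(x, a, b):
    # Normalization constant: Gamma(a + b) / (Gamma(a) * Gamma(b))
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    # Functional form: x^(a-1) * (1 - x)^(b-1)
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)

def beta_stats(a, b):
    mean = a / (a + b)
    mode = (a - 1) / (a + b - 2)  # only defined for a > 1 and b > 1
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, mode, var

# a = b = 2 gives a bell-shaped density centered at 0.5
print(beta_stats(2, 2))  # mean 0.5, mode 0.5, variance 0.05
```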

Let's see how we can use the beta distribution to model our favorite movie rank.

So imagine you have your favorite website which ranks the movies.

And it assigns a rank to each movie,

where one means the best movie and zero means the worst.

Also since some new movies appear the rank of your favorite movie may vary a bit.

So for example the rank of my favorite movie is 0.8

and it varies somewhere around that value with a standard deviation of 0.1.

Since the rank is distributed from 0 to 1,

we can use beta distribution to model it.

So again we can plug in the numbers 0.8 and 0.1 squared into

the mean and the variance formulas and find out that A should

be equal to 12 and B should be equal to 3.
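This moment-matching step can be sketched as follows (the helper name is my own): inverting the mean and variance formulas gives a + b = mean·(1 − mean)/variance − 1, and then a and b follow from the mean.

```python
import math

def beta_from_moments(mean, var):
    # From mean = a/(a+b) and var = ab / ((a+b)^2 (a+b+1)),
    # the total a + b is mean * (1 - mean) / var - 1.
    n = mean * (1 - mean) / var - 1
    return mean * n, (1 - mean) * n

a, b = beta_from_moments(0.8, 0.1 ** 2)
print(a, b)  # approximately a = 12, b = 3
```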

And the probability density is given as follows. All right.

Actually the beta distribution is conjugate to the Bernoulli likelihood.

Let's see what happens.

Here's the Bernoulli likelihood.

We have a data set of points.

Those could be either zeros or ones, and N1

is the number of times one appeared in X,

and N0 is the number of times zero appeared in our data set.

And so the likelihood is given as theta to the power N1

times (1 minus theta) to the power N0.

Let's select the beta distribution as a prior.

If we drop the constants we'll get the following function.

Now to compute that posterior,

we need to multiply the likelihood by the prior.

If we plug in the formulas from above we'll get the following formula.

If we rearrange the terms,

if we group the terms of theta and one minus theta,

we'll get such function.

And again we can recognize the beta distribution from this functional form.

It will have new parameters that are equal to N1 plus a and N0 plus b.
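A minimal sketch of this conjugate update (the function name and the toy data set are mine): the posterior of a Beta(a, b) prior after Bernoulli observations is simply Beta(a + N1, b + N0).

```python
def beta_posterior(a, b, data):
    # Count ones (n1) and zeros (n0) in the observed 0/1 data
    n1 = sum(data)
    n0 = len(data) - n1
    # Conjugacy: the posterior is again a beta distribution
    return a + n1, b + n0

# Hypothetical data set of binary observations
data = [1, 0, 1, 1, 0, 1]
print(beta_posterior(1, 1, data))  # uniform Beta(1, 1) prior -> (5, 3)
```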

All right.

Let's summarize what we've seen now.

Here's our Bayes formula again.

And we want to compute the posterior,

however we can't do this because of the evidence.

What we could do, we could choose a prior that

would help us to easily compute the posterior and those are called the conjugate priors.

The conjugate priors have a lot of pros.

For example we get an exact posterior.

It is also easy for online learning.

In this example we've seen that the posterior can be computed with

this simple formula, and you can easily update it as you get more and more data.
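The online-learning point can be sketched like this (variable names are mine): because the posterior is again a beta distribution, it can serve as the prior for the next observation, so the update runs one data point at a time.

```python
# Start from a uniform Beta(1, 1) prior
a, b = 1, 1

# Stream of binary observations arriving one at a time
stream = [1, 1, 0, 1]
for x in stream:
    # Each observation bumps one of the two counts
    a, b = a + x, b + (1 - x)

print(a, b)  # (4, 2): same as one batch update with N1 = 3, N0 = 1
```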

However there is one huge minus.

For some models, the conjugate prior may be inadequate.

And so in the next weeks we will see more advanced techniques to

compute the full posterior, or sometimes an approximate posterior.