In the previous video,

we discussed that we want to build an expectation-maximization algorithm,

for this simple discrete probabilistic fitting problem.

And we finished the E-step which is summarized in the bottom of this desk.

So now, let's go ahead and proceed with the M-step.

So, on the M-step,

we want to maximize the expected value,

maximized with respect to the parameters.

So in this case, alpha, beta,

and gamma with respect

to the objects in the data set of expected values,

with respect to q(ti) of joint distributions,

logarithms of the joint p(ti),

and (xi) which we can rewrite as a product of

p(xi/ti) times and again,

given the parameters, but I will omit them everywhere,

that's the prior of p(ti).

And we can rewrite this thing by using the definition of

the expected value which is sum with respect to all possible variables for (ti).

And there are two possible variables for (ti), here.

So, it will be sum,

the first term will be, corresponding to (ti=1).

So, for data points that are coming

from the first component of the mixture, here'll be logarithm,

here'll be the weight, probability (ti=1) times

the logarithm of

(xi/ti=1) times the prior,

and the prior here is just gamma.

So probability of (ti=1),

plus the second term

which is the same thing, but for (ti=1).

So, again sum with respect to all the objects in data set.

The variations fusion Q (ti=2) times

the logarithm of the joint distribution which is a logarithm of p(xi/ti=2),

just p2(xi) and times the prior which one minus gamma.

And we want to maximize this thing with respect to parameters.

So, with respect to alpha, beta, and gamma.

Let's rewrite this expression by using the definition of Q,

using the one we derived from the E-step,

and also substituting the definition of p1 and p2.

So here, we'll get something like this.

To start with, let's make more concrete, which data we have.

So, let's say that we observed

30 objects equals to xi=1.

So, N1=30, It's number of observations xi which

are equal to one, N2=20, and N3=60.

So this is a complete definition of our data set.

We have 31 and et cetera.

And now, we can proceed with this fitting algorithm.

So, now we're going to write this thing,

and instead of writing everywhere sum with respect to objects,

let's decompose this thing into free terms.

First of them, corresponding to the xi=1, and et cetera.

So the first term will be,

there will be 30 terms equals to each

other that corresponds to that xi that are equal to one.

Right? And so, q(ti=1).

In this case, will be just p ti=1 given that xi=.

So, it's the terms corresponding to xi=1, right?

And this thing is just one.

If we look up our results from the E-step.

This is one and times the logarithm of the joint distributions.

So, logarithm of p1 of one which is alpha,

times the prior which is gamma.

So, permuted that ti=1 is the first term of the first term here.

The second term will correspond to xi=2.

So there will be 20 of them.

The probability of q(ti),

in this case will be 0.5 from here.

So it will be times 0.5 times the logarithm of

the joint distribution which is one minus alpha times the prior.

It's p1 of two,

plus the third term.

So there will be 60,

xi=3, and the probability, this q(ti) will be zero.

So, this will be zero,

times logarithm of also zero which may look like a problem,

but actually, it's usually considered that zero times logarithm of zero is zero.

So, even if you do it carefully it will be okay.

You just kind of ignore this term in maximization problem.

This thing is ignored because it's zero.

Plus the terms corresponding to this part.

So, plus the terms corresponding to

the xi that came from the second component of the mixture.

Again, there will be 30,

xi=1 times the variation probability of q(ti=1).

And this thing could be.

p(ti=2).

So, second component of the mixture,

given xi is 1,

and this is zero, right?

So, we assume that no one's came from the second component of the mixture.

So again, it was times logarithm of zero,

but anyway, we ignored this part.

This is zero, and plus the final two terms which

is 20 times 0.5 which is q,

times logarithm of one minus beta,

times one minus alpha,

or one minus gamma which is prior probability for the second component of the mixture.

And this last term is

60 times 1 times logarithm of beta,

times the prior probability for second component which is one minus gamma.

So finally, we have this expression,

and we want to maximize it with respect to alpha, beta, and gamma.

So, let's try to simplify it a little bit.

So, get lid of the zeros,

and things like that that can make things harder for us.

And also, move all the bar sets,

it depends on the same parameter together.

So let's start with maximizing but with respect to alpha for example.

Let's identify the terms that depend on alpha.

So this thing will equal to...

So, this part does depend on alpha.

It will be 30 times logarithm of alpha,

plus 30 times logarithm of gamma,

but we don't care about it,

because it's a cost with respect to alpha.

And now, we want to maximize with respect alpha.

So, plus 10 times logarithm of one minus alpha,

and there are no alphas anywhere else.

So, plus constant with respect to alpha.

And we want to maximize this expression.

How can we do it? Well, let's take the gradient,

and set it to zero as usually.

So, the gradient of this expression with respect to alpha will be

30 times gradient of logarithm of alpha which is one divided by alpha,

plus 10 times gradient of logarithm of one minus alpha,

it's one divided by one minus alpha, minus one.

And this thing, equals to zero.

And finally, we can solve this equation by now,

we're just writing down this ratio.

30 divided by alpha,

equals to 10 divided by one minus alpha,

and by multiplying each part by one minus alpha, and alpha,

we can get that, well,

30 minus 30 alpha,

equals to 10 alpha.

And finally, which means that alpha equals to 30 divided by 40.

So, we found our alpha after the M-step.

So we updated the parameters of our alpha variable model.

We started from the initialization,

alpha equals beta, equals gamma, equals 0.5,

and after one step of the expectation maximization algorithm,

alpha started to be three divided by four.

So, we can solve the same optimization problem with respect to beta and gamma,

and get the following numbers.

So, beta will be six divided by seven,

and gamma will be four divided by 11.

So, we found our parameters, and actually,

it turns out that if you repeat this expectation maximization steps a few more times,

you will not update the parameters anymore.

So, we have already converge up to one iteration of expectation maximization.

And this parameter kind of make sense.

They say that the first component of a mixture mostly focus on [inaudible] ones.

So, the probability of changing one is three divided by four.

It's larger than 0.5,

and the second component of the mixture focus on number threes.

And we're a little bit less in terms of the first component and of the second one,

because there are less one's and three's.

So, last week, consult in details

this one step of the expectation maximization algorithm for a discrete mixture model.