Well, one more thing we can do to

simplify this expression, to understand how to maximize it,

is to substitute the definition of the marginal likelihood of xi given

the parameters by using the definition from a few slides before.

So each data point density is a mixture of Gaussian densities, right?
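As a sketch of what this objective looks like, here is a minimal NumPy/SciPy version of the log-likelihood being maximized (the function and variable names are my own, not from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mu, sigma):
    # p(x_i) = sum_c pi_c * N(x_i | mu_c, sigma_c); we sum log p(x_i) over all i.
    densities = np.column_stack([
        multivariate_normal.pdf(X, mean=mu[c], cov=sigma[c])
        for c in range(len(pi))
    ])
    return np.sum(np.log(densities @ pi))

# Toy example: two 2-D Gaussian components with equal weights.
X = np.array([[0.0, 0.0], [5.0, 5.0]])
pi = np.array([0.5, 0.5])
mu = np.array([[0.0, 0.0], [5.0, 5.0]])
sigma = np.array([np.eye(2), np.eye(2)])
ll = gmm_log_likelihood(X, pi, mu, sigma)
```

Maximizing `ll` over `pi`, `mu`, and `sigma` is exactly the optimization problem being set up here.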

Okay, so now we have this kind of optimization problem.

And just one more thing we forgot here is that we have some constraints, right?

We have to say that the weights pi are non-negative, and that they sum up to 1.

Because otherwise, it will not be an actual probability distribution.
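One standard trick for handling the non-negativity and sum-to-one constraints (an illustration of my own, not necessarily what the course uses) is to optimize unconstrained logits and map them through a softmax, so the constraints hold by construction:

```python
import numpy as np

def softmax(logits):
    # Shift by the max for numerical stability; the output is always
    # non-negative and sums to 1, so it's a valid set of mixture weights.
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

# Any unconstrained vector of logits maps to valid weights pi.
pi = softmax(np.array([0.2, -1.3, 4.0]))
```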

But now it seems like we're good to go.

Now we may use your favorite stochastic optimization algorithm from TensorFlow,

like Adam or whatever you would like to use,

and we can optimize this thing to find the optimal parameters, right?

Well, it turns out that we kind of forgot one important set of constraints here.

The covariance matrices sigma cannot be arbitrary.

Imagine that you have,

that your optimization algorithm proposed to use a covariance matrix with all zeros.

It just doesn't work. It doesn't define a proper Gaussian

distribution.

Because in the Gaussian distribution definition you have to invert this matrix,

and you have to compute its determinant, and divide by it.

So if you have a matrix which is all 0s,

you will have lots of problems like division by 0, and stuff like that.
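You can see the failure directly (a small illustration of my own): the determinant of an all-zeros matrix is zero, and trying to invert it fails outright.

```python
import numpy as np

cov = np.zeros((2, 2))       # an all-zeros "covariance" matrix
det = np.linalg.det(cov)     # determinant is 0, so dividing by it is undefined
try:
    np.linalg.inv(cov)       # the Gaussian density also needs this inverse
    inversion_failed = False
except np.linalg.LinAlgError:
    inversion_failed = True  # singular matrix: no inverse exists
```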

So it's not a good idea to assume that the covariance matrix can be anything.

And actually, the set of valid covariance matrices is something called positive semi-definite matrices.

Don't worry if you don't know what that is, it's not important right now.

The important part is that it's a really hard constraint to follow,

so it's hard to adapt your favorite stochastic gradient descent to

always follow this kind of constraint.

So to maintain this property that the matrices are always positive

semi-definite, I don't know how to do it efficiently. So we have a problem here:

we don't know how to optimize this thing, at least with stochastic gradient descent.

Well, it turns out that even if you get rid of this constraint, so

if you consider a simpler model, for

example if you say that all the covariance matrices are diagonal, which

means that the ellipsoids that correspond to the Gaussians cannot be rotated.

They have to be aligned with the axes.

In this case it's much easier to maintain this constraint.
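In the diagonal case each variance just has to be positive, which you can guarantee by construction by optimizing log-variances instead. A sketch of that trick (my own naming, assuming NumPy):

```python
import numpy as np

def diag_gauss_log_pdf(x, mean, log_var):
    # With a diagonal covariance the density factorizes per dimension,
    # and exp(log_var) is always positive -- so there is no hard PSD
    # constraint to enforce during gradient-based optimization.
    var = np.exp(log_var)
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

# Evaluate a standard 2-D Gaussian (log_var = 0 means unit variances).
val = diag_gauss_log_pdf(np.array([1.0, -2.0]), np.zeros(2), np.zeros(2))
```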

And you can actually use some stochastic optimization here.

So for example, here I used Adam and

I tuned its learning rate to optimize this thing.

And you can see that it's doing a reasonable job here.

So the blue curve is the performance of Adam here.

On the x-axis we see epochs, and on the y-axis we see log likelihood,

which we are trying to maximize.

And so, Adam is doing a good job, right?

In like ten epochs it optimized this thing to something reasonable.

I think the other line here is the ground truth, which I know

because I know from which probability distribution I generated this data, so

I know the optimal value for the log likelihood.

But it turns out that even in this case, where we don't have these very complicated

constraints, you can do so much better by exploiting the structure of your problem.

And this is something we're going to discuss in the rest of the week,

something called expectation maximization.

And if you apply it here, it just works so much better.

In like a few iterations it found a value

which is better than the ground truth, which probably means it's overfitting,

but anyway it works well on the test set also.

So to summarize, we may have two reasons not

to use stochastic gradient descent here.

First of all, it may be hard to follow some constraints which you may care about,

like positive semidefinite covariance matrices.

And second of all, the expectation maximization algorithm,

which can exploit the structure of your problem,

is sometimes much faster and more efficient.
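To preview what's coming, here is a bare-bones EM loop for a Gaussian mixture (a toy sketch of my own, not the course's reference code): alternate computing responsibilities (E-step) with closed-form parameter updates (M-step).

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, n_components, n_iter=50):
    n, d = X.shape
    pi = np.full(n_components, 1.0 / n_components)
    # Naive init: evenly spaced data points as means (fine for this toy).
    mu = X[np.linspace(0, n - 1, n_components).astype(int)].copy()
    sigma = np.array([np.eye(d)] * n_components)
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        dens = np.column_stack([
            pi[c] * multivariate_normal.pdf(X, mu[c], sigma[c])
            for c in range(n_components)
        ])
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: closed-form weighted updates for weights, means, covariances.
        nk = resp.sum(axis=0)
        pi = nk / n
        mu = (resp.T @ X) / nk[:, None]
        for c in range(n_components):
            diff = X - mu[c]
            sigma[c] = (resp[:, c, None] * diff).T @ diff / nk[c] + 1e-6 * np.eye(d)
    return pi, mu, sigma

# Two well-separated clusters; EM should recover means near (0,0) and (8,8).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 1, (100, 2))])
pi, mu, sigma = em_gmm(X, n_components=2)
```

Note the M-step updates are exact maximizers given the responsibilities, which is why EM makes such large, stable progress per iteration compared to a tuned gradient step.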

So as a general summary we discussed that the Gaussian mixture model is a flexible

probability distribution, which can solve the clustering problem for

you if you fit this Gaussian mixture model to your data.

And sometimes it's hard to optimize with stochastic gradient descent, but

there is this alternative which we'll talk about in the next video.