0:00

[MUSIC]

In the previous video, we decided to solve the clustering problem by making a probabilistic model of our data.

But how? How can we model our data probabilistically?

Well, we don't know that many distributions so far.

But we know Gaussians, right?

So in week one we learned how to fit the parameters of a Gaussian distribution to a data set.

And we can do it here; it's a reasonable thing to do.

But it turns out that a Gaussian is not the best model for this kind of data.

Recall that we decided that our data consist of several clusters, or groups, which may be far away from each other.

And even in this simple example, we can see that a Gaussian has to model all the data points as one big circle, or maybe an ellipse.

And in this case, it just has to assign high probability to the center of the circle; that's the way a Gaussian works.

And you can see here that the center of the Gaussian falls into the region between the clusters, where there are not that many data points.

And this is the problem with modeling this kind of data with one Gaussian.

We have to model everything with one big circle, and we have some regions between the clusters where there are few data points.

But the Gaussian just has to assign high probability to these regions.

So, what else can we do?

Are there any better probabilistic models we can use here?

Well, if one Gaussian doesn't work, let's use several of them, like three for example.

So in this case we are assuming that each data point came from one of the three Gaussians, and each Gaussian explains one cluster of the data points.

Or, to put it more formally, the density of each data point equals the weighted sum of three Gaussian densities, and the weights pi are some non-negative numbers which sum up to 1, to make it an actual probability distribution.

So the parameters theta here are the weights pi, and three groups of parameters of the three Gaussians: their locations, the mu vectors, and their shapes, their covariance matrices sigma.
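To make this concrete, here is a minimal sketch of the mixture density, p(x) = sum_k pi_k N(x | mu_k, Sigma_k), for three components; the weights, means, and covariances below are made up purely for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, weights, means, covs):
    """Mixture density: sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

# Illustrative parameters for a three-component mixture in 2-D.
weights = [0.3, 0.3, 0.4]                   # pi_k: non-negative, sum to 1
means = [np.array([0., 0.]), np.array([4., 0.]), np.array([0., 4.])]
covs = [np.eye(2), np.eye(2), np.eye(2)]    # Sigma_k

p = gmm_density(np.array([0., 0.]), weights, means, covs)
```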

Notice that if we succeed in fitting this kind of model to our data, we will solve the clustering problem.

Because for each data point, we may now find which Gaussian this data point came from.

And this is exactly an alternative to finding the cluster index.

So we may say that all the points that came from one Gaussian are the points of one particular cluster.
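In code, that assignment rule is just "pick the component with the largest posterior score pi_k * N(x | mu_k, Sigma_k)". A small sketch, with made-up parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

def cluster_of(x, weights, means, covs):
    """Index of the Gaussian with the highest posterior p(k | x),
    which is proportional to pi_k * N(x | mu_k, Sigma_k)."""
    scores = [w * multivariate_normal.pdf(x, mean=m, cov=c)
              for w, m, c in zip(weights, means, covs)]
    return int(np.argmax(scores))

# Made-up three-component mixture in 2-D.
weights = [1/3, 1/3, 1/3]
means = [np.array([0., 0.]), np.array([5., 0.]), np.array([0., 5.])]
covs = [np.eye(2)] * 3

# A point near the second Gaussian gets assigned to cluster index 1.
k = cluster_of(np.array([4.8, 0.2]), weights, means, covs)
```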

This model is sometimes called the Gaussian Mixture Model, or GMM for short.

So great, we have found a model which we would like to use.

What are some positive and negative features of this model, compared to just a plain single Gaussian?

Well, obviously, a single Gaussian is much less flexible.

So the Gaussian Mixture Model allowed us to fit our complicated dataset, and it actually turns out that you may fit almost any probability distribution with a Gaussian Mixture Model with arbitrarily high accuracy.

Well, of course, as always happens with this kind of theorem, in the worst case you may have to use exponentially many Gaussians, so it's not a very practical theorem; but anyway, the Gaussian Mixture Model is a very powerful and flexible model.

The downside is obviously the number of parameters.

So as you increase the number of Gaussians you use for your data, you increase the number of parameters by the same factor, right?

Okay, great.

We decided to use this kind of Gaussian mixture model.

But how can we fit it?

How can we find its parameters?

That is, the weights pi, the mu vectors, and the sigma matrices.

Well, the simplest way to fit a probability distribution is to use maximum likelihood estimation, right?

So we want to find the values of the parameters which maximize the likelihood, which is the density of our data set given the parameters.

And we want to maximize this thing with respect to the parameters.

As usual in machine learning, we will assume that the data set consists of n data points which are independent given our parameters.

Which basically means that we can factorize the likelihood.

So the likelihood equals the product of the likelihoods of the individual objects.
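As a sketch, the factorized likelihood for a mixture could be computed like this; in practice one works with the sum of log densities rather than the raw product, to avoid numerical underflow. The data and parameters below are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, weights, means, covs):
    """log p(X | theta) = sum_i log p(x_i | theta), by independence of the x_i."""
    ll = 0.0
    for x in X:
        # Each point's density is the weighted sum of Gaussian densities.
        p_x = sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
                  for w, m, c in zip(weights, means, covs))
        ll += np.log(p_x)
    return ll

# Sanity check on made-up data: with a single component, this reduces
# to the ordinary Gaussian log-likelihood.
X = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
ll = log_likelihood(X, [1.0], [np.zeros(2)], [np.eye(2)])
```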

Â 5:02

Well, one more thing we can do to simplify this expression, to understand how to maximize it, is to substitute the definition of the marginal likelihood of x_i given the parameters, using the definition from a few slides before.

So each data point's density is a mixture of Gaussian densities, right?

Okay, so now we have this kind of optimization problem.

And just one more thing we forgot here: we have some constraints, right?

We have to say that the weights pi are non-negative, and that they sum up to 1.

Because otherwise, it will not be an actual probability distribution.
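One standard trick for handling these weight constraints (not specific to this course) is to reparametrize: optimize unconstrained numbers a_k and map them through a softmax, so the resulting weights are automatically positive and sum to 1:

```python
import numpy as np

def softmax(a):
    """Map unconstrained a_k to weights pi_k with pi_k > 0 and sum(pi) == 1."""
    e = np.exp(a - np.max(a))  # subtracting the max is a numerical-stability trick
    return e / e.sum()

# Any real-valued vector yields a valid set of mixture weights.
pi = softmax(np.array([0.0, 1.0, -1.0]))
```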

But now it seems like we're good to go.

Now we may use your favorite stochastic optimization algorithm from TensorFlow, like Adam or whatever you would like to use, and we can optimize this thing to find the optimal parameters, right?

Well, it turns out that we kind of forgot one important set of constraints here.

The covariance matrices sigma cannot be arbitrary.

Imagine that your optimization algorithm proposed to use a covariance matrix with all zeros.

It just doesn't work. It doesn't define a proper Gaussian distribution.

Because in the Gaussian distribution definition you have to invert this matrix, and you have to compute its determinant and divide by it.

So if you have a matrix which is all 0s, you will have lots of problems, like division by 0 and stuff like that.

So it's not a good idea to assume that the covariance matrix can be anything.

And actually, the set of valid covariance matrices is the set of positive semi-definite matrices.

Don't worry if you don't know what that is; it's not important right now.
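A quick way to see the problem in code: a non-degenerate covariance matrix must be symmetric and positive definite, which is exactly when a Cholesky factorization succeeds. The all-zeros matrix from the example above fails this check. This is just a validity-check sketch, not how a real optimizer would enforce the constraint:

```python
import numpy as np

def is_valid_covariance(S):
    """True iff S is symmetric and positive definite (Cholesky succeeds)."""
    if not np.allclose(S, S.T):
        return False
    try:
        np.linalg.cholesky(S)  # raises LinAlgError for non-positive-definite S
        return True
    except np.linalg.LinAlgError:
        return False

ok = is_valid_covariance(np.eye(2))          # a proper covariance matrix
bad = is_valid_covariance(np.zeros((2, 2)))  # the all-zeros matrix from the example
```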

The important part is that it's a really hard constraint to follow, so it's hard to adapt your favorite stochastic gradient descent to always follow this kind of constraint.

So, as for maintaining the property that the matrices are always positive semi-definite, I don't know how to do it efficiently; so we have a problem here, we don't know how to optimize this thing, at least with stochastic gradient descent.

Well, it turns out that even if you avoid this constraint, so if you consider a simpler model, for example if you say that all the covariance matrices are diagonal, which means that the ellipsoids that correspond to the Gaussians cannot be rotated, they have to be aligned with the axes.

In this case it's much easier to maintain the constraints.

And you can actually use some stochastic optimization here.

So for example, in this case I used Adam and I tuned its learning rate to optimize this thing.

And you can see that it's doing a reasonable job here.

So the blue curve is the performance of Adam here.

On the x-axis we see epochs, and on the y-axis we see the log likelihood, which we are trying to maximize.

And so, Adam is doing a good job, right?

In like ten epochs it optimizes this thing to something reasonable.

The red line here is the ground truth, which I know because I know from which probability distribution I generated this data, so I know the optimal value for the log likelihood.
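The experiment described above can be imitated with a small sketch: fit a mixture with a generic gradient-based optimizer, with the constraints reparametrized away (softmax for the weights, exp for the standard deviations). Here scipy's default quasi-Newton optimizer stands in for Adam, the model is a 1-D two-component mixture rather than the lecture's setup, and all data and starting values are made up:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Made-up data: two well-separated 1-D clusters.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])

def neg_log_likelihood(theta):
    a1, a2, m1, m2, s1, s2 = theta
    w = np.exp([a1, a2]); w = w / w.sum()  # weights: positive, sum to 1
    sd = np.exp([s1, s2])                  # standard deviations: positive
    p = w[0] * norm.pdf(X, m1, sd[0]) + w[1] * norm.pdf(X, m2, sd[1])
    return -np.log(p).sum()               # minimize the negative log likelihood

# Unconstrained optimization over the reparametrized parameters.
res = minimize(neg_log_likelihood, x0=[0.0, 0.0, -1.0, 1.0, 0.0, 0.0])
means = sorted([res.x[2], res.x[3]])      # fitted component means
```

With well-separated clusters and a sensible starting point, the fitted means should land near the true cluster centers (-3 and 3 here).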

But it turns out that even in this case, where we don't have these very complicated constraints, you can do so much better by exploiting the structure of your problem.

And this is something we're going to discuss in the rest of the week, something called expectation maximization.

And if you apply it here, it just works so much better.

In like a few iterations it found a value which is better than the ground truth, which probably means it is overfitting, but anyway it works well on the test set also.

So to summarize, we may have two reasons not to use stochastic gradient descent here.

First of all, it may be hard to follow some constraints which you may care about, like positive semi-definite covariance matrices.

And second of all, the expectation maximization algorithm, which can exploit the structure of your problem, is sometimes much faster and more efficient.

So as a general summary, we discussed that the Gaussian mixture model is a flexible probability distribution, which can solve the clustering problem for you if you fit this Gaussian mixture model to your data.

And sometimes it's hard to optimize with stochastic gradient descent, but there is an alternative, which we'll talk about in the next video.
