Hello and welcome back. In this video, I want to talk about Gaussian mixture models and their connection to sparse modeling.

And in particular, this is going to be useful for us to introduce the concept of

structured sparse models. Let's get into that.

As examples, we are going to use the same type of image processing tasks that we

have been using for regular sparse modeling in the previous videos.

What we have is that from an observation y, we want to recover the image f.

Now, the image f has been deteriorated by this operator U and has additive Gaussian

noise as we have seen before. What type of operators are we going to consider, what do we have in mind? The same types that we have seen before.

For example, U can be a masking: basically, we let some pixels go through, and some pixels we block. This is then, basically, our observation, and we want to go back to f via inpainting.

U can be a convolution, so we're going to blur the image.

Once again, this is our y, this is the observation,

and we want to go back to f, do deblurring.

Or, we can do sub-sampling, basically, we want to go from a small

image to a larger image and that's zooming.

So, these are the types of operations that we are going to have in mind.
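
These degradation models are easy to sketch in code. Below is a minimal, hypothetical illustration (the toy image, noise level, and kernel size are made up for the example) of the three operators U, each followed by additive Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.random((32, 32))           # "clean" toy image
sigma = 0.05                       # noise standard deviation

# Masking (inpainting): keep ~20% of the pixels, block the rest.
mask = rng.random(f.shape) < 0.2
y_inpaint = mask * f + sigma * rng.standard_normal(f.shape)

# Convolution (deblurring): blur with a small box kernel.
kernel = np.ones((3, 3)) / 9.0
pad = np.pad(f, 1, mode="edge")
y_blur = sum(kernel[i, j] * pad[i:i + 32, j:j + 32]
             for i in range(3) for j in range(3))
y_blur = y_blur + sigma * rng.standard_normal(f.shape)

# Sub-sampling (zooming): keep every other pixel in each direction.
y_zoom = f[::2, ::2] + sigma * rng.standard_normal((16, 16))
```

In each case the observation y is what we are given, and recovering f is inpainting, deblurring, or zooming, respectively.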

Now, let us describe the type of models that we are going to have in mind.

And the basic idea is that, for the moment, we are not thinking about sparse modeling for f. We actually are going to be thinking about Gaussian mixture models for f. Now, as we have been doing for sparse modeling, we don't work with the whole image,

we work with patches. Once again, we work with all the

overlapping patches, all possible patches in the image.

And the idea is that we are going to model each one of these patches, f sub i. Again, each patch has been deformed.

Sometimes the operator is the same operator all across the image,

sometimes it's not, and that's why I mark it U sub i and this is our observation.

The noise is, as before, Gaussian. Now, we're going to model these signals

with K different Gaussians. So basically, if the patch is eight by eight, we again have a 64-dimensional vector as before, and we are going to model it with K Gaussians, K 64-dimensional Gaussians. For a Gaussian, we actually need to

compute the average. For each one of the K, just to have a

number in our head, we're going to make K = 10.

We are going to see that we only need a few Gaussians.

So, we need to compute the average of this 64-dimensional vector and the

covariance of this 64-dimensional vector. We need to do that for each one of the Gaussians. Now, having a model as a

Gaussian is equivalent to having a model as a PCA or Principal Component Analysis.
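
For concreteness, here is a small sketch (with a made-up toy image) of how the mean and covariance of one Gaussian would be estimated from 8x8 overlapping patches, once we know which patches belong to it:

```python
import numpy as np

rng = np.random.default_rng(1)
image = rng.random((32, 32))

# All overlapping 8x8 patches, flattened to 64-dimensional vectors.
patches = np.array([image[i:i + 8, j:j + 8].ravel()
                    for i in range(32 - 8 + 1)
                    for j in range(32 - 8 + 1)])      # shape (625, 64)

# One Gaussian of the mixture is just the empirical mean and
# covariance of the patches assigned to it.
mu = patches.mean(axis=0)               # 64-dimensional mean
Sigma = np.cov(patches, rowvar=False)   # 64x64 covariance matrix
```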

The basic idea is that, from this covariance matrix, we can compute the eigenvectors and the eigenvalues, and the eigenvectors give us a dictionary.

64 atoms in that dictionary if we are in this dimension.

So each one of the Gaussians, let's say 10,

gives us 64 atoms. We end up with a dictionary of 640 atoms

in this case, but in a very particular form, because each block of 64 atoms is the set of eigenvectors corresponding to one of the covariance matrices.
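
As a sketch (with random toy covariances standing in for the estimated ones), the structured dictionary is built by eigendecomposing each covariance and concatenating the resulting PCA bases:

```python
import numpy as np

rng = np.random.default_rng(2)
K, dim = 10, 64   # 10 Gaussians, 8x8 patches

blocks = []
for k in range(K):
    # Toy covariance for Gaussian k (in practice, estimated from patches).
    A = rng.standard_normal((dim, dim))
    Sigma_k = A @ A.T
    # The eigenvectors of the covariance form the PCA basis:
    # 64 dictionary atoms per Gaussian.
    _, eigvecs = np.linalg.eigh(Sigma_k)
    blocks.append(eigvecs)

# Concatenating the K blocks gives a structured dictionary of 640 atoms.
D = np.hstack(blocks)   # shape (64, 640)
```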

So that's our structure right now. And the basic idea is that each one of the

patches is modeled by one and only one of these.

And what we have to do is, from the observation, the noisy, blurred, or sub-sampled image (or, if we are talking about inpainting, the masked image), we are going to have to estimate the Gaussians, the 10 Gaussians, meaning, we are going to have to estimate

their average, we are going to have to estimate their covariance.

That's one thing that we have to estimate.

We also are going to have to estimate, for each patch, which one of these K or 10 Gaussians fits that patch the best. And then, once we do that, we're going to have to reconstruct the actual image from the observation.

So, a lot of things that we are going to have to do.

Estimate the K Gaussians, estimate the identity of every patch, and

also then estimate the reconstruction. Although this looks like a very complicated problem, it turns out it is not.

It's a very simple problem and it can be efficiently solved with what's called maximum a posteriori expectation maximization, or MAP-EM.

The basic idea is that, if we know the Gaussian, that is, if we know, for every patch, the best Gaussian for it, we have to do nothing else but linear operations with these types of filters to do the reconstruction. So, it's piecewise linear: for every Gaussian that we use to fit the particular patch, we just need a linear operator, and this is a very, very simple algorithm.
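
Concretely, with y = U f + n, noise variance sigma squared, and a Gaussian N(mu, Sigma) for the patch, the MAP estimate is the classical Wiener-type linear filter. A minimal sketch (the function name and the toy dimensions are my own):

```python
import numpy as np

def map_estimate(y, U, mu, Sigma, sigma2):
    """Linear (Wiener-type) MAP estimate of a patch given its Gaussian.

    Model: y = U f + n, with f ~ N(mu, Sigma) and n ~ N(0, sigma2 * I).
    """
    S = U @ Sigma @ U.T + sigma2 * np.eye(U.shape[0])
    return mu + Sigma @ U.T @ np.linalg.solve(S, y - U @ mu)

# Toy sanity check: with U = I and no noise, the estimate returns y.
rng = np.random.default_rng(3)
A = rng.standard_normal((8, 8))
Sigma = A @ A.T + np.eye(8)     # a valid (positive definite) covariance
mu = rng.standard_normal(8)
y = rng.standard_normal(8)
f_hat = map_estimate(y, np.eye(8), mu, Sigma, 0.0)
```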

Once again, we are approximating basically every one of the patches by one Gaussian, and it's kind of like selecting, from this very large dictionary, one block of 64 atoms, which corresponds to the Gaussian that we are selecting. Now, MAP-EM alternates between the estimation of the Gaussians and the estimation of the appropriate Gaussian for a given signal, or for a given patch.

So again, we are doing this maximum a posteriori expectation maximization, and that does more or less the following.

It estimates all of the Gaussians, then basically asks, okay, which patches like Gaussian number one more than any other Gaussian, and then, with all those patches, it reestimates Gaussian number one.

And it's very similar to what we talked about for the K-SVD, but instead of estimating the dictionary atoms, we basically estimate a whole Gaussian, meaning a mean and covariance matrix, or a PCA basis.
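
A highly simplified sketch of this alternation (hard assignments, a few iterations; the function names, the small regularization term, and the toy data are my own, not the exact algorithm from the literature):

```python
import numpy as np

def gaussian_log_density(X, mu, Sigma):
    """Log-density of N(mu, Sigma) at each row of X, up to a constant."""
    diff = X - mu
    _, logdet = np.linalg.slogdet(Sigma)
    maha = np.einsum('ij,ij->i', diff @ np.linalg.inv(Sigma), diff)
    return -0.5 * (maha + logdet)

def map_em(patches, K, iters=4, seed=0):
    """Alternate Gaussian re-estimation and best-Gaussian assignment."""
    rng = np.random.default_rng(seed)
    N, d = patches.shape
    labels = rng.integers(K, size=N)          # random initial assignment
    for _ in range(iters):
        mus, Sigmas = [], []
        for k in range(K):                    # re-estimate each Gaussian
            Pk = patches[labels == k]
            if len(Pk) < 2:                   # guard against empty clusters
                Pk = patches
            mus.append(Pk.mean(axis=0))
            Sigmas.append(np.cov(Pk, rowvar=False) + 1e-6 * np.eye(d))
        # Assign every patch to the Gaussian that fits it best.
        scores = np.stack([gaussian_log_density(patches, m, S)
                           for m, S in zip(mus, Sigmas)])
        labels = scores.argmax(axis=0)
    return labels, mus, Sigmas
```

Here every patch gets a hard label; reconstruction then applies the selected Gaussian's linear filter, and in practice a few iterations suffice.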

And then we keep iterating, normally, we do that for three, four

iterations until we converge. What we observe here is that we have color-coded this image according to the Gaussian that a given pixel is selecting, a pixel meaning the patch around it.

This is the initialization, or after one or two iterations, and this is after

that. And what we see is that similar regions

of the image end up selecting the same Gaussian.

Once again, in this type of approach, we only allow selecting a single Gaussian for every patch, meaning for every pixel. And again, we are going to do the

recovery by averaging all the patches that touch the pixel as we have been

doing for sparse modeling. So now, it looks like we are not doing sparse modeling; it looks like what we are doing is representing every patch as one out of K Gaussians. But let's see that what we are actually doing is a very particular case of sparse modeling.

Why is that? In ordinary sparse modeling, we have a

large dictionary, normally, overcomplete,

and we are allowed to select L atoms. So we have K here, a large K,

and we are allowed to select L atoms. In this case, three.

And remember, how many options do we have?

We have K choose L, and that is normally a very large number.

If we take the regular patch sizes that we are using for image processing, which are 5x5 or 8x8, and then a dictionary which is 256 or 512 or 1000 atoms, and a sparsity on the order of five to ten, this number becomes huge, a very large number of subspaces that we can pick from. That has the advantage of being a very rich model, but on the other hand, it makes the model very unstable.
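
To get a feel for the numbers, a quick computation with Python's standard library (the specific K, L, and block count are just the illustrative values from above):

```python
from math import comb

# Ordinary sparse model: any L atoms out of a K-atom dictionary.
K, L = 512, 8
n_subspaces_sparse = comb(K, L)   # number of possible supports

# Structured (block) model: one block out of, say, 10 Gaussians.
n_subspaces_block = 10

print(n_subspaces_sparse)   # an astronomically large number
print(n_subspaces_block)
```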

There is so much to choose from that, basically, if we make a small change in the image or in the patch, we might end up selecting something completely different. On the other hand, consider the Gaussian mixture model, which, as we said, is equivalent to a principal component analysis.

So we take the covariance matrix and we decompose it to get the principal

component analysis from the eigenvectors of the covariance matrix and then we

concatenate them. So this is one Gaussian,

another Gaussian, another Gaussian, another Gaussian, and so on.

And we have K Gaussians, let's say, 10.

Here are the eigenvectors corresponding to the covariance for the first Gaussian,

the second, the third, the fourth, and so on.

Here, I show that for five. Now, when you pick a Gaussian, you pick

the entire block. And then you fit that Gaussian the best you can, or you use the PCA basis as the atoms of your dictionary.

But look what happened. We went from this huge number to only about 10 or 20 options, because once you pick the Gaussian, as I mentioned, it's a linear operation: you just project onto it and you already have the atoms.

You already have everything. So in both cases, you learn the

dictionary. But here, you learn a dictionary where

every atom is independent and basically the coding with the atoms is independent.

Here we learn a dictionary that has a block structure.

And also when you are representing, when you are projecting, you pick entire

blocks. And this is a concept that is called

structure sparsity, where basically, the atoms have

correlations and they're either picked together or they are not picked at all.

This is a very particular case of what's called block structure.

There is no overlapping, because you are only allowed to pick one of the Gaussians, and, basically, in principle there is no atom that belongs to more than one of the Gaussians if the Gaussians are different. So you only pick one block, which is a particular case of structured sparsity.
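
As a simplified sketch of this pick-one-block rule (the real selection would use the full Gaussian log-likelihood including the eigenvalues; the tiny orthonormal blocks here are made up):

```python
import numpy as np

def select_block(patch, blocks):
    """Pick the single block whose (orthonormal) atoms capture the most
    energy of the patch -- atoms are chosen together or not at all."""
    energies = [np.linalg.norm(B.T @ patch) for B in blocks]
    return int(np.argmax(energies))

# Toy example: two 2-atom blocks spanning disjoint directions in R^4.
blocks = [np.eye(4)[:, :2], np.eye(4)[:, 2:]]
patch = np.array([0.1, 0.2, 3.0, 4.0])
chosen = select_block(patch, blocks)   # block 1 carries most of the energy
```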

The overall concept of structured sparsity is, again, the fact that atoms come and go together. They are learned together, they are used to encode the signals together, and that stabilizes your system much more because the number of options is much lower. You either pick one and not the others, or

you pick the other, but you're not allowed to pick another

from here, another from here, and another from here,

you pick in blocks. Now, of course, as before, there is a lot

of very interesting theory here, but it's also very important to know if

this is actually useful. If I need 10 million Gaussians, I'm not

gaining much. But actually, it turns out that with 10 or 20 Gaussians, you can get basically some of the same results that we

saw for ordinary sparse coding, but with a much simpler algorithm and a

much more stable one. So again, it's sparse coding, sparse modeling, but in a structured fashion.

Let me show you a few examples of this. This is one of the examples that we have

seen before, but now with this Gaussian mixture model, or with this structured sparsity, or basically piecewise linear modeling, because once we pick the Gaussian, we are in the linear regime.

Again, we have the original here, it's a zoom in to the original,

only 20% available, and here is the reconstruction.

We have seen a similar example with sparse coding.

Now, we are seeing it with the Gaussian mixture models.

Another example is zooming, we start from a very small image and we basically zoom

in, we want to make it larger and we get really, really nice results with this

technique, with very sharp edges. So, this is another type of model, another type of sparsity with structure incorporated, and also we get a very

strong connection with a classical model which is these Gaussian mixture models.

With this, we are actually concluding the week on sparse models.

We could spend a whole year teaching about sparse models, but we have provided

the key concepts. We have presented sparse coding, we have explained how to learn the dictionaries, we have talked about some of the theory and some of the computational techniques, and we talked about the relationship between

sparse modeling and more classical techniques like Gaussian mixture models.

And from that, the new concept of structured sparsity.

This, as I say, is a topic of very active research currently in the image

and signal processing community. And I hope you have enjoyed this week.

And I'm looking forward to seeing you next week once again.

Thank you very much.