0:03

Okay. So, we decided to model our distribution

p of x by using a continuous mixture of Gaussians.

So, let's develop this idea.

To define this model fully,

we have to define the prior and the likelihood.

And let's define the prior to be just the standard normal, because, why not?

It will just force the latent variables t

to be around zero and with unit variance.

And for the likelihood, we decided that we will use Gaussians, right?

With parameters that depend on t somehow.

So, how can we define these parameters,

this parametric way to convert t to the parameters of the Gaussian?

Well, if we use a linear function for mu of t, with some parameters W and b, and a constant

for sigma of t, we'll get the usual PPCA model. This Sigma zero can

be a parameter, or maybe just the identity matrix; it doesn't matter that much.

And this probabilistic PCA model is

really nice, but it's not powerful enough for our kind of data, natural images.
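The PPCA generative process just described can be sketched as follows. This is a toy illustration, not code from the course; the dimensions and parameter values are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

d_latent, d_data = 2, 5                    # toy sizes, chosen for illustration
W = rng.normal(size=(d_data, d_latent))    # parameters of the linear map mu(t) = W t + b
b = rng.normal(size=d_data)
sigma0 = 0.1                               # constant noise scale (sigma(t) = sigma0 * I)

# PPCA generative process:
# t ~ N(0, I)               -- standard normal prior on the latent variable
t = rng.normal(size=d_latent)
# x | t ~ N(W t + b, sigma0^2 I)  -- linear mean, constant covariance
x = W @ t + b + sigma0 * rng.normal(size=d_data)
```

Because both the prior and the likelihood are Gaussian and the mean is linear in t, the marginal distribution of x here is itself Gaussian, which is exactly why this model is too weak for natural images.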

So, let's think what can we change to make this model more powerful.

If a linear function is not powerful enough for our purposes,

let's use a convolutional neural network, because it works well for image data.

Right? So, let's say that mu of t is

some convolutional neural network applied to the latent code

t. So it gets as input the latent code t and outputs a mean vector for an image.

And then sigma of t is also a convolutional neural network, which

takes the latent code t as input and outputs a covariance matrix Sigma.

This will define our model in some kind of parametric form.
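A minimal sketch of such a decoder is below. To keep it self-contained it uses a tiny fully connected network standing in for the convolutional network from the lecture; all sizes and the architecture are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)

d_latent, d_hidden, d_pixels = 8, 32, 64   # toy sizes, assumed for illustration

# The weights w of the decoder network (a tiny MLP standing in
# for the convolutional network in the lecture).
W1 = rng.normal(size=(d_hidden, d_latent)) * 0.1
b1 = np.zeros(d_hidden)
W_mu = rng.normal(size=(d_pixels, d_hidden)) * 0.1
b_mu = np.zeros(d_pixels)
W_sig = rng.normal(size=(d_pixels, d_hidden)) * 0.1
b_sig = np.zeros(d_pixels)

def decoder(t):
    """Map a latent code t to the parameters mu(t), sigma(t) of p(x | t, w)."""
    h = np.tanh(W1 @ t + b1)            # shared hidden layer
    mu = W_mu @ h + b_mu                # mean vector, one value per pixel
    sigma = np.exp(W_sig @ h + b_sig)   # exp keeps the std devs positive
    return mu, sigma

t = rng.normal(size=d_latent)               # t ~ N(0, I)
mu, sigma = decoder(t)
x = mu + sigma * rng.normal(size=d_pixels)  # sample an image x | t
```

Note the `exp` on the sigma head: the network outputs unconstrained real numbers, so something must map them to valid (positive) standard deviations.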

So we have a model like this.

And let's emphasize that we have some weights of the neural network,

w. Let's put them in all parts of our model definition,

so we do not forget about them.

We are going to train the model with respect to these weights.

So p of x given the weights of the neural network w is a mixture of Gaussians,

where the parameters of the Gaussians depend on

the latent variable t through a convolutional neural network.

One problem here is that if for example your images are 100 by 100,

then you have just 10000 pixels in each image and it's pretty low resolution.

It's not high-end in any way,

but even in this case,

your covariance matrix will be 10,000 by 10,000. And that's a lot.

So we want to avoid that, and it's not so reasonable to

ask our neural network to output a 10,000 by 10,000 matrix.

To get rid of this problem let's just say that our covariance matrix will be diagonal.

Instead of outputting the whole large matrix Sigma,

we'll ask our neural network to produce

just the elements on the diagonal of this covariance matrix.

So we will have 10,000 sigmas here, for example, and we will

put these numbers on the diagonal of

the covariance matrix to define the actual normal distribution,

conditioned on the latent variable t. Now our conditional distributions are factorized:

they are Gaussians with zero off-diagonal elements in the covariance matrix. But it's okay.

A mixture of factorized Gaussians is not itself a factorized distribution,

so we don't lose much flexibility here.
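The payoff of the diagonal restriction is that the log-density of one image can be computed in O(D) without ever materializing the D x D covariance matrix. A sketch, with a sanity check against the full-covariance formula on a small hypothetical example:

```python
import numpy as np

def diag_gaussian_logpdf(x, mu, sigma):
    """log N(x | mu, diag(sigma^2)), computed per-pixel in O(D),
    without ever forming the D x D covariance matrix."""
    return np.sum(
        -0.5 * np.log(2 * np.pi)
        - np.log(sigma)
        - 0.5 * ((x - mu) / sigma) ** 2
    )

# Sanity check against the general multivariate Gaussian formula.
rng = np.random.default_rng(2)
D = 4
x, mu = rng.normal(size=D), rng.normal(size=D)
sigma = np.exp(rng.normal(size=D))

Sigma = np.diag(sigma ** 2)          # full matrix, built only for the check
full = (-0.5 * D * np.log(2 * np.pi)
        - 0.5 * np.linalg.slogdet(Sigma)[1]
        - 0.5 * (x - mu) @ np.linalg.inv(Sigma) @ (x - mu))

print(np.isclose(diag_gaussian_logpdf(x, mu, sigma), full))  # True
```

For a 100 by 100 image this is 10,000 cheap scalar operations instead of inverting a 10,000 by 10,000 matrix.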

We have our model fully defined;

now we have to train it somehow.

The natural way to do it is to use maximum likelihood estimation

so to maximize the density of our data set given the parameters;

the parameters of the convolutional neural network.

This density can be rewritten as an integral where we marginalize

out the latent variable t. Since we have a latent variable,

let's use expectation maximization algorithm.

It is specifically invented for these kind of models.
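For intuition about that marginalization integral, here is a naive Monte Carlo estimate of p(x | w) on a hypothetical one-dimensional toy model (this is just for illustration; it is not how the model is actually trained):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy 1-D model: p(t) = N(0, 1), p(x | t) = N(t, 1),
# so the exact marginal is p(x) = N(0, 2).
def likelihood(x, t):
    return np.exp(-0.5 * (x - t) ** 2) / np.sqrt(2 * np.pi)

x = 0.7
K = 200_000
t_samples = rng.normal(size=K)                  # t_k drawn from the prior
p_x_estimate = likelihood(x, t_samples).mean()  # (1/K) * sum_k p(x | t_k)

p_x_exact = np.exp(-0.25 * x ** 2) / np.sqrt(4 * np.pi)
print(p_x_estimate, p_x_exact)   # the two numbers should be close
```

In the real model the likelihood term contains a convolutional neural network and t is high-dimensional, which is exactly why such brute-force estimates (and exact integration) become impractical.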

And in the expectation maximization algorithm,

if you recall from week two,

we build a lower bound on the logarithm of this marginal likelihood,

p of x given w, and we lower-bound

this value by something which depends on w and some new variational parameters q.

And then we'll maximize this lower bound with respect to

both w and q, to push this lower bound

as high as possible, so it is as close to the

actual marginal log-likelihood as possible.
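Written out (in the notation of week two, with X the data and T the latent variables), the bound being maximized is:

```latex
\log p(X \mid w) \;\ge\; \mathcal{L}(w, q)
  \;=\; \mathbb{E}_{q(T)}\,\log p(X, T \mid w) \;-\; \mathbb{E}_{q(T)}\,\log q(T)
```

The gap between the two sides is the KL divergence between q(T) and the true posterior p(T | X, w), which is why maximizing over q tightens the bound.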

And the problem here is that on the E-step of

the expectation maximization algorithm, we

have to find the posterior distribution on the latent variables.

And this is intractable in this case, because you have to compute

some integrals, and these integrals contain convolutional neural networks inside them.

And this is just too hard to do analytically.

So E-M is actually not the way to go here. So what else can we do?

Well, in the previous week we discussed Markov chain Monte Carlo, and we can

use this MCMC to approximate the M-step of expectation maximization.

Right. This way, on the M-step,

instead of using the expected value with respect to q,

which is the posterior distribution on the latent variables from

the previous iteration, we will approximate this expected value with samples,

with an average, and then we'll maximize this approximation instead of the expected value.
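A minimal sketch of this sample-average M-step on a hypothetical one-dimensional model (the "MCMC samples" are faked with a normal distribution, since all the M-step ever sees is the bag of samples):

```python
import numpy as np

rng = np.random.default_rng(4)

# Pretend MCMC has produced samples t_k from the posterior
# q(t) = p(t | x, w_old); here we just draw them from a normal.
t_samples = rng.normal(loc=1.0, scale=0.5, size=10_000)
x = 2.0

def log_joint(x, t, w):
    """log p(x, t | w) for a toy model: p(t) = N(0, 1), p(x | t, w) = N(w*t, 1),
    written up to additive constants that don't depend on w."""
    return -0.5 * t ** 2 - 0.5 * (x - w * t) ** 2

# M-step objective: E_q[log p(x, t | w)], approximated by the sample average.
def mstep_objective(w):
    return log_joint(x, t_samples, w).mean()

# Maximize over w on a grid (in practice: gradient ascent on network weights).
grid = np.linspace(-3, 3, 601)
w_best = grid[np.argmax([mstep_objective(w) for w in grid])]
print(w_best)
```

For this toy model the optimum is w = x * E[t] / E[t^2], so with these sample statistics `w_best` lands near 1.6.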

It's an option we can do that.

Well, it's going to be kind of slow, because this way, on each iteration of

expectation maximization, you have to run like hundreds of iterations of the Markov chain,

wait until it has converged, and then start to collect samples.

So this way you will have kind of a nested loop:

you will have outer iterations of expectation maximization and inner iterations of

Markov chain Monte Carlo, and this will probably not be very fast to do.

So let's see what else can we do.

Well, we can try variational inference, and the idea of variational inference

is to maximize the same lower bound

but to restrict the distribution q to be factorized.

So for example, if the latent variable t for each data object

is 50-dimensional, then this q_i

of t_i will be just a product of

50 one-dimensional distributions. So it's a nice way to go,

it's a nice approach.
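Concretely, a factorized (mean-field) q for a 50-dimensional latent looks like this sketch, where everything decomposes per dimension (the parameter values here are arbitrary, for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)

d = 50  # latent dimensionality, as in the example above

# A factorized q_i(t_i): a product of 50 one-dimensional Gaussians,
# so only 2 * 50 variational parameters per data object
# instead of a full 50 x 50 covariance matrix.
m = rng.normal(size=d)          # per-dimension means
s = np.exp(rng.normal(size=d))  # per-dimension std devs (kept positive)

def log_q(t):
    """log q(t) = sum_j log N(t_j | m_j, s_j^2) -- a sum of 1-D terms."""
    return np.sum(-0.5 * np.log(2 * np.pi) - np.log(s)
                  - 0.5 * ((t - m) / s) ** 2)

t = m + s * rng.normal(size=d)  # sampling also factorizes per dimension
print(log_q(t))
```

Both sampling and density evaluation cost O(d), which is what makes the mean-field restriction attractive.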

It only approximates expectation maximization, but it usually works and is pretty fast.

But it turns out that in this case even this is intractable.

So even this approximation is not enough to get

an efficient method for training our latent variable model.

And we have to approximate even further.

So we have to derive an even less accurate approximation to be

able to build an efficient method for training this kind of model.