[MUSIC]

In this section we will review dropout and its connections with the Bayesian framework.

So dropout was invented in 2012, and became a popular regularization technique.

We know that it works.

We know that it prevents overfitting.

And the essence of dropout is actually just the injection of noise into the weights, or into the activations, at each iteration of training.

The magnitude of this noise is defined by the user, and is usually called the dropout rate.

The noise can be different.

For example, it can be Bernoulli noise, and then we are talking about binary dropout.

Or it can be Gaussian noise, and then we are talking about Gaussian dropout.
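
As a quick sketch of the two variants (the rate p and the matching variance alpha = p / (1 - p) are illustrative choices, following the usual correspondence from the dropout literature):

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.standard_normal(5)

p = 0.5                  # dropout rate (illustrative)
alpha = p / (1 - p)      # Gaussian variance matching rate p (an assumption here)

# Binary dropout: Bernoulli mask, rescaled so the multiplier has mean 1.
mask = rng.binomial(1, 1 - p, size=activations.shape) / (1 - p)
binary_out = activations * mask

# Gaussian dropout: multiplicative noise with mean 1 and variance alpha.
noise = rng.normal(1.0, np.sqrt(alpha), size=activations.shape)
gaussian_out = activations * noise
```

In both cases the multiplier has expectation 1, so the activations keep their mean on average; only the form of the noise differs.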

Let us review Gaussian dropout in detail.

At each iteration of training, we generate Gaussian noise Epsilon ij from a Gaussian distribution with a mean of 1 and variance Alpha.

Then we multiply each weight Theta ij by Epsilon ij, and obtain noised versions of the weights, wij.

And finally, we compute the stochastic gradient of the log-likelihood given these noised weights, w.

But then we obtain exactly the same stochastic gradient as we would if we optimized, with respect to Theta, the expectation of the log-likelihood over a Gaussian distribution on w with mean Theta and variance Alpha Theta squared.
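
This distributional equivalence (multiplying Theta by noise with mean 1 and variance Alpha, versus sampling w directly from a Gaussian with mean Theta and variance Alpha Theta squared) can be checked numerically; the values of theta and alpha below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, alpha = 2.0, 0.3          # illustrative values
n = 200_000

# Noise-injection view: w = theta * eps, with eps ~ N(1, alpha).
w_noise = theta * rng.normal(1.0, np.sqrt(alpha), size=n)

# Distributional view: w ~ N(theta, alpha * theta**2).
w_direct = rng.normal(theta, np.sqrt(alpha) * abs(theta), size=n)

# Both samples have mean theta and variance alpha * theta**2.
```

The two views describe the same Gaussian; the second one is what lets us read Gaussian dropout as a variational approximation over the weights.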

So the distribution itself is fully factorized.

To show it, let us first perform the reparameterization trick.

So we change the distribution over w to the distribution over Epsilon.

Epsilon now has a mean of 1 and variance of Alpha, and it is still fully factorized.

And the log-likelihood is computed at the point Theta times Epsilon.

Now our probability density doesn't depend on theta, and

we may move the differentiation inside of our integral.

And then we may change the integral to its Monte Carlo estimate, and

obtain exactly the same expression as it was on the previous slide.
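
This reparameterized Monte Carlo gradient can be sanity-checked on a toy quadratic loss, where the expectation has a closed form; the loss f(w) = w squared and the parameter values below are purely illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
theta, alpha = 1.5, 0.25         # illustrative values
eps = rng.normal(1.0, np.sqrt(alpha), size=500_000)

# Toy objective f(w) = w**2, so E_{w ~ N(theta, alpha*theta^2)}[f(w)]
# = theta**2 * (1 + alpha); its exact gradient wrt theta is:
exact_grad = 2 * theta * (1 + alpha)

# Reparameterized estimate: differentiate f(theta * eps) wrt theta
# inside the expectation, then average over the noise samples.
mc_grad = np.mean(2 * theta * eps**2)
```

Because the noise density no longer depends on theta, differentiation and sampling commute, and the Monte Carlo average converges to the exact gradient.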

So now we know that Gaussian dropout optimizes the following objective.

It is the expectation of the log-likelihood with respect to the distribution over w; the distribution is a fully factorized Gaussian, with a mean of Theta ij and variance Alpha Theta ij squared.

So this looks pretty much the same as the first term in the ELBO, where as the variational approximation we use a fully factorized Gaussian distribution.

But where is the second term?

Where is the KL divergence?

Remember that the ELBO consists of two terms, the data term and the negative KL divergence, which is our regularizer.
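
In standard notation, this decomposition reads (a sketch; here q is the fully factorized Gaussian approximation from above):

```latex
\mathcal{L}(\theta, \alpha)
  = \underbrace{\mathbb{E}_{q(W \mid \theta, \alpha)}\,
      \log p(\mathcal{D} \mid W)}_{\text{data term}}
  \;-\;
    \underbrace{\mathrm{KL}\bigl(q(W \mid \theta, \alpha)
      \,\|\, p(W)\bigr)}_{\text{regularizer}}
```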

In Gaussian dropout, we have shown that we are optimizing just the first term with respect to Theta.

So if we managed to find a prior distribution, p(W), such that the second term depends only on Alpha and not on Theta, then these two procedures would be exactly equivalent.

So remember that in Gaussian dropout, Alpha is assumed to be fixed.

And if Alpha is fixed, then optimization of ELBO is equivalent to

the optimization of just the first term with respect to Theta.

Surprisingly, such a prior distribution exists, and it is known from information theory.

So this is a so-called improper log-uniform prior.

It is fully factorized again,

and each of its factors is proportional to 1 over absolute value of wij.

This is an improper distribution, so it cannot be normalized.

Nevertheless, it has several quite nice properties.

For example, if we consider the logarithm of the absolute value of wij, it is easy to show that it is uniformly distributed from minus to plus infinity.
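
This property follows from a one-line change of variables (a sketch):

```latex
p(|w_{ij}|) \propto \frac{1}{|w_{ij}|}, \qquad
y = \log |w_{ij}| \;\Longrightarrow\;
p(y) = p(|w_{ij}|)\,\left|\frac{d|w_{ij}|}{dy}\right|
     \propto \frac{1}{|w_{ij}|}\cdot|w_{ij}| = \mathrm{const},
```

so y is uniform over the whole real line, which is exactly why the prior is improper.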

And again, this is improper probability distribution.

For us, it is important that this prior distribution, roughly speaking, does not care about the precision with which we are trying to find wij.

We may easily show that the KL divergence between our Gaussian variational approximation and this prior distribution depends only on Alpha, and does not depend on Theta.
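
This independence from Theta can be checked with a small Monte Carlo sketch (the parameter values are illustrative; the prior is used up to its normalizing constant, which is all the KL comparison needs):

```python
import numpy as np

def kl_up_to_const(theta, alpha, eps):
    """MC estimate of KL(N(theta, alpha*theta^2) || log-uniform), up to a
    constant, using shared noise samples eps ~ N(1, alpha)."""
    w = theta * eps                                    # w ~ N(theta, alpha*theta^2)
    log_q = (-0.5 * np.log(2 * np.pi * alpha * theta**2)
             - (w - theta) ** 2 / (2 * alpha * theta**2))
    log_p = -np.log(np.abs(w))                         # log-uniform, up to const
    return np.mean(log_q - log_p)

rng = np.random.default_rng(0)
alpha = 0.5
eps = rng.normal(1.0, np.sqrt(alpha), size=100_000)    # shared noise

a = kl_up_to_const(1.0, alpha, eps)
b = kl_up_to_const(5.0, alpha, eps)
# a == b up to floating point: the log|theta| terms cancel exactly
```

Writing w = theta * eps makes the cancellation visible per sample: the log|theta| contributions from the entropy of q and from the prior cancel, leaving a function of alpha alone.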

The KL divergence is still an intractable function, but now it is a function of just the one-dimensional parameter Alpha, and it can be easily approximated by a smooth, differentiable function.

So in the figure you see black dots; these are the exact values of the KL divergence for different values of Alpha.

And the red curve is our smooth, differentiable approximation.

And the existence of this smooth, differentiable approximation means that potentially we may optimize the KL divergence with respect to Alpha.

And hence, we optimize the ELBO with respect to both Theta and Alpha.

And this is what we are going to do in the next lecture.
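
One concrete example of such a smooth approximation is the cubic polynomial reported in the variational dropout paper of Kingma et al. (2015); the constants below are taken from that paper and should be treated as an assumption here, since the lecture itself does not derive them:

```python
import numpy as np

# Polynomial approximation of the negative KL term as a function of
# alpha alone (constants from Kingma et al., 2015; an assumption here).
C1, C2, C3 = 1.16145124, -1.50204118, 0.58629921

def neg_kl_approx(alpha):
    """Smooth, differentiable approximation of -KL(q || p) + const,
    intended for 0 < alpha <= 1."""
    return 0.5 * np.log(alpha) + C1 * alpha + C2 * alpha**2 + C3 * alpha**3
```

The 0.5 log alpha term dominates for small alpha, so the approximation is increasing on (0, 1]: the regularizer rewards larger noise levels.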

So to conclude, dropout is a popular regularization technique.

The essence of dropout is the injection of noise at each iteration of training.

In this lecture, we have shown that one of the popular dropouts,

so-called Gaussian dropout,

is exactly equivalent to a special kind of variational Bayesian procedure.

And this understanding, the understanding that dropout is a particular case

of Bayesian inference, allows us to construct various generalizations of

dropout that may possess several quite interesting properties.

We'll review one of them in the next lecture.

[MUSIC]