Welcome to week five of our course.

This week we're going to talk about how to scale Bayesian methods to large data sets.

So, even 10 years ago,

people used to think that Bayesian methods are mostly suited for small data sets because, first of all,

they're computationally expensive.

So if you want to do full Bayesian inference on, say, one million training examples,

you are going to run into a lot of trouble.

And second of all, they may not be beneficial anyway in the case of

large data, because people used to think that the main benefit of

Bayesian methods is to exploit the prior knowledge in your model,

and to extract as much information as possible from a small data set.

And if you have a large data set, then you don't need that:

you can use any method you want and it will work just fine.

But then things changed:

Bayesian methods met deep learning,

and people started to build

hybrid models that use neural networks inside probabilistic models.

And this is what this week will be about,

how to combine neural networks with the Bayesian methods.

So we'll discuss how to combine these two ideas.

We'll see a particular example, the variational autoencoder,

which allows you to generate nice samples,

nice images, by using a neural network that has a probabilistic interpretation.

And then, in the second module, Professor Dmitry Vetrov

will tell you about scalable methods for Bayesian neural networks,

and about his cutting-edge research in

this area, which allowed him to compress neural networks by a lot,

and to fight severe overfitting on some complicated data sets.

So, to start with,

let's discuss the concept of an estimate being unbiased.

We already touched on that in the previous week, week four,

on Markov chain Monte Carlo,

but let's make things a little clearer here.

We'll need this to build unbiased estimates of gradients for some neural networks.

So, let's say you want to estimate an expected value.

If you're using Monte Carlo estimation,

you will substitute it with an average

over samples taken from that distribution, p(x).

And this idea may look like this.

So here, the blue line is your distribution

p(x), and you can generate samples from it like this.

And then you can take the average of f(x) on this sample set, and it can look,

for example, like the red cross here.
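As a concrete illustration of this sampling-and-averaging procedure (not from the lecture — a minimal sketch that assumes X ~ N(0, 1) and f(x) = x², so the true expectation E[f(X)] is exactly 1):

```python
import random

def monte_carlo_estimate(f, draw, n):
    """Approximate E[f(X)] by the average of f over n samples from p(x)."""
    return sum(f(draw()) for _ in range(n)) / n

# Toy setup: X ~ N(0, 1) and f(x) = x^2, so the true value E[f(X)] = 1.
random.seed(0)
estimate = monte_carlo_estimate(lambda x: x * x,
                                lambda: random.gauss(0.0, 1.0),
                                n=10_000)
print(estimate)  # should land close to 1.0
```

Each run of this procedure gives one "red cross": a single realization of the estimator.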

And this average is actually a random variable.

So if you repeat this process,

if you generate another set of samples,

and again write down their average,

you will get some other approximation of your expected value.

And by repeating this process more and more times,

you can get samples of this random variable R.

And this random variable has its own distribution,

and its average, its expected value,

exactly equals the expected value of f(x) which we wanted to estimate.

So you can see that all these samples of the random variable R

lie close to the expected value which we want to estimate; they are scattered around it.

Which basically means that if we use

more samples, like a hundred samples for each estimate,

then we will make more accurate estimates.

These averages R will lie close to the expected value which we want to approximate,

and the more samples we use,

the more accurate the estimate becomes:

its distribution gets more and more peaked around the true value.

And to put it formally,

this is the definition of an unbiased estimate.

An estimate R is called unbiased if

its expected value equals the thing we want to approximate.

So, if this holds, then

the samples of R lie around the expected value which we want to approximate.
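We can check both properties empirically — that the samples of R center on the true value, and that more samples per estimate make R more peaked. This is a small sketch with hypothetical helper names, again assuming X ~ N(0, 1) and f(x) = x²:

```python
import random

def estimate_once(f, draw, n):
    """One Monte Carlo estimate R: an average of f over n fresh samples."""
    return sum(f(draw()) for _ in range(n)) / n

random.seed(1)
f = lambda x: x * x                    # true value: E[f(X)] = 1 for X ~ N(0, 1)
draw = lambda: random.gauss(0.0, 1.0)

# Draw many independent copies of R. Their mean sits near E[f(X)]
# (unbiasedness), and the spread shrinks as n grows (concentration).
for n in (10, 100, 1000):
    rs = [estimate_once(f, draw, n) for _ in range(500)]
    mean_r = sum(rs) / len(rs)
    spread = (sum((r - mean_r) ** 2 for r in rs) / len(rs)) ** 0.5
    print(f"n={n:4d}  mean of R = {mean_r:.3f}  spread = {spread:.3f}")
```

The mean of R stays near 1 for every n, while the spread drops roughly as 1/√n.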

But how can it not be true?

Well, if you look, for example, at the logarithm of an expected value,

and try to approximate it with Monte Carlo,

it's kind of natural to approximate it as the log of the sample average.

But it turns out that this is not an unbiased estimate.

So, if you look at the samples here,

all the samples of the random variable G

will lie to the left of the actual expected value.

So, even on average, you're underestimating the true value which you want to approximate:

because the logarithm is concave, Jensen's inequality gives E[log R] ≤ log E[R].

So all these red crosses are not around the true value,

but around some smaller value,

and thus you're not doing the right job:

you're doing a biased estimation of the logarithm of an expected value.
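This bias is easy to see numerically. A small sketch, not from the lecture, assuming X ~ N(0, 1) and f(x) = eˣ, so that E[f(X)] = e^(1/2) and the target log E[f(X)] is exactly 1/2:

```python
import math
import random

random.seed(2)
f = lambda x: math.exp(x)     # for X ~ N(0, 1): E[f(X)] = e^(1/2)
target = 0.5                  # so the target log E[f(X)] = 1/2

# G = log of a 10-sample average. The logarithm is concave, so by Jensen's
# inequality E[G] <= log E[f(X)]: G underestimates the target on average.
gs = []
for _ in range(2000):
    avg = sum(f(random.gauss(0.0, 1.0)) for _ in range(10)) / 10
    gs.append(math.log(avg))

mean_g = sum(gs) / len(gs)
print(mean_g, "<", target)    # the mean of G falls below log E[f(X)]
```

No matter how many times you repeat the 10-sample estimate, its average stays below 0.5; only increasing the number of samples inside each average shrinks the bias.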

And to summarize: an estimate is called unbiased if

its expected value equals the thing which you want to approximate.

And it's generally nontrivial to tell whether your estimator is unbiased or not.

Only for the simplest case is it easy:

an expected value of a function can be unbiasedly estimated

as an average over samples.

For anything more complicated than that,

you have to think carefully and check that you're not straying into biased territory.

And if you don't want to check, or if you can't,

then you're better off reducing

your particular problem to the form of just an expected value of some function,

and then estimating it with the sample average.

That's the way to go to be sure that your estimate is unbiased.