0:00

In this video, I'll talk about a different way of learning sigmoid belief nets. This different method arrived in an unexpected way. I stopped working on sigmoid belief nets and went back to Boltzmann machines, and discovered that restricted Boltzmann machines could actually be learned fairly efficiently.

Given that a restricted Boltzmann machine could efficiently learn a layer of nonlinear features, it was tempting to take those features, treat them as data,

and apply another restricted Boltzmann machine to model the correlations between

those features. And one can continue like this, stacking

one Boltzmann machine on top of the next one to learn lots of layers of nonlinear

features. This eventually led to a big resurgence

of interest in deep neural nets. The issue then arose: once you've stacked up lots of restricted Boltzmann machines, each of which is learned by modeling the patterns of feature activities produced by the previous Boltzmann machine, do you just have a set of separate restricted Boltzmann machines, or can they all be combined together into one model? Now, anybody sensible would expect that

if you combined a set of restricted Boltzmann machines together to make one

model, what you'd get would be a multilayer Boltzmann machine.

However, a brilliant graduate student of mine called Yee-Whye Teh figured out that that's not what you get.

You actually get something that looks much more like a sigmoid belief net.

This was a big surprise. It was very surprising to me that we'd

actually solved the problem of how to learn deep sigmoid belief nets by giving

up on it and focusing on learning undirected models like Boltzmann

machines. Using the efficient learning algorithm for restricted Boltzmann machines, it's easy to train a layer of features that receive input directly from the pixels.

We can treat the patterns of activation of those feature detectors as if they

were pixels, and learn another layer of features in a

second hidden layer. We can repeat this as many times as we

like with each new layer of features modelling the correlated activity in the

features in the layer below. It can be proved that each time we add

another layer of features, we improve a variational lower bound on the log

probability that some combined model would generate the data.
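For reference, the bound in question is the standard variational lower bound; it isn't written out in the lecture, so the notation below is mine, with q(h|v) denoting the posterior over the first hidden layer given by the first RBM.

```latex
% Keeping p(v|h) and q(h|v) fixed while replacing p(h) with a better model of the
% aggregated posterior can only increase the right-hand side.
\log p(v) \;\ge\; \sum_{h} q(h \mid v)\,\bigl[\log p(h) + \log p(v \mid h)\bigr]
\;-\; \sum_{h} q(h \mid v)\,\log q(h \mid v)
```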

The proof is actually complicated, and it only applies if you do everything just

right, which you don't do in practice.

But, the proof is very reassuring, because it suggests that something

sensible is going on when you stack up restricted Boltzmann machines like this.

The proof is based on a neat equivalence between a restricted bolson machine and

an infinitely deep belief net. So here's a picture of what happens when

you learn two restricted Boltzmann machines, one on top of the other,

and then you combine them to make one overall model, which I call a deep belief

net. So first we learn one Boltzmann machine

with its own weights. Once that's been trained, we take the

hidden activity patterns of that Boltzmann machine when it's looking at

data and we treat each hidden activity pattern as data for training a second

Boltzmann machine. So we just copy the binary states to the

second Boltzmann machine, and then we learn another Boltzmann machine.
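Here's a minimal sketch of that greedy stacking in Python with NumPy. It is my own illustration, not code from the lecture: it uses CD-1 updates, omits biases, and the layer sizes, learning rate, and toy data are made-up placeholders.

```python
# Train one RBM with CD-1, treat its binary hidden activities as data,
# and train the next RBM on those.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, num_hidden, epochs=10, lr=0.05):
    """Train one RBM with CD-1 and return its weight matrix (biases omitted)."""
    num_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((num_visible, num_hidden))
    for _ in range(epochs):
        for v0 in data:
            h0_prob = sigmoid(v0 @ W)                       # up: data -> hidden
            h0 = (rng.random(num_hidden) < h0_prob).astype(float)
            v1_prob = sigmoid(h0 @ W.T)                     # down: one-step reconstruction
            h1_prob = sigmoid(v1_prob @ W)                  # up again
            # CD-1 update: pairwise statistics on the data minus on the reconstruction.
            W += lr * (np.outer(v0, h0_prob) - np.outer(v1_prob, h1_prob))
    return W

def hidden_activities(data, W):
    """Stochastic binary hidden states the RBM produces when looking at data."""
    probs = sigmoid(data @ W)
    return (rng.random(probs.shape) < probs).astype(float)

data = (rng.random((100, 20)) < 0.3).astype(float)          # toy binary "pixel" data
weights, layer_input = [], data
for size in [15, 15]:                                       # illustrative hidden layer sizes
    W = train_rbm(layer_input, size)
    weights.append(W)
    layer_input = hidden_activities(layer_input, W)         # treat the features as data
```

The sketch starts each new RBM from small random weights; the transpose initialization discussed next is an alternative starting point.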

Now one interesting thing about this, is that if we start the second Boltzmann

machine off with W2 being the transpose of W1, and with as many hidden units in

h2 as there are in v, then the second Boltzmann machine will already be a

pretty good model of h1, because it's just the first model upside

down. And for a restricted Boltzmann machine,

it doesn't really care which you call visible and which you call hidden.

It's just a bipartite graph that learns a joint model of the two sets of units.

After we've learned those two Boltzmann machines, we're going to compose them

together to form a single model and the single model looks like this.

Its top two layers are just the same as the top restricted Boltzmann machine.

So that's an undirected model with symmetric connections, but its bottom two

layers are a directed model like a sigmoid belief net.

So what we've done is we've taken the symmetric connections between v and h1

and we've thrown away the up-going part of those and just kept the down-going part.

Understanding why we've done that is quite complicated, and it will be explained in video 13F. The resulting combined model is clearly

not a Boltzmann machine, because its bottom layer of connections are not

symmetric. It's a graphical model that we call a

deep belief net, where the lower layers are just like sigmoid belief nets and the

top two layers form a restricted Boltzmann machine.

So it's a kind of hybrid model. If we do it with three Boltzmann machines

stacked up, we'll get a hybrid model that looks like this.

The top two layers again are a restricted Boltzmann machine and the layers below

are directed layers like in a sigmoid belief net.

5:29

To generate data from this model the correct procedure is,

first of all, you go backwards and forwards between h2 and h3 to reach

equilibrium in that top-level restricted Boltzmann machine.

This involves alternating Gibbs sampling, where you update all of the units in h3

in parallel, and update all of the units in h2 in parallel,

then go back and update all of the units in h3 in parallel. And you go backwards

and forwards like that for a long time until you've got an equilibrium sample

from the top-level restricted Boltzmann machine.

So the top-level restricted Boltzmann machine is defining the prior

distribution of h2. Once you've done that, you simply go once

from h2 to h1 using the generative connections w2.

And then, whatever binary pattern you get in h1, you go once more to get generated

data, using the weights w1. So we're performing a top-down pass from

h2, to get the states of all the other layers,

just like in a sigmoid belief net. The bottom-up connections, shown in red

at the lower levels, are not part of the generative model.

They're actually going to be the transposes of the corresponding weights.

So they're the transpose of w1 and the transpose of w2,

and they're going to be used for inference, but they're not part of the generative model.
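As a rough illustration of that generative procedure (my own sketch, not code from the lecture; biases are omitted, the layer sizes and weights are placeholders that a trained model would supply, and the weight orientation is an assumption):

```python
# Alternating Gibbs sampling between h2 and h3 in the top-level RBM,
# then one top-down pass through the directed connections.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

n_v, n_h1, n_h2, n_h3 = 20, 15, 15, 15
W1 = 0.01 * rng.standard_normal((n_v, n_h1))     # weights between v and h1; generation uses the transpose
W2 = 0.01 * rng.standard_normal((n_h1, n_h2))    # weights between h1 and h2
W3 = 0.01 * rng.standard_normal((n_h2, n_h3))    # undirected top-level RBM between h2 and h3

def generate(num_gibbs_steps=200):
    # 1. Reach (approximate) equilibrium in the top-level RBM; this plays the role
    #    of the prior distribution over h2.
    h2 = sample(0.5 * np.ones(n_h2))
    for _ in range(num_gibbs_steps):
        h3 = sample(sigmoid(h2 @ W3))            # update all of h3 in parallel
        h2 = sample(sigmoid(h3 @ W3.T))          # update all of h2 in parallel
    # 2. A single top-down pass through the directed, generative connections.
    h1 = sample(sigmoid(h2 @ W2.T))              # go once from h2 to h1
    v = sample(sigmoid(h1 @ W1.T))               # go once more to get generated data
    return v

print(generate())
```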

Now, before I explain why stacking up Boltzmann machines is a good idea, I need to sort out what it means to average two factorial distributions. And it may surprise you to know that if I average two factorial distributions, I do not get a factorial distribution.

What I mean by averaging here is taking a mixture of the distributions, so you

first pick one of the two at random, and then you generate from whichever one you

picked. So, you don't get a factorial

distribution. Suppose we have an RBM with 4 hidden

units and suppose we give it a visible vector.

And given this visible vector, the posterior distribution over those 4

hidden units is factorial. And let's suppose the distribution was

that the first and second units have a probability of 0.9 of turning on and the

last two have a probability of 0.1 of turning on.

What it means for this to be factorial is that, for example, the probability that

the first two units will both be on in a sample from this distribution is exactly 0.81. Now suppose we have a different visible vector v2, and the posterior distribution over the same 4 hidden units is now 0.1,

0.1, 0.9, 0.9, which I chose just to make the math easy.

If we average those two distributions, the mean probability of each hidden unit

being on, is indeed, the average of the means for each distribution.

So the means are 0.5, 0.5, 0.5, 0.5, but what you get is not a factorial

distribution defined by those 4 probabilities.

To see that, consider the binary vector 1, 1, 0, 0 over the hidden units.

In the posterior for v1, that vector has a probability of 0.9^4, because it's 0.9 * 0.9 * (1 - 0.1) * (1 - 0.1). So that's about 0.66. In the posterior for v2, this vector is extremely unlikely. It has a probability of 0.1^4, which is 1 in 10,000. If we average those two probabilities for that particular vector, we'll get a probability of about 0.33, and that's much bigger than the probability assigned to the vector 1, 1, 0, 0 by a factorial distribution with means of 0.5. That probability will be 0.5^4, which is only about 0.06 and is much smaller.

So, the point of all this, is that when you average two factorial posteriors, you

get a mixture distribution that's not factorial.
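Here's a quick check of that arithmetic (my own snippet, matching the numbers above):

```python
# The mixture of the two factorial posteriors gives the vector (1, 1, 0, 0) a far
# higher probability than the factorial distribution with the same per-unit means
# (all 0.5) does.
p1 = [0.9, 0.9, 0.1, 0.1]        # posterior over the 4 hidden units given v1
p2 = [0.1, 0.1, 0.9, 0.9]        # posterior over the 4 hidden units given v2
target = [1, 1, 0, 0]

def prob_under_factorial(vector, means):
    """Probability of a binary vector under a factorial distribution."""
    result = 1.0
    for bit, m in zip(vector, means):
        result *= m if bit == 1 else (1.0 - m)
    return result

under_v1 = prob_under_factorial(target, p1)             # 0.9**4  ~ 0.66
under_v2 = prob_under_factorial(target, p2)             # 0.1**4  = 0.0001
mixture = 0.5 * under_v1 + 0.5 * under_v2               # ~ 0.33
same_means = prob_under_factorial(target, [0.5] * 4)    # 0.5**4  = 0.0625
print(under_v1, under_v2, mixture, same_means)
```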

Now, let's look at why the greedy learning works.

That is why it's a good idea to learn one restricted Boltzmann machine.

And then learn a second restricted Boltzmann machine that models the

patterns of activity in the hidden units of the first one.

The weights of the bottom level restricted Boltzmann machine, actually

define four different distributions. Of course, they define them in a

consistent way. So the first distribution is the

probability of the visible units given the hidden units.

And the second one is the probability of the hidden units given the visible units.

And those are the two distributions we use for running our alternating Markov chain that updates the visibles given the hiddens and then updates the hiddens

given the visibles. If we run that chain long enough, we'll

get a sample from the joint distribution of v and h.

And so the weights clearly also define the joint distribution.

They also define the joint distribution more directly, in terms of e^(-energy) normalized by the partition function, but for nets with a large number of units, we can't compute that. If you take the joint distribution p(v,h) and you just sum out v, we get a distribution over h.

That's the prior distribution over h, defined by this restricted Boltzmann

machine. And similarly, if we sum out h, we have

the prior distribution over v, defined by the restricted Boltzmann machine.

And now, we're going to pick a rather surprising pair of distributions from

those four distributions. We're going to define the probability

that the restricted Boltzmann machine assigns to a visible vector v as the sum

over all hidden vectors of the probability it assigns to h times the

probability of v given h. This seems like a silly thing to do,

because defining p(h) is just as hard as defining p(v).

And nevertheless, we're going to define p(v) that way.
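For a tiny RBM, you can make all of this concrete by brute-force enumeration. The following sketch is my own illustration (made-up weights, biases omitted): it builds the joint from e^(-energy) over Z, reads off the two marginals and p(v|h), and checks that summing p(h) times p(v|h) over all h does recover p(v).

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 3, 2
W = rng.standard_normal((n_v, n_h))

def energy(v, h):
    return -(v @ W @ h)

configs = [(np.array(v), np.array(h))
           for v in itertools.product([0, 1], repeat=n_v)
           for h in itertools.product([0, 1], repeat=n_h)]
unnormalized = np.array([np.exp(-energy(v, h)) for v, h in configs])
joint = unnormalized / unnormalized.sum()        # p(v, h); only feasible for tiny nets

def p_v(v_q):
    """Marginal over v: sum out h."""
    return sum(p for (v, h), p in zip(configs, joint) if np.array_equal(v, v_q))

def p_h(h_q):
    """Marginal over h: sum out v -- the RBM's prior over its hidden units."""
    return sum(p for (v, h), p in zip(configs, joint) if np.array_equal(h, h_q))

def p_v_given_h(v_q, h_q):
    joint_vh = sum(p for (v, h), p in zip(configs, joint)
                   if np.array_equal(v, v_q) and np.array_equal(h, h_q))
    return joint_vh / p_h(h_q)

v0 = np.array([1, 0, 1])
decomposed = sum(p_h(np.array(h)) * p_v_given_h(v0, np.array(h))
                 for h in itertools.product([0, 1], repeat=n_h))
print(np.isclose(p_v(v0), decomposed))           # True: p(v) = sum over h of p(h) p(v|h)
```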

Now, if we leave p(v|h) alone, but learn a better model of p(h), that is, learn some new parameters that give us a better model of p(h) and substitute that in instead of the old model we had of p(h), we'll actually improve our model of v. And what we mean by a better model of

p(h) is a prior over h that fits the aggregated posterior better.

The aggregated posterior is the average over all vectors in the training set of

the posterior distribution over h. So, what we're going to do, is use our

first RBM to get this aggregated posterior and then use our second RBM to

build a better model of this aggregated posterior than the first RBM has.

And if we start the second RBM off as the first one upside down, it will start with

the same model of the aggregated posterior as the first RBM has.

And then, if we change the weights we can only make things better.

So, that's an explanation of what's happening when we stack up RBMs.

Once we've learned a stack of Boltzmann machines and combined them together to make a deep belief net, we can then actually fine-tune the whole

composite model using a variation of the wake-sleep algorithm.

So we first learn many layers of features by stacking up RBMs.

And then we want to fine-tune both the bottom-up recognition weights and the

top-down generative weights to get a better generative model and we can do

this by using three different learning stages.

First, we do a stochastic bottom-up pass, and we adjust the top down generative

weights of the lower layers to be good at reconstructing the feature activities in

the layer below. That's just as in the standard wake-sleep

algorithm. Then, in the top-level RBM, we go backwards and forwards a few times,

sampling the hiddens of that RBM, and the visibles of that RBM, and the hiddens of

the RBM, and so on. So that's just like the learning

algorithm for RBMs. And having done a few iterations of that,

we do contrastive divergence learning. That is, we update the weights of the RBM

using the difference between the correlations when activity first got to

that RBM and the correlations after a few iterations in that RBM.

We take that difference and use it to update the weights.

And then, in the third stage, we take the visible units of that top-level RBM, that is, its lower-layer units, and starting there, we do a top-down

stochastic pass, using the directed lower connections, which are just a sigmoid

belief net. Then, having generated some data from

that sigmoid belief net, we adjust the bottom-up weights to be good at

reconstructing the feature activities in the layer above.

So that's just the sleep phase of the wake-sleep algorithm.
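Here's a rough sketch of one step of those three stages for a small deep belief net. It is my own reading of the procedure, not code from the lecture; the layer sizes, learning rate, number of Gibbs iterations, and the omission of biases are all illustrative assumptions.

```python
# R1, R2 are bottom-up recognition weights, G1, G2 are top-down generative weights,
# and W_top is the top-level RBM.
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h1, n_h2, n_h3, lr = 20, 15, 15, 30, 0.01

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

R1 = 0.01 * rng.standard_normal((n_v, n_h1));  G1 = R1.T.copy()
R2 = 0.01 * rng.standard_normal((n_h1, n_h2)); G2 = R2.T.copy()
W_top = 0.01 * rng.standard_normal((n_h2, n_h3))

def contrastive_wake_sleep_step(v, gibbs_iters=3):
    global R1, R2, G1, G2, W_top
    # Stage 1 (wake): stochastic bottom-up pass; make the generative weights good
    # at reconstructing the feature activities in the layer below.
    h1 = sample(sigmoid(v @ R1))
    h2 = sample(sigmoid(h1 @ R2))
    G1 += lr * np.outer(h1, v - sigmoid(h1 @ G1))
    G2 += lr * np.outer(h2, h1 - sigmoid(h2 @ G2))
    # Stage 2: a few backwards-and-forwards iterations in the top-level RBM, then a
    # contrastive divergence update from the difference between the correlations when
    # activity first arrives and the correlations a few iterations later.
    h3_first = sample(sigmoid(h2 @ W_top))
    h2_k, h3_k = h2, h3_first
    for _ in range(gibbs_iters):
        h2_k = sample(sigmoid(h3_k @ W_top.T))
        h3_k = sample(sigmoid(h2_k @ W_top))
    W_top += lr * (np.outer(h2, h3_first) - np.outer(h2_k, h3_k))
    # Stage 3 (sleep): from the visible units of the top-level RBM, do a top-down
    # stochastic pass through the directed connections, then make the recognition
    # weights good at reconstructing the feature activities in the layer above.
    h1_gen = sample(sigmoid(h2_k @ G2))
    v_gen = sample(sigmoid(h1_gen @ G1))
    R2 += lr * np.outer(h1_gen, h2_k - sigmoid(h1_gen @ R2))
    R1 += lr * np.outer(v_gen, h1_gen - sigmoid(v_gen @ R1))

contrastive_wake_sleep_step(sample(0.3 * np.ones(n_v)))
```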

The difference from the standard wake-sleep algorithm is that that

top-level RBM acts as a much better prior over the top layers, than just a layer of

units which are assumed to be independent, which is what you get with a

sigmoid belief net. Also, rather than generating data by

sampling from the prior, what we're actually doing is looking at a training

case, going up to the top-level RBM and just running a few iterations before we

generate data. So now we're going to look at an example

where we first learn some RBMs, stacking them up,

and then we do contrastive wake-sleep to fine-tune it,

and then we look to see how good it is as a generative model, and also how good it is at recognizing things. So first of all, we're going to use 500

binary hidden units to learn to model all 10 digit classes in images of 28 by 28

pixels. That RBM is learned without knowing what the labels are, so it's unsupervised learning.

We're going to take the patterns of activity in those 500 hidden units that

they have when they're looking at data. We're going to treat those patterns of

activity as data and we're going to learn another RBM that also has 500 units,

and those two are learned without knowing what the labels are.

Once we've done that we'll actually tell it the labels.

So the first two hidden layers are learned without labels,

and then, we add a big top layer and we give it the 10 labels.

And you can think of it as concatenating those 10 labels with the 500 units that represent features, except that the 10 labels are really one softmax unit. Then we train that top-level RBM to model the concatenation of the softmax unit for the 10 labels with the 500 feature activities that were produced by the two layers below.
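Summarized as a sketch: the two 500-unit hidden layers and the 10-way softmax label unit come from the lecture, but the size of the big top layer is not stated here, so the 2000 below is an assumption.

```python
architecture = {
    "visible": 28 * 28,            # 784 binary pixel units
    "hidden_1": 500,               # first RBM, learned without labels
    "hidden_2": 500,               # second RBM, learned on hidden_1's activity patterns
    "top_rbm_visible": 500 + 10,   # 500 features concatenated with one 10-way softmax unit
    "top_rbm_hidden": 2000,        # assumed size of the big top layer
}
```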

Once we've trained the top-level RBM, we can then fine-tune the whole system by

using contrastive wake-sleep. And then we'll have a very good

generative model and that's the model that I showed you in the intro video.

So if you go back, and you find the introduction video for this course,

you'll see what happens when we run that model.

You'll see how good it is at recognition and you'll also see that it's very good

at generation. In that introductory video, I promised

you I would eventually explain how it worked,

and I think you've now seen enough to know what's going on when this model is

learned.