0:00

In this video, I will introduce Restricted Boltzmann Machines.

These have a much simplified architecture in which there are no connections between

hidden units. This makes it very easy to get the

equilibrium distribution of the hidden units if the visible units are given.

That is, once you've clamped the datavector on the visible units, the equilibrium distribution of the hidden units can be computed exactly in one step

because they're all independent of one another, given the states of the visible

units. The proper Boltzmann machine learning

algorithm is still slow for a restricted Boltzmann machine.

But in 1998, I discovered a very surprising shortcut that leads to the

first efficient learning algorithm for Boltzmann machines.

Even though this algorithm has theoretical problems, it works quite well in practice.

And it led to a revival of interest in Boltzmann machine learning.

In a restricted Boltzmann machine, we restrict the connectivity of the network

in order to make both inference and learning easier.

So, it only has one layer of hidden units and there are no connections between the hidden units. There are also no connections between the visible units. So the architecture looks like that; it's what computer scientists call a bipartite graph: there are two pieces, and within each piece there are no connections.
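For reference, the video doesn't write it out, but the energy surface discussed later is defined by the standard RBM energy function, with a_i and b_j the visible and hidden biases (this notation is my addition):

```latex
E(\mathbf{v}, \mathbf{h}) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i h_j \, w_{ij}
```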

1:37

That means with a datavector clamped, we can quickly compute the expected value of v_i h_j, because we can compute the exact probability with which each hidden unit j will turn on, and that is independent of all the other units in the hidden layer.
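In symbols (my notation, not shown in the video), conditional independence of the hidden units given the visible units means the distribution factorizes, so the expectation with the data clamped is exact:

```latex
p(\mathbf{h} \mid \mathbf{v}) = \prod_j p(h_j \mid \mathbf{v}),
\qquad
\langle v_i h_j \rangle_{\text{data}} = v_i \, p(h_j = 1 \mid \mathbf{v})
```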

1:55

The probability that hidden unit j will turn on is just the logistic function of the input it gets from the visible units, and it's quite independent of what the other hidden units are doing. So we can compute all of those probabilities in parallel, and that's a tremendous win.
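As a minimal sketch in NumPy (the variable names are my own), computing all the hidden probabilities in parallel for a clamped visible vector might look like this:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_probs(v, W, b_hid):
    """p(h_j = 1 | v) for every hidden unit j, computed in parallel.

    v     : binary visible vector, shape (num_visible,)
    W     : weight matrix, shape (num_visible, num_hidden)
    b_hid : hidden biases, shape (num_hidden,)
    """
    return sigmoid(b_hid + v @ W)
```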

If you want to make a good model of a set of binary vectors, then the right algorithm to use for a restricted Boltzmann machine is one introduced by Tieleman in 2008 that's based on earlier work by Neal. In the positive phase, you clamp the datavector on the visible units. You then compute the exact value of the expectation <v_i h_j> for every pair of a visible and a hidden unit. You can do that because v_i is fixed and you can compute the probability of h_j exactly.
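Continuing the sketch with the same assumed helpers, the positive-phase statistics are just the outer product of the clamped visible vector with the exact hidden probabilities:

```python
def positive_statistics(v, W, b_hid):
    """Exact <v_i h_j> with the datavector clamped on the visible units."""
    p_h = hidden_probs(v, W, b_hid)   # exact, because hidden units are independent given v
    return np.outer(v, p_h)           # shape (num_visible, num_hidden)
```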

2:58

For the negative phase, you keep a set of fantasy particles, that is, global configurations. And then you update each fantasy particle a few times using alternating parallel updates. So, after each weight update, you update the fantasy particles a little bit, and that should bring them back to close to equilibrium.

3:21

And then, for every connected pair of units, you average v_i h_j over all the fantasy particles, and that gives you your negative statistics.
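Putting those pieces together, here is a rough sketch of an update with persistent fantasy particles, reusing the helpers above; this is my simplified reading of the procedure, not Tieleman's actual code:

```python
def pcd_update(data_v, fantasies, W, b_vis, b_hid, lr=0.01, n_gibbs=1, rng=np.random):
    """One weight update with persistent fantasy particles (a sketch).

    data_v    : one binary training vector, shape (num_visible,)
    fantasies : persistent fantasy particles, shape (num_fantasy, num_visible)
    """
    # Positive phase: exact statistics with the datavector clamped.
    pos = positive_statistics(data_v, W, b_hid)

    # Negative phase: briefly update each fantasy particle with alternating parallel updates.
    for _ in range(n_gibbs):
        p_h = sigmoid(b_hid + fantasies @ W)
        h = (rng.rand(*p_h.shape) < p_h).astype(float)
        p_v = sigmoid(b_vis + h @ W.T)
        fantasies = (rng.rand(*p_v.shape) < p_v).astype(float)

    # Average <v_i h_j> over all fantasy particles for the negative statistics.
    p_h = sigmoid(b_hid + fantasies @ W)
    neg = fantasies.T @ p_h / fantasies.shape[0]

    W += lr * (pos - neg)
    return W, fantasies
```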

This algorithm actually works very well, and allows RBMs to build good density models of sets of binary vectors. Now I'm going to go on to a learning algorithm that is not as good at building density models but is much faster.

So, I'm going to start with a picture of an inefficient learning algorithm for restricted Boltzmann machines. We're going to start by clamping a datavector on the visible units, and we're going to call that time t = 0. So, we're going to use times now, not to denote weight updates, but to denote steps in a Markov chain. Given that visible vector, we now update the hidden units. So, we choose binary states for the hidden units, and we measure the expected value <v_i h_j> for all pairs of visible and hidden units that are connected. I'll call that <v_i h_j>^0 to indicate that it's measured at time zero, with the hidden units being determined by the visible units. And, of course, we can update all the

hidden units in parallel. We then use the hidden vector to update all the visible units in parallel, and again we update all the hidden units in parallel. So, the visible vector at time t = 1 we'll call a reconstruction, or a one-step reconstruction. And we can keep going with the alternating chain that way, updating visible units and then hidden units, each set being updated in parallel. And after we've gone for a long time,

5:10

we'll get to some state of the visible units, which I'll call t = infinity to indicate that it needs to be a long time, and the system will be at thermal equilibrium. Now we can measure the correlation of v_i and h_j after the chain has run for a long time, and I'll call that <v_i h_j>^infinity. And the visible state we have after a long time, I'll call a fantasy. So now, the learning rule is simply: we change w_ij by the learning rate times the difference between <v_i h_j> at time zero and <v_i h_j> at time infinity.
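Written out (my transcription of the rule just stated, with epsilon the learning rate):

```latex
\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle^{0} - \langle v_i h_j \rangle^{\infty} \right)
```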

And, of course, the problem with this algorithm is that we have to run this chain for a long time before it reaches thermal equilibrium. And if we don't run it for long enough, the learning may go wrong. In fact, that last statement is very misleading. It turns out that even if we only run the chain for a short time, the learning still works.

6:13

So, here's the very surprising shortcut. You just run the chain up, down, and up again. So, from the data you generate a hidden state; from that, you generate a reconstruction; and from that, you generate another hidden state. And you measure the statistics once you've done that. So, instead of using the statistics measured at equilibrium, we're using the statistics measured after doing one full update of the Markov chain. The learning rule is the same as before, except it's much quicker to compute, and this is clearly not doing maximum likelihood learning, because the term we are using for the negative statistics

7:00

is wrong. But the learning, nevertheless, works

quite well. Next week, we'll understand a bit more

about why it works well. But for now, we'll just see that it does.
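Here is a minimal sketch of that one-step shortcut (CD-1), reusing the helpers from the earlier snippets; sampling a binary hidden state on the way up is my assumption, since the video doesn't spell out every detail:

```python
def cd1_update(v0, W, b_vis, b_hid, lr=0.01, rng=np.random):
    """One CD-1 weight update: up, down, and up again (a sketch)."""
    # Up: hidden probabilities and a sampled binary hidden state from the data.
    p_h0 = hidden_probs(v0, W, b_hid)
    h0 = (rng.rand(p_h0.size) < p_h0).astype(float)

    # Down: a one-step reconstruction of the visible units.
    p_v1 = sigmoid(b_vis + W @ h0)
    v1 = (rng.rand(p_v1.size) < p_v1).astype(float)

    # Up again: hidden probabilities for the reconstruction.
    p_h1 = hidden_probs(v1, W, b_hid)

    # Same learning rule as before, but with one-step statistics instead of equilibrium ones.
    W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    return W
```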

7:14

So, the obvious question is: why does the shortcut work at all? And here's the reasoning. If we start the chain at the data, the Markov chain will wander away from the data and towards its equilibrium distribution. That is, towards things that its initial weights like more than the data. We can see what direction it's wandering

in after only a few steps. And if we know the initial weights aren't

very good, it's a waste of time to go all the way to equilibrium.

We know how to change the weights to stop it wandering away from the data without going all the way to equilibrium. All we need to do is lower the probability of the reconstructions, or confabulations as a psychologist would call them, that it produces after one full step, and then raise the probability of the data.

8:31

Here's a data point on the energy surface, and by data point I mean both the visible vector and the particular hidden vector that we got by stochastically updating the hidden units. So, that hidden vector is a function of

what the data point is. So, starting at that data point, we run

the Markov chain for one full step to get a new visible vector and the hidden vector

that goes with it. So, a reconstruction of the data point

plus the hidden vector that goes with that reconstruction.

9:05

We then change the weights to pull the energy down at the data point, and to pull the energy up at the reconstruction. And the effect of that would be to make

the surface look like this. And you'll notice we're beginning to

construct an energy minimum at the data. You'll also notice that far away from the

data, things have stayed pretty much as they were before.

9:51

Eventually, though, this procedure can fail: far away from the data, the energy surface can develop low energy holes, regions the model likes even though no data goes there. These low energy holes cause the normalization term to be big, and we can't sense them if we use the shortcut. If we use persistent particles, where we remember their states and, after each weight update, we update them a few more times, then they would eventually find these holes. They'd move into the holes, and the learning would cause the holes to fill up.

10:17

A good compromise between speed and correctness is to start with small weights and to use CD1, that is, contrastive divergence with one full step to get the negative data. Once the weights have grown a bit, the Markov chain is mixing more slowly, and now we can use CD3.
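A rough sketch of that compromise, generalizing the CD-1 snippet above to CD-k and switching from CD1 to CD3 partway through training; the exact switch-over point is my own illustrative choice, since the video only says to switch once the weights have grown a bit:

```python
def cdk_update(v0, W, b_vis, b_hid, k=1, lr=0.01, rng=np.random):
    """CD-k: run k full up-down steps before collecting the negative statistics (a sketch)."""
    p_h0 = hidden_probs(v0, W, b_hid)
    v, p_h = v0, p_h0
    for _ in range(k):
        h = (rng.rand(p_h.size) < p_h).astype(float)
        p_v = sigmoid(b_vis + W @ h)
        v = (rng.rand(p_v.size) < p_v).astype(float)
        p_h = hidden_probs(v, W, b_hid)
    W += lr * (np.outer(v0, p_h0) - np.outer(v, p_h))
    return W

# Tiny self-contained demo on random binary data (shapes and hyperparameters are arbitrary).
rng = np.random.RandomState(0)
num_visible, num_hidden = 6, 4
W = 0.01 * rng.randn(num_visible, num_hidden)    # start with small weights
b_vis = np.zeros(num_visible)
b_hid = np.zeros(num_hidden)
training_data = (rng.rand(20, num_visible) < 0.5).astype(float)

num_epochs = 10
for epoch in range(num_epochs):
    k = 1 if epoch < num_epochs // 2 else 3      # CD-1 early, CD-3 later (assumed schedule)
    for v0 in training_data:
        W = cdk_update(v0, W, b_vis, b_hid, k=k, rng=rng)
```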