0:00

In this video, I'm going to talk about the mixture of experts model that was

developed in the early 1990s. The idea of this model is to train a

number of neural nets, each of which specializes in a different part of the

data. That is, we assume we have a data set

which comes from a number of different regimes,

and we train a system in which one neural net will specialize in each regime, and a managing neural net will look at the input data and decide which specialist to give it to. This kind of system doesn't make very efficient use of the data, because the data is fractionated over all these different experts. And so, with small data sets, it can't be

expected to do very well. But as data sets get bigger, this kind of

system may well come into its own, because it can make very good use of extremely

large data sets. In boosting, the weights on the models are

not all equal. But after we finish training, each model has the same weight for every test case; we don't make the weights on the

individual models depend on which particular case we're dealing with.

In mixture of experts, we do. So the idea is that we can look at the

input data for a particular case during both training and testing to help us

decide which model we can rely on. During training this will allow models to

specialize on a subset of the cases. They then will not learn on cases for

which they're not picked. So they can ignore stuff they're not good

at modeling. This will lead to individual models that

are very good at some things and very bad at other things.

2:18

The most local model of all is nearest neighbours. To fit it, you just store the training cases. So that's really simple. And then, if you have to predict Y from X,

you simply find the stored value of X that's closest to the test value of X,

then you predict the value of Y that's the same as for the stored value.

The result of that is that the curve relating the input to the output consists

of lots of horizontal lines connected by cliffs.

It would clearly make more sense to smooth things out a bit.
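A minimal sketch of this very local model, assuming one-dimensional inputs (the function name is mine):

```python
import numpy as np

def nearest_neighbor_predict(train_x, train_y, x):
    # Find the stored input closest to the test input ...
    i = int(np.argmin(np.abs(np.asarray(train_x) - x)))
    # ... and predict the stored target for that case.
    return train_y[i]
```

Sweeping x across its range traces out exactly those horizontal segments joined by cliffs, since the prediction only changes when the nearest stored case changes.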

At the other extreme, we have fully global models, like fitting one polynomial to all

the data. They're much harder to fit to data, and

they may also be unstable. That is, small changes in the data may

cause big changes in the model you fit. That's because each parameter depends on

all the data. In between these two ends of the spectrum, we have multiple local models that are of intermediate complexity.

3:16

This is good if the data set contains several different regimes and those

different regimes have different input/output relationships.

In financial data, for example, the state of the economy has a big effect on the mapping between inputs and outputs, and you might want to have different models for different states of the economy. But you might not know in advance how to decide what constitutes different states of the economy; you're going to have to learn that too.

4:09

But we don't want to cluster the data based on the similarity of input vectors.

All we're interested in is the similarity of input-output mappings.

So if you look at the case on the right, there are four data points that are nicely fitted by the red parabola, and another four data points that are nicely fitted by the green parabola. If you partition the data based on the input-output mapping, that is, based on the idea that a parabola will fit the data nicely, then you partition the data where the brown line is.

4:45

If, however, you partitioned the data by just clustering the inputs, you'd partition where the blue line is; and then, if you looked to the left of that blue line, you'd be stuck with a subset of data that can't be modeled nicely by a simple model.

So I'm going to explain an error function that encourages models to cooperate.

And then I'm going to explain an error function that encourages models to

specialize. And I'm going to try to give you a good

intuition for why these two different functions have these very different

effects. So if you want to encourage cooperation,

what you should do is compare the average of the predictors with the target, and train all the predictors together to reduce the difference between the target and their average. So, using angle brackets for expectation, the error is the squared difference between the target and the average of what all the predictors predict. That will overfit badly.
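This cooperative error, and its gradient with respect to any one predictor's output, can be sketched as follows (names are mine):

```python
import numpy as np

def cooperative_error(target, outputs):
    # Error of the ensemble: squared difference between the target
    # and the *average* of all the predictors' outputs.
    return (target - np.mean(outputs)) ** 2

def grad_wrt_output(target, outputs):
    # dE/dy_i = -2 * (t - <y>) / N: identical for every predictor,
    # so each model is trained to fix the ensemble's residual, even
    # if that moves its own output away from the target.
    return -2.0 * (target - np.mean(outputs)) / len(outputs)
```

For example, with target 0 and outputs [1, -3], the average is -1, so gradient descent raises every output, pushing the first model, which is already above the target, even further away from it.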

5:59

So, if you're averaging models during training, and training so that the average

works nicely, you have to consider cases like this.

On the right, we have the average of all the models except for model i. So that's what everybody else is saying when their votes are averaged together. On the left, we have the output of model i. Now, if we'd like the overall average to be closer to the target, what do we have to do to the output of the i-th model? We have to move it away from the target: that will take the overall average towards the target. You can see that what's happening is that model i is learning to compensate for the errors made by all the other models. But do we really want to move model i in the wrong direction? Intuitively, it seems better to move model i towards the target. So here is an error function that encourages specialization, and it's not very different.

To encourage specialization, we compare the output of each model with the target

separately. We also need to use a manager to determine

the weight we put on each of these models, which we can think of as the probability

of picking each model, if we have to pick one.

7:22

So now, our error is the expectation, over all the different models, of the squared error made by that model times the probability of picking that model, where the manager, or gating network, determines that probability by looking at the input for this particular case. What will happen if you try to minimize

this error is that most of the experts will end up ignoring most of the training cases. Each expert will only deal with a small subset of the training cases, and it will learn to do very well on that small subset.
8:28
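The specialization error just described can be sketched like this, with the manager's probabilities coming from a softmax (names are mine):

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    z = np.exp(np.asarray(logits) - np.max(logits))
    return z / z.sum()

def specialization_error(target, outputs, logits):
    # E = sum_i p_i * (t - y_i)^2: each expert is compared with the
    # target separately, weighted by the manager's probability p_i.
    p = softmax(logits)
    return float(np.sum(p * (target - np.asarray(outputs)) ** 2))
```

Note that the error falls when the manager shifts probability onto whichever expert is already accurate for this case, which is exactly what drives specialization.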

So we have an input. Our different experts will look at that

input. They all make their predictions based on

that input. In addition, we have a manager. The manager might have multiple layers, and its last layer is a softmax layer, so the manager outputs as many probabilities as there are experts. Using the outputs of the manager and the outputs of the experts, we can then compute the value of that error function. Now let's look at the derivatives of that error function. The outputs of the manager are determined by the inputs x_i to the softmax group in the final layer of the manager.

And then the error is determined by the outputs of the experts, and also the

probabilities output by the manager. If we differentiate that error with respect to the outputs of an expert, we get a signal for training that expert. The gradient with respect to the output of an expert is just the probability of picking that expert, times the difference between what that expert says and the target. So if the manager decides that there's a

very low probability of picking that expert for that particular training case,

the expert will get a very small gradient, and the parameters inside that expert

won't get disturbed by that training case. It'll be able to save its parameters for

modeling the training cases where the manager gives it a big probability.
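That gradient can be sketched directly; the factor of 2 comes from differentiating the square (names are mine):

```python
import numpy as np

def grad_wrt_expert_outputs(target, outputs, probs):
    # dE/dy_i = 2 * p_i * (y_i - t): an expert the manager gives a tiny
    # probability receives a tiny gradient, so its parameters are left
    # undisturbed by that training case.
    return 2.0 * np.asarray(probs) * (np.asarray(outputs) - target)
```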

10:03

We can differentiate with respect to the outputs of the gating network.

And actually, what we're going to do is differentiate with respect to the quantity that goes into the softmax. That's called the logit; that's x_i. And if we take the derivative with respect to x_i, we get the probability that that expert was picked, times the difference between the squared error made by that expert and the average, over all experts, of the squared error, using the weighting provided by the manager. So what that means is: if expert i makes a lower squared error than the average of the other experts, then we'll try to raise the probability of expert i. But if expert i makes a higher squared error than the other experts, we'll try to lower its probability.
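This logit derivative can be sketched as follows (names are mine):

```python
import numpy as np

def grad_wrt_logits(target, outputs, logits):
    # p_i = softmax(x)_i, computed stably by subtracting the max logit.
    z = np.exp(np.asarray(logits) - np.max(logits))
    p = z / z.sum()
    e = (target - np.asarray(outputs)) ** 2   # each expert's squared error
    avg = np.sum(p * e)                       # manager-weighted average error
    # dE/dx_i = p_i * (e_i - avg): gradient descent raises the logit (and
    # hence the probability) of experts whose error beats the average.
    return p * (e - avg)
```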

That's what causes specialization. Now there's actually a better cost

function. It's just more complicated.

It depends on mixture models, which I haven't explained in this course.

Again, those will be well explained in Andrew Ng's course.

I did explain, however, the interpretation of maximum likelihood, when you're doing

regression, as the idea that the network is actually making a Gaussian prediction.

That is the network outputs a particular value, say Y1 and we think of it as making

bets about what the target value might be that are a Gaussian distribution around Y1

with unit variance. So the red expert makes a Gaussian distribution of predictions around Y1, and the green expert makes a Gaussian distribution of predictions around Y2. The manager then decides probabilities for

the two experts and those probabilities are used to scale down the Gaussians.

Those probabilities have to add to one and they are called mixing proportions.

And so, once we scale down the Gaussians, we get a distribution that's no longer a Gaussian: it's the sum of the scaled-down red Gaussian and the scaled-down green Gaussian. And that's the predictive distribution from a mixture of experts. What we want to do now is maximize the log

probability of the target value under that black curve and remember the black curve

is just the sum of the red curve and the green curve.

12:45

And it's the sum, over all the experts, of the mixing proportion assigned to that expert by the manager or gating network, times e to the minus one half of the squared difference between the target and the output of that expert, scaled by the normalization term for a Gaussian with a variance of one.

And so our cost function is simply going to be the negative log of that probability

on the left. We're going to try and minimize the

negative log of that probability.
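The cost just described can be sketched as the negative log probability of the target under a mixture of unit-variance Gaussians centred on the experts' outputs (names are mine):

```python
import numpy as np

def mixture_nll(target, outputs, mixing_proportions):
    # p(t) = sum_i pi_i * N(t; y_i, 1), where 1/sqrt(2*pi) is the
    # normalization term for a Gaussian with a variance of one.
    y = np.asarray(outputs)
    pi = np.asarray(mixing_proportions)
    densities = np.exp(-0.5 * (target - y) ** 2) / np.sqrt(2.0 * np.pi)
    # The cost is the negative log of the mixture probability of the target.
    return float(-np.log(np.sum(pi * densities)))
```

Minimizing this cost with respect to both the experts' outputs and the manager's mixing proportions is the better-behaved training objective the lecture refers to.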