0:00

Dropout does this seemingly crazy thing of randomly knocking out units in your network.

Why does it work so well as a regularizer?

Let's gain some better intuition.

In the previous video, I gave the intuition that dropout randomly knocks out units in your network.

So it's as if on every iteration you're working with a smaller neural network, and using a smaller neural network seems like it should have a regularizing effect.

Here's a second intuition, which is to look at it from the perspective of a single unit. Let's say this one.

Now, for this unit to do its job, it has four inputs, and it needs to generate some meaningful output.

Now, with dropout, the inputs can get randomly eliminated.

Sometimes those two units will get eliminated, and sometimes a different unit will get eliminated.

So what this means is that this unit, which I'm circling in purple, can't rely on any one feature, because any one feature could go away at random, or any one of its own inputs could go away at random.

So this unit would be reluctant to put all of its bets on, say, just this one input, right?

We'd be reluctant to put too much weight on any one input, because it could go away at any time.

So this unit will be more motivated to spread out its weights and give a little bit of weight to each of its four inputs.

And by spreading out the weights, this will tend to have the effect of shrinking the squared norm of the weights.

And so, similar to what we saw with L2 regularization, the effect of implementing dropout is that it shrinks the weights and does something similar to L2 regularization that helps prevent overfitting.

But it turns out that dropout can formally be shown to be an adaptive form of L2 regularization, where the L2 penalty on different weights is different, depending on the size of the activations being multiplied by those weights.

But to summarize, it is possible to show that dropout has a similar effect to L2 regularization; it's just that the L2 regularization applied to different weights can be a little bit different, and so it's even more adaptive to the scale of the different inputs.
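As a reminder of the mechanics from the previous video, the inverted-dropout forward step for one layer can be sketched roughly like this (the function name and the use of numpy here are illustrative, not from the lecture):

```python
import numpy as np

def dropout_forward(a, keep_prob, rng):
    # Keep each unit independently with probability keep_prob.
    d = rng.random(a.shape) < keep_prob
    a = a * d            # knock out the dropped units
    a = a / keep_prob    # inverted dropout: rescale so the expected activation is unchanged
    return a, d

rng = np.random.default_rng(0)
a = np.ones((4, 1))      # activations of a layer with four units
a_drop, mask = dropout_forward(a, keep_prob=0.5, rng=rng)
# Each surviving activation is scaled up to 1 / 0.5 = 2.0; dropped ones are 0.
```

The rescaling by keep_prob is what keeps the expected squared norm of the layer's output comparable with and without dropout.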

One more detail for when you're implementing dropout.

Here's a network where you have three input features, then seven hidden units here, then seven, three, two, and one.

So one of the parameters we had to choose was keep_prob, which is the chance of keeping a unit in each layer.

It is also feasible to vary keep_prob by layer.

So for the first layer, your weight matrix W1 will be three by seven, your second weight matrix W2 will be seven by seven, W3 will be seven by three, and so on.

And so W2 is actually the biggest weight matrix, because the largest set of parameters is in W2, which is seven by seven.
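A quick sanity check of those counts, using the shapes as stated here:

```python
# Parameter counts (ignoring biases) for the weight shapes mentioned above.
shapes = {"W1": (3, 7), "W2": (7, 7), "W3": (7, 3)}
counts = {name: rows * cols for name, (rows, cols) in shapes.items()}
# W2 has 7 * 7 = 49 parameters, versus 21 each for W1 and W3.
```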

So, to reduce overfitting of that matrix, maybe for this layer, I guess this is layer two, you might have a keep_prob that's relatively low, say 0.5,

whereas for other layers, where you might worry less about overfitting, you could have a higher keep_prob, maybe 0.7.

And for layers where we don't worry about overfitting at all, you can have a keep_prob of 1.0.

For clarity, these are the numbers I'm drawing in the purple boxes; these would be the different keep_probs for different layers.

Notice that a keep_prob of 1.0 means you're keeping every unit, so you're really not using dropout for that layer.

But for layers where you're more worried about overfitting, really the layers with a lot of parameters, you can set keep_prob to be smaller to apply a more powerful form of dropout.

It's kind of like cranking up the regularization parameter lambda of L2 regularization, where you try to regularize some layers more than others.
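As an illustrative sketch, the layer sizes below follow this example, while the keep_prob values and the tiny forward pass are hypothetical. Varying keep_prob by layer might look like:

```python
import numpy as np

layer_sizes = [3, 7, 7, 3, 2, 1]                  # input, four hidden layers, output
keep_probs  = [1.0, 0.7, 0.5, 0.7, 1.0, 1.0]      # lowest where the weight matrix is biggest

rng = np.random.default_rng(1)
a = rng.random((layer_sizes[0], 1))               # a made-up input example

for l in range(1, len(layer_sizes)):
    # Convention assumed here: W[l] maps layer l-1 activations to layer l activations.
    W = rng.standard_normal((layer_sizes[l], layer_sizes[l - 1])) * 0.1
    a = np.maximum(0, W @ a)                      # ReLU forward step
    if keep_probs[l] < 1.0:                       # apply dropout only where keep_prob < 1
        d = rng.random(a.shape) < keep_probs[l]
        a = a * d / keep_probs[l]                 # inverted dropout
```

A keep_prob of 1.0 for a layer simply skips the dropout branch, which matches not using dropout there at all.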

And technically, you can also apply dropout to the input layer, where you have some chance of just zeroing out one or more of the input features, although in practice you usually don't do that very often.

And so a keep_prob of 1.0 is quite common for the input layer.

You could also use a very high value, maybe 0.9, but it's much less likely that you'd want to eliminate half of the input features.

So, if you apply dropout at all to the input layer, keep_prob will usually be a number close to one.

So, just to summarize: if you're more worried about some layers overfitting than others, you can set a lower keep_prob for those layers than for others.

The downside is that this gives you even more hyperparameters to search over using cross-validation.

One other alternative might be to have some layers where you apply dropout and some layers where you don't, and then have just one hyperparameter, which is the keep_prob for the layers where you do apply dropout.

And before we wrap up, just a couple of implementational tips.

Many of the first successful implementations of dropout were in computer vision. In computer vision, the input size is so big, inputting all these pixels, that you almost never have enough data.

And so dropout is very frequently used in computer vision; there are some computer vision researchers who pretty much always use it, almost as a default.

But really, the thing to remember is that dropout is a regularization technique; it helps prevent overfitting. And so, unless my algorithm is overfitting, I wouldn't actually bother to use dropout.

So it tends to be used somewhat less often in other application areas. It's just that with computer vision, you usually don't have enough data, so you're almost always overfitting, which is why some computer vision researchers swear by dropout. But I think their intuition doesn't always generalize to other disciplines.

One big downside of dropout is that the cost function J is no longer well-defined, since on every iteration you are randomly killing off a bunch of nodes.

And so, if you're double-checking the performance of gradient descent, it's actually harder to verify that you have a well-defined cost function J that is going downhill on every iteration, because the cost function J that you're optimizing is less well-defined, or at least harder to calculate.

So you lose this debugging tool where you plot a graph like this.

So what I usually do is turn off dropout, that is, set keep_prob equal to one, run my code, and make sure that J is monotonically decreasing; then I turn on dropout and hope that I didn't introduce any bugs into my code while implementing dropout.

Because you'll need other ways, I guess, besides plotting these figures, to make sure that your code is working with gradient descent, even with dropout.
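A sketch of that debugging pattern, with a hypothetical cost_after_step standing in for a real training iteration (the toy cost and the noise term are illustrative scaffolding, not part of the lecture):

```python
import numpy as np

def cost_after_step(step, keep_prob, rng):
    # Stand-in for one training iteration that returns the cost J.
    j = 1.0 / (1 + step)              # a toy cost that decreases monotonically
    if keep_prob < 1.0:               # with dropout on, the measured cost is noisy
        j += 0.1 * rng.random()
    return j

rng = np.random.default_rng(0)

# Step 1: dropout off (keep_prob = 1.0); J should go downhill on every iteration.
costs = [cost_after_step(t, keep_prob=1.0, rng=rng) for t in range(100)]
monotone = all(costs[t + 1] < costs[t] for t in range(len(costs) - 1))

# Step 2: only once that check passes, turn dropout back on (keep_prob < 1.0).
```

With keep_prob below one, the same check can fail spuriously, which is exactly why the monotonicity test is only meaningful with dropout turned off.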

So with that, there are still a few more regularization techniques that are worth knowing about. Let's talk about a few more such techniques in the next video.
