0:00

In addition to L2 regularization, another very powerful regularization technique is called "dropout." Let's see how that works.

Let's say you train a neural network like the one on the left and there's overfitting.

Here's what you do with dropout. Let me make a copy of the neural network. With dropout, what we're going to do is go through each of the layers of the network and set some probability of eliminating a node in the network.

Let's say that for each of these layers, we're going to, for each node, toss a coin and have a 0.5 chance of keeping each node and a 0.5 chance of removing each node.

So, after the coin tosses, maybe we'll decide to eliminate those nodes. Then what you do is actually remove all the outgoing links from those nodes as well, so you end up with a much smaller, really much diminished network.

And then you do backpropagation training on one example using this much diminished network. Then on different examples, you would toss a set of coins again, keep a different set of nodes, and drop out or eliminate different nodes. And so for each training example, you would train it using one of these diminished networks.

So, maybe it seems like a slightly crazy technique, just going around knocking out nodes at random, but this actually works. And you can imagine that because you're training a much smaller network on each example, this may give you a sense of why you end up able to regularize the network: because these much smaller networks are being trained.

Let's look at how you implement dropout. There are a few ways of implementing it. I'm going to show you the most common one, which is a technique called inverted dropout. For the sake of completeness, let's say we want to illustrate this with layer l = 3. So, in the code I'm going to write, there will be a bunch of 3s here; I'm just illustrating how to implement dropout in a single layer.

So, what we are going to do is set a vector d, and d3 is going to be the dropout vector for layer 3. We set d3 = np.random.rand(a3.shape[0], a3.shape[1]), so it's the same shape as a3, and then check whether this is less than some number, which I'm going to call keep_prob. keep_prob is a number; it was 0.5 in the previous example, and maybe now I'll use 0.8 in this example. It is the probability that a given hidden unit will be kept. So if keep_prob = 0.8, then there's a 0.2 chance of eliminating any hidden unit.

So, what this does is generate a random matrix, and this works as well if you have vectorized. So d3 will be a matrix where, for each example and for each hidden unit, there's a 0.8 chance that the corresponding entry of d3 will be one, and a 20% chance it will be zero. That is, the random number being less than 0.8 means it has a 0.8 chance of being one, or true, and a 20% or 0.2 chance of being false, or zero.

Then what you are going to do is take your activations from the third layer; let me just call them a3 in this example. So a3 has the activations you computed, and you can set a3 equal to the old a3 times d3, where that's an element-wise multiplication. You can also write this as a3 *= d3. What this does is, for every element of d3 that's equal to zero (and there was a 20% chance of each element being zero), the multiply operation ends up zeroing out the corresponding element of a3.

If you do this in Python, technically d3 will be a boolean array where the values are true and false rather than one and zero. But the multiply operation works and will interpret the true and false values as one and zero; if you try this yourself in Python, you'll see.

Then finally, we're going to take a3 and scale it up by dividing by 0.8, or really dividing by our keep_prob parameter.

So, let me explain what this final step is doing. Let's say for the sake of argument that you have 50 units, or 50 neurons, in the third hidden layer. So maybe a3 is 50 by 1 dimensional, or if you vectorize, maybe it's 50 by m dimensional. If you have an 80% chance of keeping each one and a 20% chance of eliminating it, this means that on average, you end up with 10 units shut off, or 10 units zeroed out.

And so now, if you look at the value of z^4, z^4 = w^4 a^3 + b^4. On expectation, a^3 will be reduced by 20%, by which I mean that 20% of the elements of a3 will be zeroed out. So, in order not to reduce the expected value of z^4, what you need to do is take a3 and divide it by 0.8, because this will correct, or just bump back up by roughly the 20% that you need. So it doesn't change the expected value of a3.

And so this line here is what's called the inverted dropout technique. Its effect is that no matter what you set keep_prob to, whether it's 0.8 or 0.9 or even 1 (if it's set to 1, then there's no dropout, because it's keeping everything), or 0.5 or whatever, dividing by keep_prob ensures that the expected value of a3 remains the same.

And it turns out that at test time, when you're trying to evaluate a neural network, which we'll talk about on the next slide, this inverted dropout technique makes test time easier, because you have less of a scaling problem.

By far the most common implementation of dropout today, as far as I know, is inverted dropout, and I recommend you just implement this. There were some early iterations of dropout that missed this divide-by-keep_prob line, and so at test time the averaging becomes more complicated. But again, people tend not to use those other versions.

So, what you do is use the d vector, and you'll notice that for different training examples, you zero out different hidden units. And in fact, if you make multiple passes through the same training set, then on different passes through the training set, you should randomly zero out different hidden units.

So, it's not that for one example you should keep zeroing out the same hidden units; it's that on iteration one of gradient descent, you might zero out some hidden units, and on the second iteration of gradient descent, when you go through the training set the second time, maybe you'll zero out a different pattern of hidden units.

And the vector d, or d3 for the third layer, is used to decide what to zero out, both in forward prop as well as in back prop. We are just showing forward prop here.

Now, having trained the algorithm, at test time, here's what you would do. At test time, you're given some example x for which you want to make a prediction. Using our standard notation, I'm going to use a^0, the activations of the zeroth layer, to denote the test example x. What we're going to do is not use dropout at test time. In particular:

z^1 = w^1 a^0 + b^1

a^1 = g^1(z^1)

z^2 = w^2 a^1 + b^2

a^2 = ...

And so on, until you get to the last layer and you make a prediction ŷ.
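The test-time forward pass above can be sketched as follows. The two-layer sizes, the small random weights, and the ReLU/sigmoid activation choices are illustrative assumptions, not taken from the lecture; the point is only that there is no dropout mask and no division by keep_prob anywhere.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(2)
w1, b1 = np.random.randn(4, 3) * 0.01, np.zeros((4, 1))  # layer 1: 3 -> 4
w2, b2 = np.random.randn(1, 4) * 0.01, np.zeros((1, 1))  # layer 2: 4 -> 1

a0 = np.random.randn(3, 5)        # 5 test examples, 3 features each

z1 = w1 @ a0 + b1                 # z^1 = w^1 a^0 + b^1
a1 = relu(z1)                     # a^1 = g^1(z^1)
z2 = w2 @ a1 + b2                 # z^2 = w^2 a^1 + b^2
y_hat = sigmoid(z2)               # prediction y-hat, no dropout anywhere
```

Because training already divided by keep_prob, this plain forward pass needs no extra scaling step.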

But notice that at test time you're not using dropout explicitly; you're not tossing coins at random, not flipping coins to decide which hidden units to eliminate. And that's because when you are making predictions at test time, you don't really want your output to be random; implementing dropout at test time would just add noise to your predictions. In theory, one thing you could do is run the prediction process many times with different hidden units randomly dropped out and average across them, but that's computationally inefficient and will give you roughly the same result, very similar results, to this procedure as well.

And just to mention, with the inverted dropout technique, remember the step on the previous slide where we divided by keep_prob? The effect of that was to ensure that even though you don't implement dropout at test time, and so there's no scaling, the expected value of these activations doesn't change. So you don't need to add an extra funny scaling parameter at test time; that's different than at training time.

So that's dropout. And when you implement this in this week's programming exercise, you'll gain more firsthand experience with it as well. But why does it really work? What I want to do in the next video is give you some better intuition about what dropout is really doing. Let's go on to the next video.
