0:00

In the last video, you saw how very deep neural networks can have the problems of vanishing and exploding gradients. It turns out that a partial solution to this, which doesn't solve it entirely but helps a lot, is a better or more careful choice of the random initialization for your neural network. To understand this, let's start with the example of initializing the weights for a single neuron, and then later we'll generalize to a deep network.

So a single neuron might take four input features x1 through x4, compute some a = g(z), and end up with some output y-hat. Later on, for a deeper network, these inputs will be the activations of some layer a[l], but for now let's just call them x. So z is going to be equal to w1x1 + w2x2 + ... + wnxn, and let's set b = 0, so we can just ignore b for now.

So in order to make z not blow up and not become too small, you notice that the larger n is, the smaller you want each wi to be, right? Because z is the sum of the wixi, if you're adding up a lot of these terms, you want each of these terms to be smaller. One reasonable thing to do would be to set the variance of wi to be equal to 1/n, where n is the number of input features going into the neuron.
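A quick numerical sketch of this idea (my own illustration, not code from the lecture): if each wi has variance 1/n and the inputs have variance 1, the variance of z = sum of wixi stays near 1 no matter how large n gets.

```python
import numpy as np

# Illustrative check: scale weights by sqrt(1/n) and see that the
# variance of z does not grow with the number of inputs n.
rng = np.random.default_rng(0)

for n in [4, 100, 1000]:
    W = rng.standard_normal((5000, n)) * np.sqrt(1.0 / n)  # 5000 trial neurons, Var(w_i) = 1/n
    x = rng.standard_normal((n, 200))                      # unit-variance inputs
    z = W @ x                                              # z = sum_i w_i * x_i, per neuron per input
    print(f"n={n:4d}  Var(z)={z.var():.3f}")               # stays near 1 as n grows
```

Without the sqrt(1/n) scaling, the variance of z would grow linearly with n, which is exactly the blow-up the lecture is describing.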

So in practice, what you can do is set the weight matrix W for a certain layer to be np.random.randn, with whatever the shape of the matrix is, times the square root of 1 over the number of features fed into each neuron. Here that's going to be n[l-1], because that's the number of units feeding into each of the units in layer l. It turns out that if you're using a ReLU activation function, then rather than 1/n, setting the variance to 2/n works a little bit better.

So you often see that in initialization, especially if you're using a ReLU activation function, so if g[l](z) is ReLU(z). And, depending on how familiar you are with random variables, it turns out that taking a Gaussian random variable and multiplying it by the square root of this term sets the variance to 2/n. The reason I went from n to this n superscript l-1 is that in the single-neuron example we had n input features, but in the more general case each of the units in layer l would have n[l-1] inputs.

So if the input features or activations are roughly mean 0 and variance 1, then this causes z to also take on a similar scale. This doesn't solve the problem, but it definitely helps reduce the vanishing and exploding gradients problem, because it sets each of the weight matrices W so that it's not too much bigger than 1 and not too much less than 1, so it doesn't explode or vanish too quickly.
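The recipe above can be sketched as a small helper (a minimal sketch; the function name and layer sizes are my own illustration, not course code):

```python
import numpy as np

def initialize_layer(n_prev, n_curr, activation="relu"):
    """Return (W, b) for one layer, with weight variance scaled by fan-in n_prev."""
    if activation == "relu":
        scale = np.sqrt(2.0 / n_prev)   # variance 2/n[l-1], better for ReLU
    else:
        scale = np.sqrt(1.0 / n_prev)   # variance 1/n[l-1], e.g. for tanh
    W = np.random.randn(n_curr, n_prev) * scale
    b = np.zeros((n_curr, 1))           # the bias can simply start at zero
    return W, b

W1, b1 = initialize_layer(n_prev=4, n_curr=3)   # 4 input features feeding 3 units
```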

Let me just mention some other variants. The version we just described assumes a ReLU activation function, and it comes from a paper by He et al. A few other variants: if you're using a tanh activation function, then there's a paper showing that instead of the constant 2, it's better to use the constant 1, so the variance is 1/n[l-1] instead of 2/n[l-1], and you multiply by the square root of that. So this square root term replaces the one we had before, and you use it if you're using a tanh activation function. This is called Xavier initialization.

Another version, taught by Yoshua Bengio and his colleagues, which you might see in some papers, is to use the formula square root of 2/(n[l-1] + n[l]), which has some other theoretical justification. But I would say, if you're using a ReLU activation function, which is really the most common activation function, I would use the 2/n[l-1] formula. If you're using tanh, you could try the Xavier version instead, and some authors will also use the Bengio version. In practice, I think all of these formulas just give you a starting point; they give you a default value to use for the variance of the initialization of your weight matrices.
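The three variance choices mentioned here can be put side by side (a sketch as I understand them; the helper name and variant labels are my own, following common usage):

```python
import numpy as np

def init_scale(n_in, n_out, variant="he"):
    """Standard deviation for one layer's weights; n_in = n[l-1], n_out = n[l]."""
    if variant == "he":        # for ReLU: Var(W) = 2 / n[l-1]
        return np.sqrt(2.0 / n_in)
    if variant == "xavier":    # for tanh: Var(W) = 1 / n[l-1]
        return np.sqrt(1.0 / n_in)
    if variant == "bengio":    # Var(W) = 2 / (n[l-1] + n[l])
        return np.sqrt(2.0 / (n_in + n_out))
    raise ValueError(f"unknown variant: {variant}")

# e.g. a layer with 512 inputs and 256 units, ReLU activation:
W = np.random.randn(256, 512) * init_scale(512, 256, "he")
```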

If you wish, the variance here could be another one of your hyperparameters to tune: you could have another parameter that multiplies into this formula and tune that multiplier as part of your hyperparameter search. Sometimes tuning this hyperparameter has a modest-sized effect. It's not one of the first hyperparameters I would usually try to tune, but I've also seen some problems where tuning it helps a reasonable amount. It's usually lower down for me in terms of how important it is relative to the other hyperparameters you can tune.
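The multiplier idea in the last paragraph could be sketched like this (the helper name and gain values are my own illustration, not from the lecture):

```python
import numpy as np

def init_weights(n_in, n_out, gain=1.0):
    """He-style initialization with a tunable multiplier on the default scale."""
    # gain multiplies the default sqrt(2/n[l-1]) standard deviation
    return np.random.randn(n_out, n_in) * gain * np.sqrt(2.0 / n_in)

# the gain would be searched alongside your other hyperparameters:
for gain in [0.5, 1.0, 2.0]:
    W = init_weights(512, 256, gain=gain)
```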

So I hope that gives you some intuition about the problem of vanishing and exploding gradients, as well as how to choose a reasonable scaling for initializing the weights. Hopefully that makes your weights not explode too quickly and not decay to zero too quickly, so you can train a reasonably deep network without the weights or the gradients exploding or vanishing too much. When you train deep networks, this is another trick that will help your neural networks train much more quickly.
