
When you train your neural network, it's important to initialize the weights randomly. For logistic regression, it was okay to initialize the weights to zero. But for a neural network, if you initialize the weights, the parameters, to all zeros and then apply gradient descent, it won't work. Let's see why.

Â So you have here two input features, so

Â n0=2, and two hidden units, so n1=2.

Â And so the matrix associated with the hidden layer,

Â w 1, is going to be two-by-two.

Â Let's say that you initialize it to all 0s, so 0 0 0 0, two-by-two matrix.

Â And let's say B1 is also equal to 0 0.

Â It turns out initializing the bias terms b to 0 is actually okay,

Â but initializing w to all 0s is a problem.

Â So the problem with this formalization is that for

Â any example you give it, you'll have that a1,1 and

Â a1,2, will be equal, right?

Â So this activation and this activation will be the same,

Â because both of these hidden units are computing exactly the same function.

Â And then, when you compute backpropagation,

Â it turns out that dz11 and

Â dz12 will also be the same colored by symmetry, right?

Â Both of these hidden units will initialize the same way.

Â Technically, for what I'm saying,

Â I'm assuming that the outgoing weights or also identical.

Â So that's w2 is equal to 0 0.

Â But if you initialize the neural network this way,

Â then this hidden unit and this hidden unit are completely identical.

Â Sometimes you say they're completely symmetric,

Â which just means that they're completing exactly the same function.

Â And by kind of a proof by induction,

Â it turns out that after every single iteration of training your two hidden

Â units are still computing exactly the same function.

Â Since [INAUDIBLE] show that dw will be a matrix that looks like this.

Â Where every row takes on the same value.

Â So we perform a weight update.

Â So when you perform a weight update, w1 gets updated as w1- alpha times dw.

Â You find that w1, after every iteration,

Â will have the first row equal to the second row.

Â So it's possible to construct a proof by induction that if you

Â initialize all the ways, all the values of w to 0,

Â then because both hidden units start off computing the same function.

Â And both hidden the units have the same influence on the output unit,

Â then after one iteration, that same statement is still true,

Â the two hidden units are still symmetric.

Â And therefore, by induction, after two iterations, three iterations and so on,

Â no matter how long you train your neural network,

Â both hidden units are still computing exactly the same function.

Â And so in this case, there's really no point to having more than one hidden unit.

Â Because they are all computing the same thing.

Â And of course, for larger neural networks, let's say of three features and

Â maybe a very large number of hidden units,

Â a similar argument works to show that with a neural network like this.

Â [INAUDIBLE] drawing all the edges, if you initialize the weights to zero,

Â then all of your hidden units are symmetric.

Â And no matter how long you're upgrading the center,

Â all continue to compute exactly the same function.

Â So that's not helpful, because you want the different

Â hidden units to compute different functions.

Â The solution to this is to initialize your parameters randomly.

Â So here's what you do.

Â You can set w1 = np.random.randn.

Â This generates a gaussian random variable (2,2).

Â And then usually, you multiply this by very small number, such as 0.01.

Â So you initialize it to very small random values.

Â And then b, it turns out that b does not have the symmetry problem,

Â what's called the symmetry breaking problem.

Â So it's okay to initialize b to just zeros.

Â Because so long as w is initialized randomly,

Â you start off with the different hidden units computing different things.

Â And so you no longer have this symmetry breaking problem.

Â And then similarly, for w2, you're going to initialize that randomly.

Â And b2, you can initialize that to 0.

Â So you might be wondering, where did this constant come from and why is it 0.01?

Â Why not put the number 100 or 1000?

Â Turns out that we usually prefer to initialize

Â the weights to very small random values.

Â Because if you are using a tanh or sigmoid activation function, or

Â the other sigmoid, even just at the output layer.

Â If the weights are too large,

Â then when you compute the activation values,

Â remember that z[1]=w1 x + b.

Â And then a1 is the activation function applied to z1.

Â So if w is very big, z will be very, or at least some

Â values of z will be either very large or very small.

Â And so in that case, you're more likely to end up at these fat parts of the tanh

Â function or the sigmoid function, where the slope or the gradient is very small.

Â Meaning that gradient descent will be very slow.

Â So learning was very slow.

Â So just a recap, if w is too large, you're more likely to end up

Â even at the very start of training, with very large values of z.

Â Which causes your tanh or your sigmoid activation function to be saturated,

Â thus slowing down learning.

Â If you don't have any sigmoid or

Â tanh activation functions throughout your neural network, this is less of an issue.

Â But if you're doing binary classification, and your output unit is a sigmoid

Â function, then you just don't want the initial parameters to be too large.

Â So that's why multiplying by 0.01 would be something reasonable to try, or

Â any other small number.

Â And same for w2, right?

Â This can be random.random.

Â I guess this would be 1 by 2 in this example, times 0.01.

Â Missing an s there.

Â So finally, it turns out that sometimes they can be better constants than 0.01.

Â When you're training a neural network with just one hidden layer,

Â it is a relatively shallow neural network, without too many hidden layers.

Â Set it to 0.01 will probably work okay.

Â But when you're training a very very deep neural network,

Â then you might want to pick a different constant than 0.01.

Â And in next week's material, we'll talk a little bit about how and

Â when you might want to choose a different constant than 0.01.

Â But either way, it will usually end up being a relatively small number.

Â So that's it for this week's videos.

Â You now know how to set up a neural network of a hidden layer,

Â initialize the parameters, make predictions using.

Â As well as compute derivatives and implement gradient descent,

Â using backprop.

Â So that, you should be able to do the quizzes,

Â as well as this week's programming exercises.

Â Best of luck with that.

Â I hope you have fun with the problem exercise, and

Â look forward to seeing you in the week four materials.
