0:00

Gradient checking is a technique that's helped me save tons of time, and

helped me find bugs in my implementations of backpropagation many times.

Let's see how you could use it too to debug, or

to verify that your implementation of backprop is correct.

So your neural network will have some set of parameters, W[1], b[1] and so on up to W[L], b[L].

So to implement gradient checking, the first thing you should do is take all your

parameters and reshape them into a giant vector theta.

So what you should do is take W[1], which is a matrix, and reshape it into a vector.

You've got to take all of these W's and reshape them into vectors, and then concatenate

all of these things, so that you have a giant vector theta.

So whereas the cost function J was a function of the W's and

b's, you now have the cost function J being just a function of theta.

Next, with the W's and b's ordered the same way,

you can also take dW[1], db[1] and so on, and concatenate them into a big,

giant vector dtheta of the same dimension as theta.

So same as before, you reshape dW[1] into a vector; db[1] is already a vector.

You reshape dW[L], and so on for all of the dW's, which are matrices.

Remember, dW[1] has the same dimension as W[1], and

db[1] has the same dimension as b[1].

So with the same sort of reshaping and concatenation operation,

you can reshape all of these derivatives into a giant vector dtheta,

which has the same dimension as theta.

So the question is, now, is dtheta the gradient or

the slope of the cost function J?

So here's how you implement gradient checking, and

I often abbreviate gradient checking to grad check.

So first, remember that J is now a function of the giant parameter vector

theta, right?

So J expands to a function of theta 1, theta 2, theta 3, and so on,

2:06

whatever the dimension of this giant parameter vector theta is.

So to implement grad check, what you're going to do is implement a loop so

that for each i, so for each component of theta,

you compute dtheta_approx[i] to be,

and let me take a two-sided difference,

J of theta 1, theta 2, up to theta i plus epsilon, and so on.

So you just increase theta i by epsilon, and keep everything else the same.

And because we're taking a two-sided difference,

we're going to do the same on the other side, with theta i minus epsilon,

and all of the other elements of theta left alone.

And then we'll take this difference, and divide it by 2 epsilon.

And what we saw from the previous video is that

this should be approximately equal to dtheta[i],

which is supposed to be the partial derivative of J with respect to theta i,

if dtheta[i] is the derivative of the cost function J.

So what you're going to do is compute this for every value of i.

And at the end, you end up with two vectors.

You end up with this dtheta_approx, and

this is going to be the same dimension as dtheta.

And both of these are in turn the same dimension as theta.
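Here is a minimal sketch of that loop, using a toy cost function (the quadratic J below is hypothetical, chosen because its true gradient is known to be 2 * theta):

```python
import numpy as np

def grad_approx(J, theta, eps=1e-7):
    """Two-sided difference: nudge each component of theta by +/- eps."""
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_plus[i] += eps          # increase theta i, keep everything else the same
        theta_minus = theta.copy()
        theta_minus[i] -= eps         # same on the other side, minus epsilon
        dtheta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * eps)
    return dtheta_approx

# Toy cost J(theta) = sum(theta_i^2), whose true gradient is 2 * theta.
J = lambda t: float(np.sum(t ** 2))
theta = np.array([1.0, -2.0, 0.5])
print(grad_approx(J, theta))   # close to [2., -4., 1.]
```

Note the loop calls J once per component per side, so grad check is slow; it is a debugging tool, not something to run during training.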

And what you want to do is check if these vectors are approximately equal to

each other.

So, in detail, how do you define whether or

not two vectors are really reasonably close to each other?

What I do is the following.

I would compute the distance between these two vectors,

dtheta_approx minus dtheta, so just the L2 norm of this.

Notice there's no square on top, so

this is the sum of squares of the elements of the difference, and

then you take a square root to get the Euclidean distance.

And then, just to normalize by the lengths of these vectors,

divide by the norm of dtheta_approx plus the norm of dtheta,

just the Euclidean lengths of these vectors.

And the role of the denominator is just in case any of these vectors are really small

or really large; the denominator turns this formula into a ratio.
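That ratio can be written as a short function; the vectors below are made-up examples showing how nearly equal vectors give a tiny ratio while one corrupted component inflates it:

```python
import numpy as np

def grad_check_ratio(dtheta_approx, dtheta):
    """||dtheta_approx - dtheta||_2 / (||dtheta_approx||_2 + ||dtheta||_2)."""
    num = np.linalg.norm(dtheta_approx - dtheta)
    den = np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)
    return num / den

d = np.array([2.0, -4.0, 1.0])
print(grad_check_ratio(d + 1e-9, d))                        # tiny: vectors nearly equal
print(grad_check_ratio(d + np.array([0.0, 1.0, 0.0]), d))   # much larger: one bad component
```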

When I implement this in practice,

I use epsilon equal to maybe 10 to the minus 7.

And with this value of epsilon, if you find that this formula gives you

a value like 10 to the minus 7 or smaller, then that's great.

It means that your derivative approximation is very likely correct.

This is just a very small value.

If it's maybe on the range of 10 to the minus 5, I would take a careful look.

Maybe this is okay.

But I might double-check the components of this vector, and

make sure that none of the components are too large.

And if some of the components of this difference are very large,

then maybe you have a bug somewhere.

And if this formula gives you a value like 10 to the minus 3,

then I would be much more concerned that maybe there's a bug somewhere.

You should really be getting values much smaller than 10 to the minus 3.

If it's any bigger than 10 to the minus 3, then I would be quite concerned.

I would be seriously worried that there might be a bug.

You should then look at the individual

components of theta to see if there's a specific value of i for

which dtheta_approx[i] is very different from dtheta[i].

And use that to try to track down whether or

not some of your derivative computations might be incorrect.

And if, after some amount of debugging, it finally ends up being this

kind of very small value, then you probably have a correct implementation.
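The rules of thumb above might be encoded as follows (the exact bucket boundaries are heuristics from this lecture, assuming epsilon around 10^-7, not hard guarantees):

```python
def grad_check_verdict(ratio):
    """Map the grad check ratio to the lecture's rough interpretation."""
    if ratio < 1e-7:
        return "great: derivative approximation is very likely correct"
    if ratio < 1e-5:
        return "okay, but double-check the large components of the difference"
    if ratio < 1e-3:
        return "concerning: look carefully for a bug"
    return "seriously worried: there is probably a bug"

print(grad_check_verdict(1e-8))
```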

So when implementing a neural network,

what often happens is I'll implement forward prop, implement backprop,

and then I might find that this grad check has a relatively big value.

And then I will suspect that there must be a bug, go in, debug, debug, debug.

And after debugging for a while, if I find that it passes grad check with a small

value, then I can be much more confident that it's correct.

So you now know how gradient checking works.

This has helped me find lots of bugs in my implementations of neural nets,

and I hope it'll help you too.

In the next video, I want to share with you some tips or

some notes on how to actually implement gradient checking.

Let's go on to the next video.
