And again, this could be the classification error on the dev set, or something

like the cost function, such as the logistic loss or the log loss, on the dev set.

Now what you find is that your dev set error will usually go down for

a while, and then it will increase from there.

So what early stopping does is, you will say, well,

it looks like your neural network was doing best around that iteration, so

we just want to stop training your neural network halfway through and

take whatever value achieved this dev set error.
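The idea can be sketched in a few lines of code. This is a minimal illustration, not the lecture's implementation: `train_step` and `dev_error` are hypothetical stand-ins for one gradient update and a dev-set evaluation.

```python
import copy

def train_with_early_stopping(model, num_iterations, train_step, dev_error):
    """Track dev-set error each iteration and keep the best parameters seen.

    `train_step` and `dev_error` are placeholder callables: one performs a
    single training update on `model`, the other returns dev-set error.
    """
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    for _ in range(num_iterations):
        train_step(model)            # one gradient-descent update
        err = dev_error(model)       # evaluate on the dev set
        if err < best_error:         # dev error is still going down
            best_error = err
            best_model = copy.deepcopy(model)
    # Return the parameters from the iteration with the lowest dev error,
    # i.e. "stop" at the point where the dev curve bottomed out.
    return best_model, best_error
```

In practice you would also add a patience counter so training halts once the dev error has stopped improving for some number of iterations, rather than running to the end.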

So why does this work?

Well, when you haven't run many iterations for

your neural network yet, your parameters w will be close to zero.

Because with random initialization you probably initialize w to small random

values, so before you train for a long time, w is still quite small.

And as you iterate, as you train, w will get bigger and bigger and bigger until

here maybe you have a much larger value of the parameters w for your neural network.

So what early stopping does is, by stopping halfway, you have only

a mid-size value of w.

And so, similar to L2 regularization, by picking a neural network with a smaller

norm for your parameters w, hopefully your neural network is overfitting less.
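You can see this norm-growth effect in a toy experiment of my own (not from the lecture): with small random initialization, the norm of w grows as gradient descent runs, so stopping earlier leaves you with a smaller-norm w.

```python
import numpy as np

# Toy linear-regression problem (illustrative data, chosen arbitrarily).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([2.0, -3.0, 1.5, 0.5, -1.0])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = 0.01 * rng.normal(size=5)          # small random initialization: ||w|| near 0
lr = 0.01
norms = []
for _ in range(200):
    grad = X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
    w -= lr * grad
    norms.append(np.linalg.norm(w))

# norms[t] grows over training, so an earlier stop means a smaller ||w||,
# loosely mimicking the shrinkage effect of L2 regularization.
```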

And the term early stopping refers to the fact that you're just

stopping the training of your neural network earlier.

I sometimes use early stopping when training a neural network.

But it does have one downside, let me explain.

I think of the machine learning process as comprising several different steps.

One is that you want an algorithm to optimize the cost function J, and

we have various tools to do that, such as gradient descent.

And then we'll talk later about other algorithms, like momentum and

RMSprop and Adam and so on.

But after optimizing the cost function J, you also want to not overfit.

And we have some tools to do that, such as regularization,

getting more data, and so on.

Now in machine learning, we already have so many hyperparameters to search over.

It's already very complicated to choose among the space of possible algorithms.

And so I find machine learning easier to think about

when you have one set of tools for optimizing the cost function J,

and when you're focusing on optimizing the cost function J,

All you care about is finding w and b, so that J(w,b) is as small as possible.

You just don't think about anything else other than reducing this.

And then it's a completely separate task to not overfit,

in other words, to reduce variance.

And when you're doing that, you have a separate set of tools for doing it.

And this principle is sometimes called orthogonalization.

And there's this idea, that you want to be able to think about one task at a time.

I'll say more about orthogonalization in a later video, so

if you don't fully get the concept yet, don't worry about it.

But, to me the main downside of early stopping is that

this couples these two tasks.

So you no longer can work on these two problems independently,

because by stopping gradient descent early,

you're sort of breaking whatever you're doing to optimize cost function J,

because now you're not doing a great job reducing the cost function J.

You've sort of not done that very well.

And then you're also simultaneously trying to not overfit.

So instead of using different tools to solve the two problems,

you're using one that kind of mixes the two.

And this just makes the set of