0:54

In this case all of these parameters, theta one, theta two, theta three and so

on will be heavily penalized.

And so we end up with most of these parameter values being close to zero.

And the hypothesis will be roughly h of x just equal or

approximately equal to theta zero.

And so we end up with the hypothesis that more or less looks like that.

It's more or less a flat, constant straight line, and so

this hypothesis has high bias, and it badly underfits this dataset.

A horizontal straight line is just not a very good model for this dataset.
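
The large-lambda case can be written out as a short derivation. This is only a sketch: the transcript does not show the objective itself, so the formula below assumes the usual regularized squared-error cost, with m training examples, n features, and theta zero left unpenalized.

```latex
J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2
          + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2

% For very large \lambda the penalty dominates, so \theta_1, \dots, \theta_n \to 0:
h_\theta(x) \approx \theta_0

% Minimizing the remaining squared-error term over \theta_0 alone gives the mean label:
\theta_0 \approx \frac{1}{m}\sum_{i=1}^{m} y^{(i)}
```

So under these assumptions the flat line sits at roughly the average of the training labels, which is why it underfits.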

At the other extreme is if we have a very small

value of lambda such as if lambda were equal to zero.

In that case, given that we're fitting a high-order polynomial,

this is our usual overfitting setting.

In that case given that we're fitting a high-order polynomial basically without

regularization or with very minimal regularization we end up with our usual

high variance overfitting setting.

Basically, if lambda is equal to zero, we're just fitting it

without regularization, so the hypothesis overfits.

And it is only if we have some intermediate value of lambda that is neither

too large nor too small that we end up with parameters theta that give us

a reasonable fit to this data.

So, how can we automatically choose a good value for

the regularization parameter lambda?

Just to reiterate, here's our model and here's our learning algorithm objective.

For the setting where we're using regularization,

then we define J train of theta to be something different.

To be the optimization objective but without the regularization term.

Previously, in an earlier video, when we were not using regularization,

I defined J train of theta to be the same as J of theta, the cost function.

But when we're using regularization, with the extra lambda term, we're going to define J train

to be just my sum of squared errors on the training set, or my average squared error

on the training set, without taking into account that regularization term.

And similarly, we're also going to define the cross-validation set error and

the test set error, as before, to be the average sum of squared errors on

the cross-validation and the test sets.

So, just to summarize, my definitions of J train, Jcv and

J test are just the average squared error, or one half of the average squared error,

on my training, validation, and test sets, without the extra regularization term.
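
Written as formulas, the three error measures might look like the following. The symbols m, m_cv and m_test (the sizes of the training, cross-validation and test sets) and the subscripted example notation are assumptions following the usual convention; the transcript only describes them in words.

```latex
J_{\text{train}}(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2

J_{\text{cv}}(\theta) = \frac{1}{2m_{\text{cv}}}\sum_{i=1}^{m_{\text{cv}}}
    \bigl(h_\theta(x_{\text{cv}}^{(i)}) - y_{\text{cv}}^{(i)}\bigr)^2

J_{\text{test}}(\theta) = \frac{1}{2m_{\text{test}}}\sum_{i=1}^{m_{\text{test}}}
    \bigl(h_\theta(x_{\text{test}}^{(i)}) - y_{\text{test}}^{(i)}\bigr)^2
```

None of the three contains a lambda term: lambda only affects which theta the training step produces, not how that theta is scored.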

So this is how we can automatically choose the regularization parameter lambda.

What I usually do is maybe have some range of values of lambda I want to try out.

So, I might be considering not using regularization,

or here are a few values I might try.

I might be considering lambda equal to 0.01, 0.02, 0.04, and so on.

And I usually set these up in multiples of two until some maybe larger value.

If I were doing this in multiples of two I should end up with 10.24

instead of 10 exactly.

But, this is close enough and the third or

fourth decimal places won't affect your result that much.

So this gives me maybe 12 different models that I'm trying to select amongst

corresponding to 12 different values of the regularization parameter lambda.

And of course you can also go to values less than 0.01 or values larger than ten,

but I've just truncated that here for convenience.
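
As a small sketch, the list of candidate values could be generated like this. The variable name `lambdas` is just an illustrative choice, and the list ends at 10.24 rather than the rounded 10.

```python
# Candidate regularization strengths: 0 (no regularization), then 0.01
# doubled repeatedly, which lands on 10.24 after ten doublings.
lambdas = [0.0] + [round(0.01 * 2 ** k, 2) for k in range(11)]

print(len(lambdas))  # number of candidate models
print(lambdas)
```

Doubling gives a roughly even spread on a log scale, which is usually what matters for a parameter like lambda.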

Given these 12 models, what we can do is then the following.

We can take this first model with lambda equal zero, and minimize my cost

function J of theta, and this will give me some parameter vector theta.

And similar to the earlier video, let me just denote this as theta superscript one.

And then I can take my second model,

with lambda set to 0.01, and minimize my cost function,

now using lambda equals 0.01 of course, to get some different parameter vector theta.

Let me denote that theta two.

And after that I end up with theta three, so this is the parameter for my third model, and so

on until the final model with lambda set to ten, or

10.24, for which I end up with theta 12.

Next I can take all of these hypotheses, all of these parameters and

use my cross validation set to evaluate them.

So I can look at my first model,

my second model, fit with these different values of the regularization parameter, and

evaluate them on my cross-validation set: basically, measure the average squared

error of each of these parameter vectors theta on my cross-validation set.

And I would then pick which ever one of these 12 models gives me the lowest

error on the cross-validation set.

And let's say, for the sake of this example, that I end up picking theta five,

the fifth model, because that has the lowest cross-validation error.

Having done that,

finally, what I would do if I want to report a test

set error is to take the parameter theta five that I've selected and

look at how well it does on my test set.

And once again,

here it is as if we've fit this parameter theta to my cross-validation set,

which is why I'm setting aside a separate test set

that I'm going to use to get a better estimate of how well my

parameter vector theta will generalize to previously unseen examples.
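
The whole procedure can be sketched end to end in plain Python. Everything concrete here is an assumption made for illustration: the synthetic data, the degree-5 polynomial, the three-way split, and the helper names `features`, `solve`, `fit` and `error`. The fit solves the regularized normal equations directly rather than using any particular optimizer.

```python
import math

DEGREE = 5  # assumed polynomial degree for this toy example

def features(x):
    """Map a scalar x to [1, x, x^2, ..., x^DEGREE]."""
    return [x ** p for p in range(DEGREE + 1)]

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][c] * z[c] for c in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z

def fit(X, y, lam):
    """Minimize the regularized cost: (X^T X + lam * L) theta = X^T y,
    where L is the identity with a 0 for theta_0 (not penalized)."""
    n = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    for j in range(1, n):
        A[j][j] += lam
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(n)]
    return solve(A, b)

def error(theta, X, y):
    """One half of the average squared error, with NO regularization term."""
    preds = [sum(t * f for t, f in zip(theta, row)) for row in X]
    return sum((p - yi) ** 2 for p, yi in zip(preds, y)) / (2 * len(y))

# Synthetic, deterministic data (an assumption): a sine curve plus a wiggle.
xs = [-1 + 2 * i / 23 for i in range(24)]
ys = [math.sin(3 * x) + 0.3 * math.cos(50 * i) for i, x in enumerate(xs)]

def split(k):
    """Every third point starting at k, as (feature rows, labels)."""
    idx = range(k, 24, 3)
    return [features(xs[i]) for i in idx], [ys[i] for i in idx]

X_train, y_train = split(0)
X_cv, y_cv = split(1)
X_test, y_test = split(2)

lambdas = [0.0] + [0.01 * 2 ** k for k in range(11)]

# Step 1: train one theta per lambda, using the regularized objective.
thetas = [fit(X_train, y_train, lam) for lam in lambdas]
# Step 2: score each theta on the cross-validation set, without the lambda term.
cv_errors = [error(t, X_cv, y_cv) for t in thetas]
# Step 3: keep the model with the lowest cross-validation error.
best = min(range(len(lambdas)), key=lambda i: cv_errors[i])
# Step 4: report generalization error on the untouched test set.
test_error = error(thetas[best], X_test, y_test)

print("chosen lambda:", lambdas[best])
print("cross-validation error:", cv_errors[best])
print("test error:", test_error)
```

Note that lambda appears only inside `fit`; the cross-validation and test scores use the plain squared error, matching the definitions of Jcv and J test.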

So that's model selection applied to selecting

the regularization parameter lambda.

The last thing I'd like to do in this video is get a better understanding of

how cross-validation and

training error vary as we vary the regularization parameter lambda.

And so just a reminder, all right?

That was our original cost function J of theta.

But for this purpose,

we're going to define training error without using a regularization parameter,

and cross-validation error without using the regularization parameter.

7:35

And we run a larger risk of overfitting.

Whereas if lambda is large, that is, if we were on the right part

of this horizontal axis, then with a large value of lambda

we run a higher risk of having a bias problem.

So, if you plot J train and Jcv, what you find is that for small values of lambda,

you can fit the training set relatively well because you're not regularizing.

So for small values of lambda the regularization term basically goes away

and you're just minimizing pretty much the squared error.

So when lambda is small, you end up with a small value for

J train, whereas if lambda is large, then you have a high bias problem, and

then you might not fit your training set well, so you end up with a value up there.

So J train of theta will tend to increase when lambda increases,

because a large value of lambda corresponds to high bias,

where you might not even fit your training set well.
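
To see J train rising with lambda in the simplest possible setting, here is a toy sketch with a single feature, where the regularized fit has a closed form. The single-feature model, the five data points, and the names `fit_line` and `j_train` are all assumptions for illustration; the lecture's example uses a high-order polynomial instead.

```python
def fit_line(xs, ys, lam):
    """Minimize (1/2m) * sum((t0 + t1*x - y)^2) + (lam/2m) * t1^2 in closed form."""
    m = len(xs)
    x_bar, y_bar = sum(xs) / m, sum(ys) / m
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    t1 = sxy / (sxx + lam)   # slope shrinks toward 0 as lam grows
    t0 = y_bar - t1 * x_bar  # intercept is not penalized
    return t0, t1

def j_train(theta, xs, ys):
    """One half of the average squared training error, without the lambda term."""
    t0, t1 = theta
    return sum((t0 + t1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))

# Made-up training data lying close to a straight line.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

lambdas = [0.0] + [0.01 * 2 ** k for k in range(11)]
errors = [j_train(fit_line(xs, ys, lam), xs, ys) for lam in lambdas]

for lam, err in zip(lambdas, errors):
    print(f"lambda={lam:<6g} J_train={err:.6f}")
```

With only one penalized parameter this toy model cannot really overfit, so it shows the high-bias side of the picture only: the slope is pulled toward zero and the training error climbs.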

Whereas a small value of lambda corresponds to if you can

8:51

Where over here on the right if we have a large value of lambda,

we may end up underfitting.

And so this is the bias regime.

And so the cross-validation error will be high.

Let me just label that.

So that's Jcv of theta, because with high bias we won't be

doing well on the cross-validation set.

Whereas, here on the left, this is the high variance regime, where

if we have too small a value of lambda, then we may be overfitting the data.

And so, if we're overfitting the data, then the cross-validation error

will also be high.

And so this is what the cross-validation error and

what the training error may look like as we vary the regularization parameter lambda.

And so once again it will often be some intermediate value

of lambda that is, quote, just right, or that works best in

terms of having a small cross-validation error or a small test set error.

The curves I've drawn here are somewhat cartoonish and

somewhat idealized.

On a real data set the curves you get may end up looking a little bit more messy

and just a little bit more noisy than this.

But for some data sets you will really see these sorts of trends, and

by looking at the plot of the held-out cross-validation error,

you can either manually or automatically try to select a point

that minimizes the cross-validation error and

select a value of lambda corresponding to low cross-validation error.

When I'm trying to pick the regularization parameter lambda for

a learning algorithm, I often find that plotting a figure like this one shown here

helps me understand better what's going on, and helps me verify that

I am indeed picking a good value for the regularization parameter lambda.

So hopefully, that gives you more insight into regularization and

its effects on the bias and variance of a learning algorithm.

By now, you've seen bias and variance from a lot of different perspectives.

And what I'd like to do in the next video is take all of the insights that

we've gone through and build on them to put together a diagnostic

that's called learning curves, which is a tool that I often use to try to

diagnose if a learning algorithm may be suffering from a bias problem or

a variance problem, or a little bit of both.