0:00

By now you've seen a couple of different learning algorithms: linear regression and logistic regression. They work well for many problems, but when you apply them to certain machine learning applications, they can run into a problem called overfitting that can cause them to perform very poorly. What I'd like to do in this video is explain what this overfitting problem is, and in the next few videos after this, we'll talk about a technique called regularization that will allow us to ameliorate, or reduce, this overfitting problem and get these learning algorithms to work much better.

So, what is overfitting? Let's keep using our running example of predicting housing prices with linear regression, where we want to predict the price as a function of the size of the house. One thing we could do is fit a linear function to this data, and if we do that, maybe we get that sort of straight-line fit to the data. But this isn't a very good model. Looking at the data, it seems pretty clear that as the size of the house increases, the housing prices plateau, or kind of flatten out, as we move to the right. So this algorithm doesn't fit the training set very well, and we call this problem underfitting; another term for this is that the algorithm has high bias. Both of these roughly mean that it's just not fitting even the training data very well. The term bias is kind of a historical, or technical, one, but the idea is that if we fit a straight line to the data, it's as if the algorithm has a very strong preconception, or a very strong bias, that housing prices are going to vary linearly with their size. And despite the data to the contrary, despite the evidence to the contrary, this preconception, this bias, still causes it to fit a straight line, and this ends up being a poor fit to the data. Now, in the middle, we could fit a quadratic function to the data, and with this data set, if we fit a quadratic function, maybe we get that kind of curve, and that works pretty well. And at the other extreme would be if we were to fit, say, a fourth-order polynomial to the data. So here we have five parameters, theta zero through theta four, and with that we can actually fit a curve that passes through all five of our training examples. We might get a curve that looks like this. That, on the one hand, seems to do a very good job fitting the training set; it passes through all of my data, at least. But this is a very wiggly curve, going up and down all over the place, and we don't actually think that's such a good model for predicting housing prices.

So, this problem we call overfitting, and another term for this is that the algorithm has high variance. The term high variance is another sort of historical, or technical, one, but the intuition is that if we're fitting such a high-order polynomial, then the hypothesis can fit almost any function, and the space of possible hypotheses is just too large, too variable, and we don't have enough data to constrain it to give us a good hypothesis. So that's overfitting. For the case in the middle there isn't really a name, but I'm just going to write that a second-degree polynomial, a quadratic function, seems to be just right for fitting this data. To recap a bit, the problem of overfitting comes when we have too many features; then the learned hypothesis may fit the training set very well, so your cost function may actually be very close to zero, or maybe even exactly zero. But you may then end up with a curve like this that tries too hard to fit the training set, so that it fails to generalize to new examples and fails to predict prices on new examples well. Here, the term generalize refers to how well a hypothesis applies even to new examples, that is, to data, to houses, that it hasn't seen in the training set.
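The underfit, just-right, overfit progression sketched on the board can also be seen numerically. Here is a minimal sketch, not from the lecture: the five data points are made up, shaped so that price flattens out as size grows, mirroring the housing example.

```python
import numpy as np

# Made-up housing data (size, price), with prices that flatten out
# as size grows, as in the lecture's sketch.
size = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
price = np.array([2.0, 3.5, 4.3, 4.8, 5.0])

for degree in (1, 2, 4):
    # Least-squares fit of a degree-d polynomial theta_0 + theta_1*x + ...
    coeffs = np.polyfit(size, price, deg=degree)
    mse = np.mean((np.polyval(coeffs, size) - price) ** 2)
    print(f"degree {degree}: training MSE = {mse:.6f}")

# The degree-4 fit has five parameters for five examples, so it passes
# through every point and its training error is numerically zero --
# yet that wiggly curve is the one we expect to generalize worst.
```

Running this shows the training error shrinking to essentially zero as the degree rises, which is exactly the trap: low training error alone does not mean a good hypothesis.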

On this slide, we looked at overfitting for the case of linear regression, but a similar thing can apply to logistic regression as well. Here's a logistic regression example with two features, x1 and x2. One thing we could do is fit logistic regression with just a simple hypothesis like this, where, as usual, g is my sigmoid function. If you do that, you end up with a hypothesis that tries to use maybe just a straight line to separate the positive and the negative examples, and this doesn't look like a very good fit to the data. So, once again, this is an example of underfitting, or of a hypothesis having high bias.

In contrast, if you were to add these quadratic terms to your features, then you could get a decision boundary that might look more like this. And, you know, that's a pretty good fit to the data, probably about as good as we could get on this training set.

And finally, at the other extreme, if you were to fit a very high-order polynomial, if you were to generate lots of high-order polynomial terms as features, then logistic regression may contort itself, trying really hard to find a decision boundary that fits your training data, going to great lengths to contort itself to fit every single training example well. And if the features x1 and x2 are for predicting, say, whether a breast tumor is malignant or benign, this really doesn't look like a very good hypothesis for making predictions. So once again, this is an instance of overfitting, of a hypothesis having high variance, and it's unlikely to generalize well to new examples.
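As a concrete aside, not from the lecture: the feature expansion just described, generating lots of high-order polynomial terms from x1 and x2, can be sketched with a hypothetical helper. Note how quickly the feature count grows; that growth is what gives the decision boundary its freedom to contort.

```python
import numpy as np

def poly_features(x1, x2, degree):
    """All monomials x1^i * x2^j with 1 <= i + j <= degree.

    Hypothetical helper, not from the lecture: expands two raw features
    into the high-order polynomial terms discussed above.
    """
    feats = []
    for total in range(1, degree + 1):
        for i in range(total + 1):
            feats.append((x1 ** i) * (x2 ** (total - i)))
    return np.array(feats)

# Degree 1 keeps just x1 and x2; degree 2 adds x1^2, x1*x2, x2^2;
# by degree 6 there are already 27 features.  More features mean a more
# flexible -- and more easily overfit -- decision boundary.
for d in (1, 2, 6):
    print(d, len(poly_features(0.5, -0.3, d)))
```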

Later in this course, when we talk about debugging and diagnosing things that can go wrong with learning algorithms, we'll give you specific tools to recognize when overfitting, and also when underfitting, may be occurring. But for now, let's talk about what we can do to address overfitting if we think it is occurring. In the previous examples, we had one- or two-dimensional data, so we could just plot the hypothesis, see what was going on, and select the appropriate degree of polynomial.

So earlier, for the housing prices example, we could just plot the hypothesis and maybe see that it was fitting this very wiggly function that goes all over the place predicting housing prices, and we could then use figures like these to select an appropriate degree of polynomial. So plotting the hypothesis could be one way to try to decide what degree of polynomial to use. But that doesn't always work, and in fact, more often, we may have learning problems with a lot of features, where it's not just a matter of selecting the degree of a polynomial. When we have so many features, it also becomes much harder to plot the data and much harder to visualize it in order to decide which features to include or not. So concretely, if we're trying to predict housing prices, sometimes we can just have a lot of different features, and all of the features seem maybe kind of useful. But if we have a lot of features and very little training data, then overfitting can become a problem. In order to address overfitting, there are two main options for things that we can do. The first option is to try to reduce the number of features. One thing that we could do is manually look at the list of features and use that to try to decide which are the more important features, and therefore which features we should keep and which we should throw out.

Later in this class, we'll also talk about model selection algorithms, which are algorithms for automatically deciding which features to keep and which features to throw out. This idea of reducing the number of features can work well and can reduce overfitting, and when we talk about model selection, we'll go into this in much greater depth. But a disadvantage is that by throwing away some of the features, you're also throwing away some of the information you have about the problem. For example, maybe all of those features are actually useful for predicting the price of a house, so maybe we don't actually want to throw some of our information, or some of our features, away.

The second option, which we'll talk about in the next few videos, is regularization. Here, we're going to keep all the features, but we're going to reduce the magnitude, or the values, of the parameters theta j. And this method works well, we'll see, when we have a lot of features, each of which contributes a little bit to predicting the value of y, like we saw in the housing price prediction example, where we could have a lot of features, each of which is somewhat useful, so maybe we don't want to throw them away. So this describes the idea of regularization at a very high level, and I realize that all of these details probably don't make sense to you yet. But in the next video, we'll start to formulate exactly how to apply regularization and exactly what regularization means, and then we'll start to figure out how to use this to make our learning algorithms work well and avoid overfitting.
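As a preview, not derived in this video: one concrete form of regularization for linear regression is ridge regression, which adds an L2 penalty on the parameters. The sketch below uses five made-up (size, price) points to show the shrinking effect on theta; the exact cost function the lecture has in mind is formulated in the next videos, so treat this recipe as an assumption.

```python
import numpy as np

# Five made-up (size, price) points; fit a 4th-order polynomial with and
# without an L2 penalty on the parameters (ridge regression -- one
# concrete form of regularization, used here as an illustrative stand-in).
size = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
price = np.array([2.0, 3.5, 4.3, 4.8, 5.0])
X = np.vander(size, N=5, increasing=True)   # columns: 1, x, x^2, x^3, x^4

def fit(lam):
    # Penalized normal equation: theta = (X'X + lam*I)^-1 X'y.
    # (Conventionally theta_0 is left unpenalized; penalizing it too
    # keeps this sketch short.)
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ price)

theta_plain = fit(0.0)   # interpolates all five points: the wiggly curve
theta_ridge = fit(1.0)   # shrunken parameters: a smoother curve

print(np.linalg.norm(theta_plain), np.linalg.norm(theta_ridge))
# The penalty reduces the magnitudes of the parameters theta_j,
# which is the high-level idea of regularization described above.
```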
