案例学习：预测房价

Loading...

來自 University of Washington 的課程

机器学习：回归

3709 個評分

案例学习：预测房价

從本節課中

Feature Selection & Lasso

A fundamental machine learning task is to select amongst a set of features to include in a model. In this module, you will explore this idea in the context of multiple regression, and describe how such feature selection is important for both interpretability and efficiency of forming predictions. <p> To start, you will examine methods that search over an enumeration of models including different subsets of features. You will analyze both exhaustive search and greedy algorithms. Then, instead of an explicit enumeration, we turn to Lasso regression, which implicitly performs feature selection in a manner akin to ridge regression: A complex model is fit based on a measure of fit to the training data plus a measure of overfitting different than that used in ridge. This lasso method has had impact in numerous applied domains, and the ideas behind the method have fundamentally changed machine learning and statistics. You will also implement a coordinate descent algorithm for fitting a Lasso model. <p>Coordinate descent is another, general, optimization technique, which is useful in many areas of machine learning.

- Emily FoxAmazon Professor of Machine Learning

Statistics - Carlos GuestrinAmazon Professor of Machine Learning

Computer Science and Engineering

[MUSIC]

Okay. So,

in cases where you can't do all subsets, what can we do?

Well, one thing we can do,

which is gonna be option number two in our series of feature selection algorithms.

Is to do a sub-optimal, but

tractable approach to selecting amongst features.

And that's a set of greedy algorithms that we're gonna consider now.

So, one simple algorithm is what's called the forward stepwise algorithm,

where you start with some set of possible features.

And then what you're gonna do is, you're going to just greedily walk through each,

select the next best feature and keep iterating.

So, let's be a little bit more explicit about this.

So, first we're gonna start with the empty set, no features.

Or you could start with a list of features that you know you want to be included

in the model.

So, for example the constant feature which might be your feature, h0.

But then what you're gonna do is, you're gonna fit your model with whatever your

current set of features are to get an estimate of the weights on those features.

Your estimate of your regression coefficients.

And then what you're gonna do is, you're gonna search over all the possible

features that you might add, and choose the one that best improves your fit.

And so, one measure of this might be training errors.

So, you choose the feature that most reduces your training error.

And then you just include that in your model, and you recurse this procedure.

Where now you have a model with some number of features.

And you're gonna search over every possible feature you might wanna add and

choose the next one that drops your training error the most, and keep going.

So, let's visualize this procedure.

For this visualization,

we're gonna start with the picture that we had in our all subset search.

And we're gonna look at how the greedy search differs from

the search we would have done in all subsets.

So at the first step, we start with no features in our model.

So, we're actually gonna be at exactly the same point as we were in all subsets.

So, start at same

training error as all subsets.

Which is equivalent to no features in the model.

And then what we're gonna do is we're gonna search over all our features, and

choose the one that reduces training error the most.

And which one is that gonna be?

It was actually going to be the same one we got for all subsets.

Because at that point, when we we're doing all subsets and

looking over all models of one feature.

Again, we have the same choice of choosing

the best single feature even though we're doing this greedy procedure.

When we start with an empty set, our first best feature is gonna be the same

as the same best feature that you would get doing all subsets.

So again, stay at

all subset solution

For best model with

one feature.

And in that case, remember just like in all subsets,

we choose square feet as the best feature when we have just one feature.

But then things can start to diverge.

Because when we go to our second step in the greedy algorithm,

we're forced to keep square feet in our model.

And we're just greedily gonna search over the next feature.

That's why it's called greedy.

I should've mentioned this previously because we're just doing the quote

unquote, greedy thing of just saying, what's the next best thing, and

not thinking globally about what might be best.

Okay.

So in this case, maybe the next best thing that we could add to the model,

forcing ourselves to keep square feet in our model, is year renovated.

So, when we think about what greedy finds, add two features.

We have square feet living and year renovated.

But if we compare to what our all subsets had,

it was number of bedrooms and number of bathrooms.

So, this is where we see that things can be different.

Because once we're forced to keep square feet,

maybe the next thing that's most relevant to value,

you can have these very large old houses, but those might not be worth as much.

It might be the year that the house was renovated that's really relevant to

the value.

But as we described before, it might be reasonable to think that the best model

with just two features was number of bedrooms and number of bathrooms.

Because those are really key attributes that people use to value whether a house

is a good fit for them.

Well, then we continue with this procedure.

We already had square feet living, we had year renovated.

And now, in this case, our next feature we're gonna add is waterfront,

and we're gonna keep going on this process.

And when's the next time that our greedy procedure

is gonna meet the same solution as all subsets?

So, we started at the same point,

we said after the first iteration, we were at the same point.

Well, again, we'll be at the same point once we get all the way, obviously,

to including all features, coz there's only one such model with all features.

So, there are a couple points that I wanna make about this greedy procedure.

One is the fact that from iteration to iteration, error can never decrease.

And the reason for

that is, even if there's nothing else that you want to add to your model.

Even if everything else is junk,

or perhaps only making things a lot worse in terms of predictions.

You can always just set the weight equal to zero on that additional feature.

And then you'd have exactly the same model you had, which is one fewer feature.

And so, the training error would be exactly the same.

So, we know the training error will never increase.

And the other thing that we mentioned is that eventually, the training error will

be equivalent to that of all subsets when you're using all of your features,

and perhaps, before that as well.

Well, when do we stop this greedy procedure?

Just to really hammer this home, I'll just ask the question again.

Do we stop when the training error is low enough?

No.

Remember that training error will decrease as we keep adding model complexity.

So, as we keep adding more and more features, our training error will go down.

What about when test error's low enough?

Again, no.

Remember that we never touch our test data when we're searching over model

complexity, or choosing any kind of tuning parameter.

Again, what we're gonna do is use our validation set or

cross validation to choose when to stop this greedy procedure.

[MUSIC]