案例学习：预测房价

Loading...

來自 University of Washington 的課程

机器学习：回归

3813 個評分

案例学习：预测房价

從本節課中

Nearest Neighbors & Kernel Regression

Up to this point, we have focused on methods that fit parametric functions---like polynomials and hyperplanes---to the entire dataset. In this module, we instead turn our attention to a class of "nonparametric" methods. These methods allow the complexity of the model to increase as more data are observed, and result in fits that adapt locally to the observations. <p> We start by considering the simple and intuitive example of nonparametric methods, nearest neighbor regression: The prediction for a query point is based on the outputs of the most related observations in the training set. This approach is extremely simple, but can provide excellent predictions, especially for large datasets. You will deploy algorithms to search for the nearest neighbors and form predictions based on the discovered neighbors. Building on this idea, we turn to kernel regression. Instead of forming predictions based on a small set of neighboring observations, kernel regression uses all observations in the dataset, but the impact of these observations on the predicted value is weighted by their similarity to the query point. You will analyze the theoretical performance of these methods in the limit of infinite training data, and explore the scenarios in which these methods work well versus struggle. You will also implement these techniques and observe their practical behavior.

- Emily FoxAmazon Professor of Machine Learning

Statistics - Carlos GuestrinAmazon Professor of Machine Learning

Computer Science and Engineering

[MUSIC]

So this leads us to consider something called Weighted K-nearest Neighbors.

And what weighted k-nearest neighbors does, is it down weights the neighbors

that are further from the specific query point, or target point.

And the effect that has is as we're shifting from target point to target

point, when a neighbor jumps in or out of our set of nearest neighbors,

the effect of that isn't as significant because when I'm computing what my

prediction is at a given value.

I've already down weighted that neighbor so when it drops out,

it was not that signficant in forming my predicted value to start with.

So, in particular, when we talk about our predicted value,

we're gonna weight each one of our nearest neighbor observations

by sum amount that I'll call C subscript Q, for

looking at query point Q, and then I also have a nearest neighbor j,

indexing which nearest neighbor that weight is associated with and

the question is how do we wanna define these weights?

Well, if the distance between my query point and

its nearest neighbor is really large and we want this constant that we're

multiplying by to be very small because we wanna down weight that observation.

But in contrast, if the distance is very small,

meaning this house is very similar to my house, I want that constant to be larger.

So one simple choice is simply to take our

weights to be equal to 1 over the distance between these points.

So Let's say that C, Q, nearest neighbor, J,

is gonna be equal to one over the distance

between XJ and XQ.

So this will have the effect that when the distance is larger the weight is smaller.

When the distance is smaller the weight is larger.

More generally we can think of defining these weights

using something that's called a kernel where here in this light we're gonna

focus on the simplest case of what are called isotropic kernels.

And these kernels are just the function of the distance between

any point and the target point, the query X Q.

And in this figure we're showing a couple examples of commonly used kernels.

What we see is this kernel is defining how the weights are gonna decay, if at all,

as a function of the distance between a given point and our query point.

So as an example here,

one commonly used kernel is something called The Gaussian kernel.

And what we see is that this kernel takes the distance between XI and

XQ, looks at the square of that distance.

And then divides that by a parameter of lambda and

that lambda is going to define how quickly the function decays and

importantly, this whole thing that I described is exponentiated so

that all these weights are positive.

But I wanna emphasize that this parameter lambda that all of these kernels have,

that's why I've written kernel sub lambda, define

how quickly we're gonna have the weights decay as a function of this distance.

And another thing that's worth noting is the fact that this Gaussian kernel that

I've mentioned here Has the weights never go exactly to zero,

they just become very, very, very, very small as the distance increases, but for

the other kernels I've shown on this plot here.

There is once you get beyond this minus lambda to lambda region,

the weights go exactly to zero.

So what we just described was assuming that we were just looking at a single

input like square feet.

And thus our distance we just our standard

Euclidean distance of looking at the absolute value between the two inputs.

But more generally you can use the same kernels and higher dimensional spaces,

where, again you just compute the distance, however you've defined it in

your multi varied space, and plug that in to the kernel to define your weight.

[MUSIC]