If I'm able to look at those points and explain why they're not well fit,

then I have typically learned something that I can

incorporate in a subsequent iteration of the regression model.

Now if that all sounded a little bit abstract,

I've got an example to show you right now.

So here's another data set that lends itself to a regression analysis.

And in this data set I've got two variables.

The outcome variable, or the y variable, is the fuel economy of a car.

And to be more precise,

it's the fuel economy as measured by gallons per thousand miles in the city.

So let's say you live in the city and

you only drive in the city, how many gallons are you going to have to put

in the tank to be able to drive your car 1,000 miles over some course of time?

That's the outcome variable.

Clearly the more gallons you have to put in the tank,

the less fuel efficient the vehicle is.

That's the idea.

Now we might want to create a predictive model for

fuel economy as a function of the weight of the car.

And so here I've got an X variable as weight.

And I'm going to look for the relationship between the weight of a car and

its fuel economy.

We collect a set of data.

That's what you can see in the scatter plot,

the bottom left-hand graph on this slide.

And each point is a car.

And for each car, we've found its weight, we've found its fuel economy,

and we've plotted the variables against one another.

And we have run a regression through those points

using the method of least squares.

And that regression gives us a way of predicting the fuel economy of a vehicle

of any given weight.
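As a rough sketch of what that fitting step looks like in code, assuming made-up weights and gallons-per-1,000-miles values rather than the actual data set from the slide:

```python
import numpy as np

# Hypothetical data: weight in thousands of pounds (x) and
# fuel economy in gallons per 1,000 city miles (y).
weight = np.array([2.0, 2.5, 3.0, 3.5, 4.0, 4.5])
gallons = np.array([38.0, 44.0, 50.0, 55.0, 62.0, 67.0])

# Least-squares fit of a straight line: gallons ~ intercept + slope * weight.
slope, intercept = np.polyfit(weight, gallons, deg=1)

# The fitted line lets us predict the fuel economy of a vehicle
# of any given weight, e.g. a 3,200-lb car:
predicted = intercept + slope * 3.2
```

The positive slope captures the basic relationship: heavier cars need more gallons per 1,000 miles.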

Now why might you want to do that?

Well, one of the things that many vehicle manufacturers are thinking about these

days, is creating more fuel efficient vehicles.

And one approach to doing that is to actually change the materials that

vehicles are manufactured from.

So for example, they might be moving from steel to aluminum.

Well, that will reduce the weight of the vehicle.

Well, if the vehicle's weight is reduced,

I wonder how that will impact the fuel economy?

And so that's the sort of question that we'd be able to start addressing through

such a model.

So that's a setup for this problem, but

I want to show you why looking at the residuals can be such a useful thing.

So when I look at the residuals from this particular regression, I notice one of

the residuals in particular; in fact, it's the biggest residual in the whole data set.

And that's the point that I have identified in red on the scatter plot.

And it is the biggest residual, it's a big positive residual.

Which means that the reality is, that this particular vehicle

needs a lot more gas going in the tank than the regression model would predict.

The regression model would predict the value on the line.

The red data point is the actual observed value.

It's above the line, so it's less fuel efficient than the model predicts.

It needs more gas to go in the tank than the model predicts.
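The drill-down itself is mechanical: compute each car's residual (observed minus predicted) and look up the car with the largest one. Here's a sketch using hypothetical car names and numbers, not the real data set:

```python
import numpy as np

# Hypothetical cars: name, weight (1,000s of lbs), gallons per 1,000 city miles.
cars = ["Civic", "Accord", "Explorer", "RX-7"]
weight = np.array([2.3, 3.0, 4.2, 2.9])
gallons = np.array([40.0, 49.0, 64.0, 59.0])  # the RX-7 burns far more than its weight suggests

slope, intercept = np.polyfit(weight, gallons, deg=1)
predicted = intercept + slope * weight

# Residual = observed - predicted; a big positive residual means the car
# needs more gas in the tank than the regression line predicts.
residuals = gallons - predicted
worst = cars[int(np.argmax(residuals))]
print(worst)
```

With these made-up numbers, the point sitting well above the line is the one flagged by `argmax`.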

So is there anything special about that vehicle?

Well, at that point I go back to the underlying data set and I drill down.

So, when I see big residuals, I'm going to drill down on those residuals.

And drilling down on this residual actually identifies the vehicle.

And the vehicle turns out to be something called a Mazda RX-7.

And this particular vehicle is somewhat unusual,

because it had what's termed a rotary engine,

which is a different sort of engine than in any other vehicle in this data set.

Every other vehicle had a standard engine, but the Mazda RX-7 had a rotary engine.

And that actually explains why its fuel economy is bad in the city.

And so by drilling down on the point, by looking at the residuals,

I've identified a feature that I hadn't originally incorporated into the model.

And that would be the type of engine.

And so, the residual and the exploration of the residual has

generated a new question for me that I didn't have prior to the analysis.

And that question is,

I wonder how the type of engine impacts the fuel economy as well?

So that's one of the outcomes of a regression that can be very, very useful.

It's not the regression model directly talking to you.

It's the deviations from the underlying model that can sometimes be the most

insightful part of the model itself or the modeling process.

So remember in one of the other modules,

I talked about the benefits of modeling.

And one of them is serendipitous outcomes, things that you find that you hadn't

expected at the beginning, and I would put this up there as an example of that.

By exploring the residuals carefully, I've learned something new,

something that I hadn't anticipated.

And I might be able to subsequently improve my model by incorporating this

idea of type of engine into the model itself.
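One common way to incorporate engine type, sketched here with the same hypothetical numbers as before, is to add a 0/1 indicator (dummy) variable for "rotary engine" alongside weight and fit both coefficients by least squares:

```python
import numpy as np

# Hypothetical data: weight (1,000s of lbs), rotary-engine indicator,
# and gallons per 1,000 city miles.
weight = np.array([2.3, 3.0, 4.2, 2.9, 3.5])
rotary = np.array([0, 0, 0, 1, 0])  # 1 only for the rotary-engine car
gallons = np.array([40.0, 49.0, 64.0, 59.0, 55.0])

# Design matrix: an intercept column, weight, and the engine-type dummy.
X = np.column_stack([np.ones_like(weight), weight, rotary])
coef, *_ = np.linalg.lstsq(X, gallons, rcond=None)
intercept, weight_coef, rotary_penalty = coef
```

The `rotary_penalty` coefficient estimates how many extra gallons per 1,000 miles a rotary engine costs, over and above what the car's weight already predicts.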

So the residuals are an important part of a regression model.