0:24

Remember if we include an intercept, the residuals have to sum to zero,

which means their mean is zero.

So if we want to take the variance of the residuals,

it's just the average of the squares.

So the sum of the squared residuals,

times one over n, is an estimate of sigma squared,

the variation around the regression line,

that is, the true population variation around the regression line.

0:48

Now most people use n minus two, instead of n.

So it's not the average squared residual,

it's kind of like the average squared residual.

And for large n, the difference between one over n minus two and

one over n is irrelevant.

But for small n, it can make a difference.

The way to think about that is, remember,

if we include the intercept the residuals have to sum to zero.

So, that puts a constraint.

If you know n minus one of them, then, you know the nth.

Well, if you have a line term in there, if you have a covariate in there, then

that puts a second constraint on the residuals.

So, you lose two degrees of freedom.

If you put another regression variable in there, you have another constraint,

you lose three degrees of freedom.

So in that sense it's sort of like saying you really don't have n residuals,

you have n minus two of them,

because if you knew n minus two of them you could figure out the last two.

And that's why it's one over n minus 2.
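The two constraints can be checked numerically. What follows is a minimal sketch in plain Python (the lecture itself works in R), using toy data made up for illustration rather than the diamond data. All variable names are assumptions of this sketch.

```python
# Toy data invented for illustration; not the diamond data from the lecture.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares fit of y = b0 + b1 * x.
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Constraint from the intercept: the residuals sum to zero.
sum_resid = sum(residuals)
# Constraint from the slope: the residuals times x also sum to zero.
sum_resid_x = sum(e * xi for e, xi in zip(residuals, x))

rss = sum(e ** 2 for e in residuals)
var_over_n = rss / n                 # plain average squared residual
var_over_n_minus_2 = rss / (n - 2)   # divides by the n - 2 free residuals

print(f"sum of residuals:     {sum_resid:.2e}")
print(f"sum of residuals * x: {sum_resid_x:.2e}")
print(f"RSS / n:              {var_over_n:.4f}")
print(f"RSS / (n - 2):        {var_over_n_minus_2:.4f}")
```

Both sums come out as zero up to floating point rounding, which is why only n minus two of the residuals are free. With six points the two variance estimates differ noticeably; with thousands of points they would be nearly the same.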

So let me show you how you can grab the residual variation

out of your lm fit and assign it to a variable.

This way, if you need to work with it in an R program, you can

actually grab the number, not just see it on the printout.

So here I've defined my y and my x, and I've defined my

fit as the regression model with y as the outcome and x as the predictor.

Well if you just do summary of fit and you don't do anything else,

you just hit return, it'll print out the summary of the regression model.

Intercepts, slopes, estimated values, and so on, and you'll see the residual

standard deviation estimate among the elements in the printout.

However, if you want to grab it as an object that you can assign to something,

just put dollar sign sigma after summary of fit, that is, summary(fit)$sigma.

Then you can assign sigma to any other variable,

in case you're using it in a program in some other way.

This works out in this particular example to be 31.84 dollars.

2:34

So here, let's just confirm that I'm not lying to you and that the formula works.

So if I do resid(fit), that grabs the residuals.

If I square it, it squares them.

If I sum it, it adds up the squared values.

If I divide by n minus two, it's like averaging over the n minus two free residuals.

And then if I square root it, we get 31.84 so I wasn't lying.
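The same sequence of steps (residuals, square, sum, divide by n minus two, square root) can be sketched in plain Python. This is an illustration, not the R code used in the lecture: the helper name `residual_sd` and the four data points are made up, so the result is not the 31.84 from the diamond data.

```python
import math


def residual_sd(x, y):
    """Residual standard deviation of a simple least squares line fit.

    Hypothetical helper for illustration; it mirrors the steps described
    in the lecture rather than any particular library function.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Step 1: grab the residuals (the role of resid(fit) in the lecture).
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Step 2 and 3: square them and add them up.
    rss = sum(e ** 2 for e in residuals)
    # Step 4 and 5: divide by n - 2, then take the square root.
    return math.sqrt(rss / (n - 2))


# Made-up data: the fitted line is 1.3 + 0.8 * x and the residual
# sum of squares is 1.8, so the result is sqrt(1.8 / 2).
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 2.0, 4.0]
sigma = residual_sd(x, y)
print(f"residual standard deviation: {sigma:.4f}")
```

Returning the value from a function is the Python counterpart of grabbing the number from the summary instead of reading it off the printout.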

2:58

Now let's go back to this plot where we look at the total variability in diamond

prices.

And then compare what happens to the variability when we explain

some of that variability with a regression line.

So the total variability is just the average squared deviation of my data

around its mean,

around the center.

And just to make things easy, let's forget about the denominator and

just talk about the sum of the squared deviations.

3:40

Then there's the variability in the response that's explained

by the regression line.

Take the fitted values on the line and look at how much they deviate around the average.

So that's the regression variability.

Then everything that's left over is that variation around the regression line, and

that's the residual variability.

And the interesting identity, and

it kind of makes sense that this would be the case, is that this total variability,

the variability in diamond prices disregarding everything except for

where they're centered,

is equal to the regression variability,

that is the variability explained by the model, plus the residual variability that's left over around the line.

4:23

Because the residual variation and

the regression model variation add up to the total variation.
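The identity is easy to confirm on a small example. This is a sketch in plain Python with made-up numbers (not the diamond data), and the names `total_ss`, `regression_ss`, and `residual_ss` are labels chosen here for the three sums of squares.

```python
# Made-up data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares line.
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * xi for xi in x]

# Total variability: data around its mean.
total_ss = sum((yi - y_bar) ** 2 for yi in y)
# Regression variability: fitted values around the mean.
regression_ss = sum((fi - y_bar) ** 2 for fi in fitted)
# Residual variability: data around the fitted line.
residual_ss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

print(f"total      = {total_ss:.4f}")
print(f"regression = {regression_ss:.4f}")
print(f"residual   = {residual_ss:.4f}")
print(f"regression + residual = {regression_ss + residual_ss:.4f}")
```

The decomposition relies on the line being the least squares fit with an intercept; for an arbitrary line the two pieces would not generally add up to the total.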

We can define a quantity that represents the percentage of the total variation

that's represented by the model.

Simply take the regression variation and divide it by the total variation.

That quantity is called R squared.

So R squared for our diamond example, is the percentage of the variation

in diamond price, that is explained by the regression relationship with mass.

4:57

Just to remind you, R squared is the percentage of the variation in

the response explained by the linear relationship with the predictor.

R squared has to be between zero and

one because the regression sum of squares and the error sum of squares

add up to the total sum of squares, and none of them can be negative.

So that forces R squared to be between zero and one.

If we define R as the sample correlation between the predictor and the outcome,

then R squared is literally that sample correlation R, squared.
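Both descriptions of R squared can be compared directly. The following is a small sketch in plain Python with made-up data; it computes R squared as regression variation over total variation, then separately computes the sample correlation and squares it.

```python
import math

# Made-up data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Route 1: regression variation divided by total variation.
slope = sxy / sxx
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * xi for xi in x]
regression_ss = sum((fi - y_bar) ** 2 for fi in fitted)
r_squared = regression_ss / syy

# Route 2: the sample correlation between x and y, squared.
r = sxy / math.sqrt(sxx * syy)

print(f"R squared from sums of squares: {r_squared:.4f}")
print(f"sample correlation squared:     {r ** 2:.4f}")
```

The match between the two holds for a single predictor with an intercept, which is the setting of this lecture.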

5:31

R squared can be a misleading summary of model fit.

For example, if you have somewhat noisy data and delete

a lot of the points in the middle, you can get a much higher R squared.

Or if you just add arbitrary regression variables into a linear model fit,

you increase R squared and

you decrease the mean squared error, or the average squared residual variation.

So these things have to be kept

in mind if you're using them to assess model fit.
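The point about deleting the middle of the data can be seen in a constructed example. This sketch in plain Python uses an artificial data set (a line with noise that alternates between plus two and minus two), so it shows one case where R squared rises, not a general rule. The helper name `r_squared` is an assumption of this sketch.

```python
def r_squared(x, y):
    """R squared of a simple least squares line fit (illustrative helper)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)


# Artificial data: y = x plus noise of +2 at even x and -2 at odd x.
x_full = list(range(11))
y_full = [xi + (2 if xi % 2 == 0 else -2) for xi in x_full]

# Delete the middle: keep only the points with x <= 1 or x >= 9.
kept = [(xi, yi) for xi, yi in zip(x_full, y_full) if xi <= 1 or xi >= 9]
x_trim = [xi for xi, _ in kept]
y_trim = [yi for _, yi in kept]

r2_full = r_squared(x_full, y_full)
r2_trim = r_squared(x_trim, y_trim)

print(f"R squared, all 11 points:       {r2_full:.3f}")
print(f"R squared, middle points deleted: {r2_trim:.3f}")
```

The noise is the same size in both fits. Only the spread of the x values among the kept points changed, and yet R squared goes up.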

5:58

Anscombe created a particularly stark example of a bunch of data sets with

an equivalent R squared, equivalent mean, and variances in the x's and the y's.

And identical regression relationships, but when you look at the scatter clouds,

you can see that the fit has very different meanings in each of the cases.
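To see how close the summaries are, here is a sketch in plain Python using the first two of Anscombe's four data sets. The numbers were typed in by hand from the published quartet, so treat them as an assumption and check them against your own copy (for example the anscombe data frame that ships with R). The helper name `fit_summary` is made up for this sketch.

```python
# First two of Anscombe's data sets; they share the same x values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def fit_summary(x, y):
    """Mean, variance, fitted line, and R squared for one data set."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_y": y_bar,
        "var_y": syy / (n - 1),
        "intercept": y_bar - slope * x_bar,
        "slope": slope,
        "r_squared": sxy ** 2 / (sxx * syy),
    }


s1 = fit_summary(x, y1)
s2 = fit_summary(x, y2)

for name, s in (("set 1", s1), ("set 2", s2)):
    print(name, {key: round(value, 3) for key, value in s.items()})
```

The two printed summaries agree to about two decimal places, even though a scatterplot shows a roughly linear cloud for the first set and a clear curve for the second.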

So, let's look at the output from that Anscombe example and see what it shows.

And here it is, the four data sets.

The first is a nice regression line, exactly sort of along the lines

of what we think of, when we think of just a slightly noisy x y relationship.

In the second one, there's clearly a missing term,

one that's needed to address some of this curvature in the data.