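This course introduces simple and multiple linear regression models. These models allow you to assess the relationship between variables in a data set and a continuous response variable. Is there a relationship between the physical attractiveness of a professor and their student evaluation scores? Can we predict the test score for a child based on certain characteristics of his or her mother? In this course, you will learn the fundamental theory behind linear regression and, through data examples analyzed with the free statistical software R and RStudio, learn to fit, examine, and utilize regression models to examine relationships between multiple variables.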

A course from Duke University

Linear Regression and Modeling

From this lesson

More about Linear Regression

Welcome to week 2! In this week, we will look at outliers, inference in linear regression and variability partitioning. Please use this week to strengthen your understanding on linear regression. Don't forget to post your questions, concerns and suggestions in the discussion forum!

- Mine Çetinkaya-Rundel, Associate Professor of the Practice

Department of Statistical Science

In addition to modeling and prediction we can

also use linear regression models to do inference.

In this video we're going to talk about hypothesis testing for

the significance of a predictor and

confidence interval for the slope estimate.

We're also going to talk a little more

about conditions for regression with respect to what additional

conditions may need to be satisfied if we want to

be able to do inference based on these data.

In 1966, Cyril Burt published a paper

called "The Genetic Determination of Differences in Intelligence:

A Study of Monozygotic Twins Reared Apart."

The data consist of IQ scores for an assumed random sample of 27 pairs of identical

twins, one twin raised by foster parents, the other by the biological parents.

Later on, this study actually got a lot of criticism, with claims that the

data may have been either non-random or non-representative or entirely falsified.

Regardless, for this example we're going to be working with the

original data set from the paper that was published back in 1966.

In the scatter plot we can see the

relationship between the foster twin's IQ and the biological

twin's IQ, and we can see that as one goes up the other one goes up as well.

We have a positive and relatively

strong relationship with a correlation coefficient of 0.882.

The results of this study can be summarized

using a regression output that looks something like this.

So we have the estimate for the intercept as well as the slope here.

We can, based on this, write our linear model as the predicted IQ score of the

foster twin is 9.2076 plus 0.9014 times the biological twin's IQ.

The 9.2 value is the intercept, and the 0.9 value is the slope here.
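As a small sketch, the fitted line can be written as a Python function. The coefficients are the ones quoted above; the function name `predict_foster_iq` and the example IQ values are hypothetical, chosen only for illustration.

```python
# Coefficients as quoted from the lecture's regression output.
INTERCEPT = 9.2076
SLOPE = 0.9014


def predict_foster_iq(biological_iq: float) -> float:
    """Predicted foster twin IQ from the fitted line (hypothetical helper)."""
    return INTERCEPT + SLOPE * biological_iq


if __name__ == "__main__":
    # Illustrative inputs only; these are not observations from the data set.
    for iq in (90, 100, 110):
        print(f"biological IQ {iq}: predicted foster IQ {predict_foster_iq(iq):.2f}")
```

Raising the biological twin's IQ by one point raises the prediction by the slope, which is how the 0.9 value is read.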

To assess the fit of the model, we can also take a look at our R squared.

And R squared is 0.78, meaning that 78% of the

variability in foster twins' IQs can be explained by the biological twins' IQs.
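With a single predictor, R squared is the square of the correlation coefficient, so the two figures quoted in the lecture can be checked against each other:

```latex
\[
R^2 = r^2 = (0.882)^2 \approx 0.778 \approx 78\%
\]
```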

Within the framework of inference for regression we're going

to be doing a hypothesis test on the slope.

The overall question we want to answer is, is

the explanatory variable a significant predictor of the response variable?

The null hypothesis as usual says there's nothing going on.

In other words, the explanatory variable is not

a significant predictor of the response variable, i.e.,

there's no relationship, or the slope of the relationship is 0.

The alternative hypothesis says that there is something going on, that

the explanatory variable is a significant predictor of the response variable.

In other words, there is a relationship between these two

variables and the slope of the relationship is different from 0.

So in notation, our null hypothesis says that beta one is equal to 0, remember

beta one was the population parameter for the slope, and

the alternative hypothesis says that beta one is not equal to 0.
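Written out in symbols, the two hypotheses described above are:

```latex
\[
H_0 : \beta_1 = 0 \qquad \text{versus} \qquad H_A : \beta_1 \neq 0
\]
```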

And how do we go about actually going through this test?

In linear regression we always use a t-statistic for

inference, and remember that a t-statistic looks like this.

It's a point estimate minus a null value, divided by a standard error.

In this case, our point estimate is simply our slope estimate, b1.

And our standard error is the standard error of this estimate.

So the t-statistic for the slope can be summarized to

b1 minus 0, remember in the null hypothesis, we had set

the beta one equal to 0, which means no relationship

or a horizontal line, divided by the standard error of b1.

And whenever we have a t-score, we also

have a degrees of freedom associated with it.

And in this case the degrees of freedom is n minus 2.
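In symbols, with $SE_{b_1}$ used here as shorthand for the standard error of the slope estimate, the test statistic just described is:

```latex
\[
T = \frac{b_1 - 0}{SE_{b_1}}, \qquad df = n - 2
\]
```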

Let's pause for a moment and think about why the degrees of freedom is n minus 2.

We haven't really seen that before.

In the past we've seen for a t-statistic the degrees of freedom equaling n minus 1.

Remember, that with the degrees of freedom, we always lose

a degree of freedom for each parameter that we estimate.

And when we fit a linear regression, even if you're only interested

in the slope, you always also end up estimating an intercept as well.

And since we're estimating both an intercept and

a slope, we're losing two degrees of freedom, and

that's why in linear regression, the degrees of

freedom associated with a t-score is n minus 2.

For calculating the test statistic, we are actually

going to make use of the regression output and

then kind of show you guys that we didn't have to do any hand calculations at all.

So the t-statistic we said is our point

estimate, so that is 0.9014 for the point estimate

for the slope, minus 0, the null value,

divided by the standard error of the point estimate.

And we can simply grab that from the regression output as well.

We're not going to be asking you guys to be calculating any of this by hand.

You should know how the regression output works, and

that's why we're going through the calculation of the t-score.

But you're never going to be asked to

calculate the standard error of the slope by hand.

It is simply a tedious task that can be

error prone and we usually use computation for it.

But it is important to understand what that standard error

means and how the mechanics of the regression output work.

If we do the math here, we're actually going to get a 9.36 for

our t-score, and that's simply the value

that's already given on the regression output anyway.

The degrees of freedom is 27 twin pairs minus 2, which is 25, and the p-value is going to be the

area under the t curve that's greater than 9.36 or less than negative 9.36.

Remember we had a two sided alternative hypothesis.

As you can imagine, this comes out to be a very small value: being 9.36 standard errors

from the null value is a really unusual

outcome, and therefore the p-value is approximately 0.

We can see that the p-value is given

as exactly 0 on the regression output, but note

that it's simply rounded, meaning that even at

four decimal places, the p-value still shows as 0.

The p-value is probably never exactly equal to

0, but it's a very, very small number.
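To see how small this p-value is, here is a rough sketch that integrates the tail of the t distribution numerically using only Python's standard library. It is not how the regression output was produced; the helper names, the Simpson's-rule integration and the integration cutoff are all assumptions made for illustration. The t-score of 9.36 and the 25 degrees of freedom are the values from the lecture.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def simpson(f, a: float, b: float, n: int = 20000) -> float:
    """Composite Simpson's rule on [a, b]; n must be even."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def two_sided_p_value(t_stat: float, df: int) -> float:
    """P(|T| >= |t_stat|): integrate one tail directly, then double it."""
    lower = abs(t_stat)
    # Assumption: beyond lower + 100 the density is negligible for this df.
    upper = lower + 100.0
    return 2 * simpson(lambda x: t_pdf(x, df), lower, upper)


n_pairs = 27          # twin pairs in the data set
df = n_pairs - 2      # one degree of freedom lost each for intercept and slope
p = two_sided_p_value(9.36, df)
print(f"df = {df}, two-sided p-value = {p:.3g}")
```

The tail is integrated directly, rather than computed as one minus the cumulative probability, so that a value this close to 0 is not lost to rounding error.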

Just like we can do hypothesis tests for

the slope, we can also do a confidence interval.

Remember, the confidence interval is always of the form

point estimate plus or minus a margin of error.

In this case, our point estimate is b1, and our margin of error

can be calculated as usual, as a critical value times a standard error.

We said that in linear regression, we always use

a t-score, so we're going to use a t-star

for our critical value, and the standard error of

the slope, we said, comes from the regression output.

Using these, we can calculate the 95% confidence interval for

the slope of the relationship between biological and foster twins' IQs.

The degrees of freedom we had said was 25, and what we want to do first is

to find the critical t-score associated with this

degrees of freedom, and the given confidence level.

To find the critical t-score, let's draw our curve and mark the middle

95% and note that each tail is now left with 2.5%, or 0.025.

So, the cutoff value, or the critical t-score, can be calculated using R

and the qt function, as qt(0.025, df = 25).

This is going to yield a negative value, negative roughly 2.06.

But note that for confidence intervals the critical

value that we use always needs to be positive.

So the t-star is going to be simply 2.06.

Our slope estimate, 0.9014, plus or minus the critical value, 2.06,

times the standard error, which also comes from the regression

output, gives us 0.7 to 1.1 as our confidence interval.
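The interval can be reproduced with a short standard-library Python sketch. Two things here are assumptions and not part of the lecture: the critical value is found by a hand-rolled numerical inversion of the t distribution (standing in for the qt call in R), and the standard error is back-computed as the slope divided by the t-score of 9.36, because the transcript never states the standard error itself.

```python
import math


def t_cdf(x: float, df: int, n: int = 2000) -> float:
    """CDF of Student's t: 0.5 plus Simpson's rule on the density from 0 to x."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(u: float) -> float:
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(u * u / df))

    h = x / n  # signed step, so a negative x yields a negative integral
    total = pdf(0.0) + pdf(x)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 + total * h / 3


def t_quantile(p: float, df: int) -> float:
    """Invert the CDF by bisection; assumes the answer lies in [-50, 50]."""
    lo, hi = -50.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


b1 = 0.9014       # slope estimate from the lecture
t_stat = 9.36     # t-score from the lecture
df = 27 - 2       # n minus 2
# Assumption: standard error back-computed from the slope and the t-score.
se_b1 = b1 / t_stat

t_star = abs(t_quantile(0.025, df))   # the critical value must be positive
margin = t_star * se_b1
ci = (b1 - margin, b1 + margin)
print(f"t* = {t_star:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

Taking the absolute value of the lower-tail quantile mirrors the step in the lecture where the negative cutoff is turned into a positive t-star.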

And what do these numbers mean?

How do we interpret this confidence interval?

Basically what this means is that we are 95% confident that for each additional

point on the biological twins' IQs, the foster twins' IQs are expected on average

to be higher by 0.7 to 1.1 points.

So, to recap, we said that we could do a hypothesis test for

the slope, using a t-statistic, where our point estimate is b1, and

we subtract from that the null value and divide by the standard error,

and the degrees of freedom associated with this test statistic is n minus 2.

To construct a confidence interval for the slope,

we simply take our slope estimate b1, and

add and subtract the margin of error, that's

composed of a critical t-score and a standard error.

Note that the null value is often 0, since we usually

check for any relationship between the explanatory and the response variables.

And also note that the regression output gives us

b1, the estimate for the slope, the standard error for

that estimate, and the two tailed p-value for the

t-test for the slope, where the null value is 0.

So if this is the standard test that you are

trying to do, you shouldn't have to do any hand calculations

and should simply be able to make your decision on

the p-value that is given to you on the regression output.

We didn't really talk about inference for the intercept here.

We've been focusing on the slope because

inference on the intercept is rarely done.

Earlier we said that in some cases,

the intercept is actually not very informative.

And usually when we fit a model, we want to

evaluate the relationship between the variables involved in the model.

And the parameter that tells us about the relationship

between those variables is the slope, not the intercept.

So we're going to focus our inference for regression

on the slope and not really worry about the intercept.

Before we wrap up, a few points of caution.

Always be aware of the type of data you're working with.

Is it a random sample, a non-random sample, or population data?

Statistical inference and the resulting p-value are

completely meaningless if you already have population data.

So, we usually use statistical inference when we have a

sample, and we want to say something about the unknown population.

If you have a sample that is non-random, so it's biased in some way, note

that the results that arise from that sample are going to be unreliable as well.

And lastly, remember that the ultimate goal is to

have independent observations to be able to do statistical inference.

And by now in the course, you should

know how to check for the independent observations.

Remember, we like random samples, we do like large

samples, but we don't want them to be too large relative to the population.

And we have that 10% rule that we check if we're sampling without replacement.