A practical and example filled tour of simple and multiple regression techniques (linear, logistic, and Cox PH) for estimation, adjustment and prediction.

A course from Johns Hopkins University

Statistical Reasoning for Public Health 2: Regression Methods

From this lesson

Introduction and Module 1A: Simple Regression Methods

In this module, a unified structure for simple regression models will be presented, followed by detailed treatises and examples of both simple linear and logistic models.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

Welcome back.

In this section, we'll talk a little bit about how the computer estimates

the linear regression equation, given a set of data.

And we'll also deal with accounting for uncertainty in our slope and intercept

estimates via confidence interval creation and hypothesis testing.

So hopefully, you'll appreciate after this section that creating confidence intervals

for linear regression slopes means essentially creating confidence intervals

for mean differences.

And the approach is business as usual.

We take our estimated slope and add and subtract two, or

sometimes a little bit more than two, standard errors.

And if we want to get a p-value, the approach is the same as well.

We start by assuming the slope or the mean difference is zero, and then looking at

how far our result is from what we'd expect under that null hypothesis.

Similarly, creating a confidence interval for an intercept is akin to

creating a confidence interval for a single population mean and

follows the logic we used in Statistical Reasoning One.

So let's take a look at our arm circumference and

height example again to start.

So in the last section we showed the results from several simple linear

regression models, including this one with arm circumference and height.

And all we gave was the resulting estimated regression equation

based on these 150 data points that suggested that the mean arm

circumference was related to height via the following equation.

Take 2.7 and add 0.16 times the height of a group of children

to estimate the mean arm circumference for that group.
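
As a small illustration, the estimated equation can be written as a function. This is only a sketch: the function and constant names are made up for this example, and the coefficients are the rounded values quoted in the lecture.

```python
# Rounded coefficients quoted in the lecture for the arm circumference example.
INTERCEPT = 2.7   # estimated intercept (cm)
SLOPE = 0.16      # estimated mean difference in arm circumference per 1 cm of height


def estimated_mean_arm_circumference(height_cm):
    """Estimated mean arm circumference (cm) for children of the given height (cm)."""
    return INTERCEPT + SLOPE * height_cm


for height in (50, 60):
    print(height, round(estimated_mean_arm_circumference(height), 2))
```

The two printed means differ by 10 times the slope, which is the kind of comparison used later in this section.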

I got this from a computer package,

but how does the algorithm work to estimate this equation?

Well there must be some sort of algorithm that will always yield the same results

for the same data set, regardless of what computer package we use to estimate it.

So the algorithm to estimate the equation of lines is called

least squares estimation.

And the idea is to find the line that gets closest to all of the points in

the sample.

The line that estimates the means that

have the least variability around those estimates.

So how can we define closeness to multiple points?

Well, in regression, closeness is defined as the cumulative squared difference

between each point's observed y-value and the corresponding estimated mean,

y-hat for that point's x-value.

In other words, the squared distance between an observed y-value and

the estimated mean value for all points with the same value of x.

So each distance for each observed point in our data set can be estimated

by taking that point's value, for example, that child's value of arm circumference.

And subtracting the predicted mean of arm circumference for

children with the same height.

And so this distance on a scatter plot looks like

this vertical distance between each point and the mean for

children with that height value that is shown on the red regression line.

So the algorithm to actually estimate that regression line, again,

is called least squares,

because it minimizes the overall squared distance between all points and that line.

And so what the computer does is given the data, it chooses the values for

the intercept and the slope that minimize the cumulative distances squared.

So to spell out the quantity that beta naught hat and

beta 1 hat minimize: we take each point in our data set,

each child's arm circumference, and subtract

the predicted mean given by the regression equation.

We square that distance and add it up across all data points in the sample.

The algorithm chooses the values of the intercept and slope that minimize that

cumulative squared distance.
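
To make the cumulative squared distance concrete, here is a minimal sketch. The five (height, arm circumference) pairs are invented for illustration and are not the lecture's data, and the function name is hypothetical.

```python
# Hypothetical data for illustration only (not the Nepalese children sample).
heights = [50.0, 55.0, 60.0, 65.0, 70.0]    # cm
arm_circ = [10.9, 11.3, 12.4, 13.0, 14.1]   # cm


def sum_squared_distance(intercept, slope, xs, ys):
    """Add up the squared vertical distance from each point to the candidate line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Three candidate (intercept, slope) pairs; least squares prefers the smallest total.
candidates = [(2.7, 0.16), (2.0, 0.18), (4.0, 0.13)]
for intercept, slope in candidates:
    print(intercept, slope, round(sum_squared_distance(intercept, slope, heights, arm_circ), 4))

best = min(candidates, key=lambda c: sum_squared_distance(c[0], c[1], heights, arm_circ))
```

Among these three candidates, the first line has the smallest cumulative squared distance for this toy data. Least squares finds the best line over all possible intercepts and slopes, not just a short list.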

And so the algorithm doesn't have to keep trying different combinations of

beta naught and beta 1 until it finds the one that gives the minimum squared distance.

This can actually be done pretty easily using a calculus-based

approach to minimize this function, that is, to choose the values of beta naught hat and

beta 1 hat that minimize this function.

The end result of this minimization gives us what are sometimes called closed-form

equations.

Equations which we could use to solve for the optimal values of beta naught hat and

beta 1 hat in terms of the x and y values in our data set.
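
The lecture does not show the closed-form equations themselves. The sketch below uses the standard textbook form for simple linear regression (slope = sum of cross-products over sum of squared x deviations, intercept = mean of y minus slope times mean of x). The data and the function name are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) from the standard closed-form least squares equations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data for illustration only (not the lecture's sample).
heights = [50.0, 55.0, 60.0, 65.0, 70.0]
arm_circ = [10.9, 11.3, 12.4, 13.0, 14.1]

intercept_hat, slope_hat = least_squares_line(heights, arm_circ)
print(round(intercept_hat, 3), round(slope_hat, 3))
```

No searching over candidate lines is needed: the two estimates come directly from sums over the x and y values.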

But I would never expect anyone to do a regression by hand, in fact,

I've never done a regression by hand because the computations are arduous and

time consuming.

However, the equation is very cool and makes for a nice piece of apparel

as evidenced by the fact that I actually have it on my tie.

The end result, however, are estimates based on the data we have at hand, and

these are just estimates based on our single sample from our population.

So suppose we were to actually have different random samples from the same population,

for example, different random samples of 150 Nepalese children

from the same population of Nepalese children less than 12 months old.

We might get different estimates of beta naught and

beta 1 depending on the sample we used.

So in other words,

the values that minimize the cumulative squared distance for

different samples of the same size would likely be different across the samples.

So there's some sampling variability in these estimates.
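
A small simulation can illustrate this sampling variability. Everything about the population here is an assumption made for the sketch: a true line of 2.7 + 0.16 times height, heights spread uniformly from 45 to 75 cm, and normal scatter with a standard deviation of 1 cm. The function names are hypothetical.

```python
import random


def least_squares_line(xs, ys):
    """Return (intercept, slope) from the standard closed-form least squares equations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def simulate_slopes(n_samples=200, n_children=150, seed=0):
    """Draw repeated random samples from an assumed population and refit the line each time."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_samples):
        heights = [rng.uniform(45.0, 75.0) for _ in range(n_children)]
        # Assumed true relationship plus random scatter (illustrative values only).
        arm_circ = [2.7 + 0.16 * h + rng.gauss(0.0, 1.0) for h in heights]
        slopes.append(least_squares_line(heights, arm_circ)[1])
    return slopes


slopes = simulate_slopes()
print("smallest slope:", min(slopes))
print("largest slope:", max(slopes))
print("average slope:", sum(slopes) / len(slopes))
```

Each sample of 150 gives a slightly different slope, but the estimates cluster around the assumed true slope.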

So all regression coefficients, the intercept and

slope, have an associated standard error

that can help us make statements about the true relationship between the mean of

y and our x predictor based on a single sample.

So there is a true regression equation in the population that has a true slope and

a true intercept.

We can only estimate these quantities.

So just like we've done with everything else that we estimate,

we're ultimately going to have to deal with the uncertainty in these estimates.

So let's again go look at the estimated regression equation relating arm

circumference to height based on this one sample of 150 Nepalese children.

And again, here's our equation.

But actually the computer will give us the resulting estimated standard error for

our intercept and slope.

So for example, the slope was 0.16 and

the estimated standard error is 0.014.
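
The lecture leaves the standard error computation to the computer. As a hedged sketch, the usual textbook formulas for simple linear regression are shown below; whether a given package uses exactly this route is an assumption. The data and function name are hypothetical.

```python
import math


def regression_standard_errors(xs, ys):
    """Return (se_intercept, se_slope) using the usual simple linear regression formulas."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    # Residual variance: cumulative squared distance divided by n - 2.
    ssr = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sigma2 = ssr / (n - 2)
    se_slope = math.sqrt(sigma2 / sxx)
    se_intercept = math.sqrt(sigma2 * (1.0 / n + x_bar ** 2 / sxx))
    return se_intercept, se_slope


# Hypothetical data for illustration only (not the lecture's sample).
heights = [50.0, 55.0, 60.0, 65.0, 70.0]
arm_circ = [10.9, 11.3, 12.4, 13.0, 14.1]

se_intercept, se_slope = regression_standard_errors(heights, arm_circ)
print(round(se_intercept, 3), round(se_slope, 4))
```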

So remember, these slopes are ultimately mean differences, and

the intercepts are means.

And so the random sampling behavior of these estimated regression coefficients is

essentially the random sampling behavior of means and mean differences,

which we've already shown is generally normal from sample to sample and

centered at the true value that we're estimating.

So we can use the same ideas we used back in Statistical Reasoning One,

for creating 95% confidence intervals, for the true underlying

population level slopes and intercepts, and to get p-values.

So let's look at the estimated regression equation for

arm circumference and height in Nepali children.

So this slope here, 0.16, estimates the mean difference in arm circumference,

in centimeters, per one centimeter difference in height.

That was just the estimate.

If we actually create a confidence interval,

the approach is the same old same old.

We take our estimate, it's a mean difference, and

we add and subtract two standard errors.

And we get a confidence interval of 0.13 centimeter difference in arm

circumference to 0.19 centimeter difference in arm circumference,

per 1 centimeter difference in height.
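
The arithmetic behind this interval can be sketched as follows, using the slope and standard error quoted in the lecture. The function name is hypothetical, and the multiplier of 2 is the lecture's large-sample approximation.

```python
def approx_95_ci(estimate, std_error, multiplier=2.0):
    """Estimate plus or minus a multiple of its standard error."""
    margin = multiplier * std_error
    return estimate - margin, estimate + margin


slope_hat = 0.16   # estimated slope from the lecture (cm per cm of height)
se_slope = 0.014   # its estimated standard error from the lecture

lower, upper = approx_95_ci(slope_hat, se_slope)
print(round(lower, 2), round(upper, 2))
```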

We could also test whether the true population level association or

mean difference in arm circumference per unit difference in

height was zero or not.

So our null hypothesis is that this true population level mean difference or

the slope is zero and the alternative is that it's not zero.

So we'll do this the same way we've always done hypothesis testing,

we'll assume our sample comes from a population where the true slope is zero.

And then we'll measure how far our estimate is from zero

in standard error units.

And so if we do this, we get a slope that's 11.4

standard errors above what we'd expect to see under the null hypothesis.

So, translating this to a p-value means getting

the probability of being 11.4 or more standard errors away, either above or

below, from a mean of 0 on a standard normal curve.

And the p-value is very low.

We already knew it would come in at less than 0.05, if you think about it,

because the confidence interval for the slope did not include 0.

But it's quite low, it's less than 0.001.
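
Here is a minimal sketch of that calculation using only Python's standard library. The function name is hypothetical, and the two-sided tail area of the standard normal curve is obtained from the complementary error function.

```python
import math


def two_sided_normal_p(z):
    """Probability of being |z| or more standard errors from 0 on a standard normal curve."""
    return math.erfc(abs(z) / math.sqrt(2.0))


slope_hat = 0.16   # estimated slope from the lecture
se_slope = 0.014   # its estimated standard error from the lecture

z = (slope_hat - 0.0) / se_slope   # distance from the null value of 0 in standard errors
p_value = two_sided_normal_p(z)
print(round(z, 1), p_value < 0.001)
```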

So how could we write this up?

We could say something like: this research used simple linear regression to estimate

the magnitude of the association between arm circumference and height in Nepali

children less than 12 months old, using data on a random sample of 150.

A statistically significant positive association was found,

and we could put the p-value in parentheses.

The results estimate that two groups of such children who differ by one centimeter

in height, will differ on average by 0.16cm in arm circumference.

Perhaps I should clarify the direction: taller compared to less tall.

In other words, it's an increase in arm circumference, with an increase in height.

And a 95% confidence interval, which gives a range of possibilities for

the true mean difference in arm circumference per 1 unit difference in

height, in the entire population of such children,

goes from 0.13 centimeters to 0.19 centimeters.

What if I wanted to give an estimate and a 95% confidence interval for

the mean difference in arm circumference for

children 60 centimeters tall compared to children 50 centimeters tall?

Well, from the previous lecture section, we know

that this estimated mean difference can be expressed in terms of the slope by taking

the difference in our x value, which is 10 centimeters, or 10 units, and multiplying

it by the estimated mean difference in y per one unit difference in x.

So, the estimated mean difference in arm circumference per 1 unit difference in

height was 0.16 centimeters.

So, if the difference in height is 10 centimeters,

this would accrue to a cumulative difference of 1.6 centimeters on average.

But how do we actually get the standard error for

this mean difference for more than a one unit difference in our x value?

Well it turns out anything we do to our slope we do to the standard error.

So if our resulting comparison yields an estimate of 10 times the slope estimate,

we would take the standard error for the slope, and multiply it by 10.

So the standard error, in other words, the estimated standard error of 10 times

the slope, is equal to 10 times the standard error of the slope.

So the standard error of ten times beta one hat is ten times the standard error for

beta one hat of 0.014 centimeters.

And that turns out to be 0.14 centimeters.

So the 95% confidence interval for the mean difference in arm circumference for

these two groups of children who differ by 10 centimeters in height is

the estimated 1.6 centimeter difference in average arm circumference,

plus or minus 2 times that standard error of 0.14 centimeters.

And if you do this out, we get a confidence interval of

1.32 centimeters to 1.88 centimeters.
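
The scaling rule can be sketched in a few lines: multiply both the slope and its standard error by the difference in x, then build the interval as before. The function name is hypothetical; the numbers are the ones quoted in the lecture.

```python
def scaled_difference_ci(slope, se_slope, x_difference, multiplier=2.0):
    """Estimate and approximate 95% CI for the mean difference in y across x_difference units of x."""
    estimate = x_difference * slope
    std_error = x_difference * se_slope   # whatever we do to the slope, we do to its standard error
    return estimate, estimate - multiplier * std_error, estimate + multiplier * std_error


# Children 60 cm tall compared to children 50 cm tall: a 10 cm difference in height.
estimate, lower, upper = scaled_difference_ci(0.16, 0.014, 10)
print(round(estimate, 2), round(lower, 2), round(upper, 2))
```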

So that interval describes our uncertainty in the estimated mean difference in arm

circumference between two groups of children who

differ by 10 centimeters in height.

Recall our hemoglobin and

packed cell volume example, where the estimated regression line relating

mean hemoglobin level to packed cell volume was given by this equation.

The average hemoglobin level is equal to the intercept of

5.77 plus 0.2 times packed cell volume measured in percent.

So how are we going to compute a 95% confidence interval for this slope?

Well, this is exactly the same idea as we just saw.

But this sample was only 21 subjects.

So in order to get a confidence interval and p-value,

we're going to have to go slightly more than plus or

minus two standard errors to get our confidence interval.

And we'll have to compare our resulting difference between our estimate and

the null value, not to the standard normal curve, but to a t-distribution,

with n - 2, or 19, degrees of freedom.

And again, I'm not going to ask you to do this in a testing situation, or if I did,

I would give you this value.

The computer will handle this, but it's just nice to remember that in

smaller samples, we have to be a little more conservative.

So if we did this and we actually went to a t-distribution, or

let our computer do the work for us,

the number of standard errors required to get 95% coverage in the middle

of a t-distribution with 19 degrees of freedom is 2.09.

So, in order to get this confidence interval, we take the estimated

mean difference in hemoglobin per 1% difference in packed cell volume, and add and

subtract 2.09 times the estimated standard error of our slope, which is 0.046.

And we get a confidence interval that goes from 0.1 to 0.3 grams per

deciliter per 1% difference in packed cell volume.
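
The same interval arithmetic applies, only with the t-based multiplier of 2.09 quoted in the lecture in place of 2. This is a sketch with a hypothetical function name; the multiplier is taken as given rather than computed.

```python
def confidence_interval(estimate, std_error, multiplier):
    """Estimate plus or minus `multiplier` standard errors."""
    margin = multiplier * std_error
    return estimate - margin, estimate + margin


slope_hat = 0.2       # estimated slope (g/dL per 1% of packed cell volume)
se_slope = 0.046      # its estimated standard error
t_multiplier = 2.09   # value quoted in the lecture for 19 degrees of freedom

lower, upper = confidence_interval(slope_hat, se_slope, t_multiplier)
print(round(lower, 1), round(upper, 1))
```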

So notice that that confidence interval does not include 0.

So we already know this result will be statistically significant at

the 0.05 level.

However, if we wanted to get the p-value for testing the null,

that the true slope of packed cell volume in the population from which the sample

was taken is 0 versus the alternative that it's not 0.

We'll again assume the null is true, assume the true slope is zero,

that our sample comes from a population where there's

no association between hemoglobin and packed cell volume.

We look at how far our estimated slope of 0.2 is from 0 in

terms of standard errors, and we get something that's

4.35 standard errors above what we'd expect under the null.

So the resulting p-value is the probability of being 4.35 or

more standard errors above or below what we'd expect under the null,

but we get this by referring to a t-curve with 19 degrees of freedom.

Nevertheless, in this example, the p-value comes in very low at less than 0.001.
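
A statistics package would normally supply this p-value. As a standard-library sketch, the two-sided tail area of a t-curve can be approximated by numerically integrating the t density with Simpson's rule. The function names and the number of integration steps are choices made for this illustration.

```python
import math


def t_density(t, df):
    """Density of the t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1.0 + t * t / df) ** (-(df + 1) / 2)


def two_sided_t_p(t_stat, df, steps=2000):
    """Two-sided p-value: 1 minus the area between -|t_stat| and |t_stat| (Simpson's rule)."""
    upper = abs(t_stat)
    if upper == 0.0:
        return 1.0
    h = upper / steps   # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    area_zero_to_t = total * h / 3.0
    return 1.0 - 2.0 * area_zero_to_t


t_stat = 0.2 / 0.046   # estimated slope divided by its standard error
p_value = two_sided_t_p(t_stat, df=19)
print(round(t_stat, 2), p_value < 0.001)
```

Evaluating the same function at about 2.09 with 19 degrees of freedom gives a tail area close to 0.05, which matches the multiplier used for the 95% confidence interval.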

So, the estimated slope is 0.2 with a 95% CI from 0.10 to 0.30.

So how can we interpret these results?

We can say, based on a sample of 21 subjects, we estimated that

packed cell volume is positively associated with hemoglobin levels.

And we could put the p-value, less than 0.001, in parentheses if we wanted to.

We estimated that a one-percent increase in packed cell volume is associated

with a 0.2 grams per deciliter increase in hemoglobin on average.

Accounting for sampling variability,

this mean increase could be as small as 0.1 grams per deciliter or

as large as 0.3 grams per deciliter in the population of all such subjects.

So that brings in the confidence interval to express our uncertainty

in how much that mean difference

in hemoglobin is per one-percent difference in packed cell volume.

In other words, we estimated that the difference in hemoglobin levels

for two groups of subjects who differ by one percent in packed cell volume

is 0.2 grams per deciliter on average.

And accounting for sampling variability,

this mean difference could be as small as 0.1 grams per deciliter or

as large as 0.3 grams per deciliter in the population of all such persons.

So what about the intercepts?

So far I've shown you how to construct confidence intervals and

do hypothesis testing for the slope from linear regression and

multiples of the slope.

We can also create confidence intervals and get p-values for the intercept

in the same manner, although they won't always be that useful, and Stata and

other computer packages will present this in the output they give from regression.

However, as we've talked about, when x1 is a continuous predictor,

many times the intercept is just a placeholder and

does not describe a useful quantity or a quantity of relevance to our data.

As such, 95% confidence intervals are not always relevant.

However when our predictor is binary or

categorical, the intercept may have a substantive interpretation and

a 95% confidence interval, at least, may be of interest.

So let's take a look at an example of that.

So you recall our analysis that we did in Statistical Reasoning One, and

we just did as a linear regression in a previous section here, of length of stay

by age at first claim among the subjects from the Heritage Health Study.

And when we regressed average length of stay on an indicator of whether

the person was less than 40 at first stay or greater than or equal to 40,

we got a slope of -2.1 and an intercept of 4.9.

So we interpret the slope as the estimated mean difference in length of stay for

persons less than 40 at first claim,

compared to persons over 40 and that was -2.1 days.

The younger group had an average length of stay

of 2.1 days less than the older group.

And the intercept actually had meaning in this analysis.

It was the estimated mean length of stay for persons over 40 for

their first stay in 2011, their first claim.
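
A quick sketch shows why the intercept and slope take on these meanings when the predictor is a 0/1 indicator. The eight lengths of stay below are invented for illustration and are not the Heritage Health data; the function name is hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) from the standard closed-form least squares equations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical data: x = 1 if under 40 at first claim, x = 0 if 40 or older.
under_40 = [1, 1, 1, 1, 0, 0, 0, 0]
length_of_stay = [2, 3, 2, 4, 5, 4, 6, 5]   # days

intercept_hat, slope_hat = least_squares_line(under_40, length_of_stay)

older = [y for x, y in zip(under_40, length_of_stay) if x == 0]
younger = [y for x, y in zip(under_40, length_of_stay) if x == 1]
mean_older = sum(older) / len(older)
mean_younger = sum(younger) / len(younger)

print(intercept_hat, mean_older)                  # intercept equals the older group's mean
print(slope_hat, mean_younger - mean_older)       # slope equals the difference in means
```

With a binary predictor, the fitted intercept is the sample mean of the group coded 0, and the slope is the difference in sample means between the two groups.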

So we can get confidence intervals and p-values for both these quantities.

So the slope, we estimate the mean difference between the younger group and

the older group to be 2.1 days less for the younger group.

But after accounting for uncertainty, and I should've put these in the proper order.

After accounting for the uncertainty in your estimate,

this is the 95% confidence interval for

the true mean difference in length of stay, for all patients in 2011.

You can see it's rather tight because this was a large data set and

it indicates that it's on the order of two or more days.

If we did a hypothesis test of whether the true association was zero,

in other words there was no association between length of stay and

age at first claim, the p-value is quite low.

We know that it would come in at less than 0.05, because our 95% confidence

interval did not include zero, but this adds some specificity to the discussion.

If we did a confidence interval for the intercept, the estimated mean length of stay

for those who are over 40, for their first visit in 2011 was 4.9 days.

And this confidence interval has meaning.

It goes from 4.8 days to 5.0 days, and

it expresses our uncertainty in that estimated mean.

So we have a pretty strong, tight interval here that suggests the true

length of stay on average was close to 5 days.

Between 4.8 and 5 days for the population of patients that were over 40,

when they entered the hospital in 2011.

We could get a p-value for this, but it really doesn't make sense to test

whether the mean length of stay for this single group is zero or not.

Because we know it can't be zero,

given that our data set only includes persons whose length of stay was 1 or

greater, so a p-value doesn't really add anything to the story here.

So in summary, the construction of confidence intervals for

linear regression slopes is business as usual.

Take the estimate and add and subtract two estimated standard errors, or

slightly more in smaller samples.

And we can also get a p-value by taking our slope estimate and converting it to

the number of standard errors it is above or below the null value of zero.

And then figuring out what percentage of results we could get that were that far or

farther, just by chance, if the null is true.

So the confidence intervals we get for slopes and the resulting p-values

are confidence intervals and p-values for mean differences.

And the confidence intervals for intercepts are confidence intervals for

the mean of y for a specific group or

a specific population, the population whose x1 values are equal to zero.

And as we've discussed, this is not always relevant or helpful when x1 is continuous.

But it can add information to the analysis when our predictor is binary or categorical.