A practical and example-filled tour of simple and multiple regression techniques (linear, logistic, and Cox PH) for estimation, adjustment, and prediction.


Course from Johns Hopkins University

Statistical Reasoning for Public Health 2: Regression Methods

81 ratings


From this lesson

Module 3A: Multiple Regression Methods

This module extends linear and logistic methods to allow for the inclusion of multiple predictors in a single regression model.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

Greetings and welcome to lecture 6 section C.

In this section we'll talk about multiple linear regression.

We'll comment briefly on how researchers, when they're fitting regression models, decide what models to choose to present in terms of summarizing their research results.

And we'll also talk briefly about using these models to estimate outcomes for different groups in the population, based on the sample results.

So in this section we'll extend

the concept of least squares to estimation of multiple linear regression models.

Understand what the linearity assumption is as it applies to multiple

linear regression.

Explain different strategies for

picking a final multiple linear regression model among candidates.

Or how to pick final models, if you want to show the results of more than one multiple regression model using the same outcome but different combinations of predictors or potential predictors.

And then use the results of multiple linear regression model to

compare groups who differ with more than one predictor value and

estimate means for groups given their x values.

So let's first talk about least squares for multiple regression.

The algorithm to estimate the equation of the multiple linear regression line,

I've used the acronym MLR here just for

brevity is called the "least squares" estimation procedure.

And the idea is to find the line, or actually, as we'll talk about in more detail in a minute, when we have multiple predictors, more than one x, the object in space is multi-dimensional, like a plane or something with more than three dimensions.

The idea is to find the object that gets closest to all of the points

in the sample, when they're plotted in the number of dimensions for

the outcome and predictors they have.

So how do we define closeness to multiple points?

Well we can take these things that exist in multiple dimensions and

then reduce it to a single estimate by looking at the predicted mean for

each observation, given its multiple x input value.

So in regression, closeness is defined as the cumulative squared distance between each point's y-value and the corresponding value of y-hat, the estimated mean for that point's set of x values.

So in other words, the estimated mean for

all subjects with characteristics like the one we're comparing to the mean.

In other words, the squared distance between an observed y value and

the estimated mean y value for all points with the same values of x.

So the distance can be computed for each data point in the sample.

And then the algorithm that chooses the values of the intercept and the slopes is called least squares, and again it chooses the values that minimize the cumulative squared distance of each outcome value from its estimated mean for all observations like it, based on having the same x values.

So, the algorithmic approach is the same, the computer does it, but

with more than one predictor the linear regression model is no longer estimating

a line in two-dimensional space.

For example with two predictors, the shape being described by the regression

equation is a plane in 3 dimensional space.

And for more than two predictors, when we get into 4-, 5-, 6-, 8-, or 12-dimensional spaces, mere mortals like ourselves cannot visualize the resulting shape being estimated with one graphic.
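The least squares idea described here can be sketched numerically. Below is a minimal illustration with simulated data; the variable names and values are assumptions for the sketch, not data from the study discussed later:

```python
import numpy as np

# A minimal sketch of least squares for multiple linear regression (MLR)
# with two predictors; the data here are simulated for illustration only.
rng = np.random.default_rng(0)
n = 50
x1 = rng.uniform(50, 70, n)          # a continuous predictor (e.g. height, cm)
x2 = rng.uniform(3, 8, n)            # a second predictor (e.g. weight, kg)
y = 10 + 0.1 * x1 + 1.2 * x2 + rng.normal(0, 0.5, n)

# Design matrix: a column of 1s for the intercept plus the predictors.
X = np.column_stack([np.ones(n), x1, x2])

# Least squares picks the intercept and slopes that minimize the
# cumulative squared distance between each y and its estimated mean y-hat.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta
rss = np.sum((y - y_hat) ** 2)       # the quantity being minimized
```

Any other choice of intercept and slopes gives a larger sum of squared distances; with two predictors, the fitted `y_hat` values trace out a plane in three-dimensional space.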

So what does that mean, then, about the linearity assumption, which when we had one outcome and one predictor was relatively straightforward to explain and assess?

Well, where the linearity assumption, and the name linear regression, comes in, is that multiple regression makes the assumption that the adjusted relationship being estimated between y and each xi, if there are multiple x's, is linear.

In other words, the relationship between the mean of y and

each x, in a multiple model, is linear in nature.

This is an issue for continuous predictors; it's not for binary or multi-categorical predictors, because those essentially always have a linear relationship: with a binary predictor we're just connecting two means, and with multiple categories we're connecting the mean for the reference group to the respective means for each of the other categories.

Again, 2 points to find each line.

So, there's not an issue of whether or not the comparison is linear, because it has to be; there are only two points being compared by each slope.

But, for continuous predictors, this is the issue we have to contend with.

So, this can be assessed by looking at what are called adjusted scatter plots of each y, xi relationship.

Adjusted for all the other variables used in the multiple regression model.

And we won't cover that or do that in this class, but

if you take a course in fitting models and doing data analysis, that will be covered.

But it's an extension of the idea of looking at a y/x scatter plot and

making decisions about whether linearity is appropriate or not.
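The idea behind these adjusted scatter plots (often called added-variable plots) can be sketched: residualize both y and the predictor of interest on the other predictors, then look at the two sets of residuals plotted against each other. This is a minimal illustration with simulated data, not the class's procedure:

```python
import numpy as np

def residualize(v, Z):
    """Residuals from regressing v on the columns of Z by least squares."""
    coef, *_ = np.linalg.lstsq(Z, v, rcond=None)
    return v - Z @ coef

# Simulated outcome y and two predictors x1, x2 (illustration only).
rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(size=n)

# To assess linearity of the y/x1 relationship adjusted for x2:
Z = np.column_stack([np.ones(n), x2])   # intercept plus the other predictor(s)
ry = residualize(y, Z)                  # y with x2's influence removed
rx = residualize(x1, Z)                 # x1 with x2's influence removed

# Plotting ry against rx shows the adjusted relationship; a straight-line
# pattern supports the linearity assumption for x1.  The slope of ry on rx
# equals the multiple regression slope for x1.
adjusted_slope = np.sum(rx * ry) / np.sum(rx * rx)
```

The last comment is the Frisch-Waugh-Lovell result: the residual-on-residual slope reproduces the adjusted slope from the full multiple regression.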

So how do researchers go about choosing a final model?

In this class, we're more concerned with interpreting the results of what

others present in their research, but I just want to give some

thoughts on how researchers go ahead and choose what to present.

So when dealing with multiple regressions, the questions are which models to present for a given outcome, whether to present one final model or several models, and how to choose those models.

Well model building and selection is a combination of science, statistics, and

the research goals which is related to the science.

So just some general thoughts here.

If the goal is to maximize the precision of our adjusted comparisons of each slope

in the regression model, then the idea might be to keep only those

predictors that are statistically significant in the final model.

And of course, if you have a main predictor of interest, a main exposure, you would keep that in regardless of statistical significance.

But the idea, here, is that if we're trying to estimate associations as

precisely as possible, then things that don't add information statistically

about the outcome, after we've accounted for other things, are just going to steal

precision from the things that are truly associated in the population.

Because we need to estimate more slopes with the same amount of data.

So by trimming, if you will, the dead weight, the things that don't add information statistically, we'll get more precise estimates of those things that do.

If the goal is to present research results comparable to results of similar analyses

presented by other researchers on similar or different populations.

Like, for example, if we wanted to look at the association

between the anthropometric measures in US children, and

we wanted to compare it to the findings of the researchers in Nepal and

they had presented a final analysis that included weight, height,

and sex as predictors then we would want to do the same even

if some of those were not significant in our analysis.

So in order to make comparisons between our findings adjusted for the same things, we want to be sure to include the same predictors that the other researchers did, regardless of the statistical significance in our results.

And we could comment about the difference in statistical significance that we found, if in fact ours were not significant, and tie that into the power of the study, etc.,

and make conclusions about potential differences in

growth patterns in infants in the US versus Nepal, etc.

If the goal is to show what happens to the magnitude of an association with different levels of adjustment: for example, if we're interested in one relationship, the relationship between BMI and the average amount of slow wave sleep somebody gets, which we'll look at in the next section.

But that association is potentially confounded by other variables.

A nice way to show the sensitivity of that estimate is to start with the unadjusted

estimate then show the results of that estimate adjusted for

various combinations of potential confounders.

To show how either robust it is, regardless of what is used to adjust, or

how much things change with different levels of adjustment.

If we want to see how well this mean model, used to estimate the mean as a function of multiple predictors, predicts for individuals from the same population who were not studied in our data set, well, that's a little more complicated, and we will discuss it briefly later in the course.

So let's talk about prediction.

You might call it prediction even though I just

said the models we're looking at are for estimation.

When I say prediction in this context we're going to look at estimating means

for different groups of a population based on the results from multiple

regression models.

So, recall the arm circumference results based on the sample of 150 Nepalese children less than 12 months old.

So, this is what model 3 looks like written out as an equation.

We estimate the mean.

This is the largest multiple regression model presented,

the one with the most predictors.

The mean arm circumference is estimated by taking 14.4, plus -0.17 times the children's height, plus 1.46 times the children's weight, plus 0.3 times the sex of the children (1 for female, 0 for male).

So if we want to estimate the mean arm circumference for female children who

are 62 cm tall and 5.6 kilograms in weight, what would that look like?

Well, in this case it would look like 14.4 + -.17(62) + 1.46(5.6) + .3.

And if we do this it's equal to 12.336 centimeters.

I'm just going to round that to 12.34 centimeters.
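The calculation just done can be written out as a small function; the function name and the coding of sex as 1 for female, 0 for male, are assumptions consistent with this section:

```python
# Model 3 as code: estimated mean arm circumference (cm) as a function of
# height (cm), weight (kg), and sex (1 = female, 0 = male).
def mean_arm_circumference(height_cm, weight_kg, female):
    return 14.4 + -0.17 * height_cm + 1.46 * weight_kg + 0.3 * female

# Female children, 62 cm tall, 5.6 kg in weight:
est = mean_arm_circumference(62, 5.6, female=1)
print(round(est, 2))  # 12.34
```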

So this result estimates that for this group of children,

this is the estimated mean arm circumference for this group.

Interestingly enough, the overall mean arm circumference for everyone,

ignoring weight, sex and height, is 12.4, so for this one prediction,

that's actually relatively close to what we would've predicted for this group,

had we ignored weight, height, and sex.

But for other groups, the estimated mean arm circumference will be very different given their weight, height, and sex. And this overall model had an r-squared of 78%.

This one includes weight, height, and sex, so the implication here is that while there's still variation of individual values around their height-, weight-, and sex-predicted arm circumference means, the individual values will generally be a lot closer to their height-, weight-, and sex-specific means than they would be to a single mean for everyone that ignored height, weight, and sex, that is, if we used the overall mean of the sample of 150 arm circumferences for everyone instead of the regression-specific mean.

And I wanted you to estimate the mean difference.

We already did this for the female children 62 cm tall, 5.6 kg in weight.

We want to compare them to male children 58 cm tall and 4.5 kg, and I'm just going to rewrite this out to make a point.

So the estimate we got for the females, if we wrote it out, was 14.4 + -.17(62) + 1.46(5.6) + .3; that was our 12.34.

If we do the same thing for the male group who's 58 cm tall and 4.5 kg, we get 14.4 + -.17(58) + 1.46(4.5), plus 0 for sex because they're male, so they're the reference group.

If we do this out it turns out to be 11.11 centimeters,

so this group mean, this is lower.

And if we take the difference in these two means, it turns out to be 1.23 cm.

And we actually don't have the tools at hand to create a confidence interval for

that, but if we were using a computer, it could do a confidence interval for

this estimated mean difference and for these estimated group means as well.

I want you to notice something, though, if we do this piece-wise, and actually look at the parts that cancel out and the parts that differ. If we were to subtract in pieces, the intercept cancels. When we take the difference due to the 4 centimeter difference in height, we're left with the slope for height times the difference in height, -.17 times 4. We're left with the slope for weight times the difference in weight, 1.46 times (5.6 minus 4.5). And we're left with the .3 for sex.

So if you actually look at this piece-wise and add the difference that's due to

the weight differences, the difference that's due to the height differences,

and the difference that's due to sex,

this actually turns out to be equal to 1.23 as well.

So, just showing that we could break this down into the components and

the slopes for each of the factors that differ between the two groups.
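This piece-wise decomposition can be checked in a few lines, using the slopes and group values from this section:

```python
# Each slope times the between-group difference in that predictor;
# the intercept cancels when the two group means are subtracted.
slopes = {"height": -0.17, "weight": 1.46, "female": 0.3}
females = {"height": 62, "weight": 5.6, "female": 1}   # 62 cm, 5.6 kg
males = {"height": 58, "weight": 4.5, "female": 0}     # 58 cm, 4.5 kg

diff = sum(slopes[k] * (females[k] - males[k]) for k in slopes)
print(round(diff, 2))  # 1.23
```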

So we've shown that example.

Now let's look at the regression for emergency department waiting times, and

we're going to focus on the adjusted model, and

we're going to estimate means for different groups,

knowing that there are real and statistically significant differences

in the means across different groups in this population based on their race,

sex, age, and payer types, but there's still a lot of individual variability

in waiting times around the means for each of those groups in the population.

So the results we get from this model do not predict well for any one individual's waiting time at all, but our mean estimates will be slightly better than if we used the overall waiting time in the sample for each of these groups.

So for this, to start,

I'd like you to estimate the mean waiting time for black males,

35 years old, with private insurance using the multiple linear regression results.

And actually, to make the approach easier for both of us,

I'm going to put the results here and tally what's going on inside.

So, I'm actually going to write the estimation in a vertical way to

add up the respective slopes of interest to the intercept.

So, we would start with the intercept which is 46.5.

And then, because this group was on private insurance, the reference group for payer type, we would add nothing when we estimate.

Because this group of black males is 35 years old,

they would be in the age category of 35 to 54 years, so

we would add 4.9 minutes to the average.

Because they identify as Black we would add 18 minutes to this computation, and because they are male we would add -1.1 minutes. And so if you do this addition, 46.5 plus 0 plus 4.9 plus 18.0 plus -1.1, the average waiting time for this group is 68.3 minutes.

And then I wanted you to estimate the mean difference in waiting times for

the black males 35 years old on private insurance, the one we just did.

I want you to compare that mean estimate of 68.3 minutes

to White females 30 years old with public insurance.

Now I'm just going to line these up again.

And I'm going to just rewrite very quickly the Black males who

are 35 years old on private insurance, the components.

So we have the intercept of 46.5, 0 for being on private insurance, 4.9 for being 35 to 54 years old, 18 extra minutes on the average wait times for being Black, and a reduction of 1.1 minutes for being male.

Now if we do the same for the females,

white females who are 30 years old on public insurance.

Let's look at how theirs plays out.

They still get the intercept 46.5.

They're on public insurance so that adds 3.3 minutes to their average.

They're 30 years old, so they're in the 20 to 34 year old category, which adds 5.2 minutes to the average.

They're White, so they're in the reference group and don't get anything for race. And they're female, so they're in the reference group and don't get anything for sex in terms of adding to the model.

So the estimated wait time average for this group of Black males is 68.3 minutes.

The estimated wait time average for

this group of white females is 55 minutes when you do out the math.

And so the difference in averages between these two groups is 13.3 minutes.

The first group, the Black males 35 years old with private insurance, have an average wait 13.3 minutes greater than the 30 year old White females on public insurance.

And if you actually go across, look at where the differences are, and just add those up: the -1.1 that the first group gets for being male; the 18 for being Black; the slightly smaller addition for being 35 to 54 years old compared to 20 to 34 years old, 4.9 minus 5.2; and then zero minus 3.3 for payer type. If you add up those things, you would also get 13.3.

So you could dissect this difference into its component parts on the different predictors on which these groups differ.
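The wait-time tally works the same way in code. A sketch, using the slopes read off in this section (the male coefficient is taken as -1.1 minutes, the value consistent with the stated group mean of 68.3 and difference of 13.3):

```python
# Estimated mean wait (minutes): intercept plus the slope for each
# category the group falls into; reference categories contribute 0.
def mean_wait(public_insurance, age_35_54, age_20_34, black, male):
    return (46.5
            + 3.3 * public_insurance   # public vs private (reference) payer
            + 4.9 * age_35_54          # 35-54 vs reference age category
            + 5.2 * age_20_34          # 20-34 vs reference age category
            + 18.0 * black             # Black vs White (reference)
            + -1.1 * male)             # male vs female (reference)

black_male = mean_wait(0, 1, 0, 1, 1)      # 35 years old, private insurance
white_female = mean_wait(1, 0, 1, 0, 0)    # 30 years old, public insurance
print(round(black_male, 1), round(white_female, 1),
      round(black_male - white_female, 1))  # 68.3 55.0 13.3
```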

So in summary, multiple regression results can be used to estimate mean outcomes for a given subset of a population, given the predictor values put into the model.

Multiple regression results can be used to estimate mean differences between groups in the population who differ by more than one characteristic or predictor.

And then confidence intervals for the estimated means and

the estimated mean differences, can be estimated using a computer, and

their interpretation is the same, but it's not something where we have an easy way to

estimate the standard error of these estimates by hand.

In the next set we'll look at the results of regressions from research articles and

we'll talk about how the authors report on how they chose their final regression

models to present and what their subsequent interpretations were.