A practical and example filled tour of simple and multiple regression techniques (linear, logistic, and Cox PH) for estimation, adjustment and prediction.

From the course by Johns Hopkins University

Statistical Reasoning for Public Health 2: Regression Methods

From the lesson

Module 3A: Multiple Regression Methods

This module extends linear and logistic methods to allow for the inclusion of multiple predictors in a single regression model.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

All right, welcome back. In this section we're going to look at some examples,

three examples of linear regression used in research articles and we're going to show how

the authors describe their approach to fitting

regression models and then look at the results in tabular format.

So hopefully this will give you an opportunity to interpret the results from

simple and multiple linear regression models presented in

several published journal articles.

So the first one is one we're already familiar with.

We looked at it several times in this course.

It's the academic physician salary study published in

the Journal of the American Medical Association in 2012.

And what they essentially did was take a survey of

academic physicians, and then they wanted to compare the average yearly

salary between male and female researchers. But because there could be

other differences between the male and female researchers

that may also be related to salary,

it was important to adjust for those differences

to properly estimate the salary differential between men and women.

So we've looked at these initial results several times.

They say the mean salary within our cohort was $167,669 annually,

U.S. dollars annually for women and $200,433 annually for men.

And this difference was on the order of $33,000 per year.

But they went on to adjust this for multiple things including specialty,

academic rank, leadership positions,

publications, and research time and they talked about doing this in

the final model and we'll see in a minute what they're

referring to is a multiple linear regression model.

They found that after adjustment there was still

a sizable and statistically significant difference

in average salary between males and females.

However, it was lower in magnitude than originally estimated when

ignoring these other factors that were ultimately adjusted for.

So let's look at the method section to get a sense of how

they use linear regression to do their analysis.

So I'll just read it to you, even though you're perfectly

capable of reading it, but I'll just highlight certain things here.

They start by saying they limited

the analytic sample to individuals with MD degrees who were still affiliated

with U.S. academic institutions and reported

their salary, after comparing those who reported salaries and those who had not.

They initially looked to see if there were any biases in the data on other factors, and then used

only those who reported salaries. Since essentially the study was

about salary, they couldn't use the information on people who didn't report.

They go on to say, we described characteristics of

this sample by gender and we've seen some examples of that in

previous lectures where they look at for example

the distribution of region of the country by male and female, or

the distribution of NIH funding tiers

by males and females with regards to the institution.

And then they go on to say, and then constructed

multiple variable linear regression models for

salary with the following respondent characteristics and they go on to tick off

a huge long list of things that they included

as other predictors above and beyond gender, essentially to adjust for these.

They go on to say,

most characteristics were categorical and

modeled as indicators with a reference category.

So we've talked about doing that.

It's just nice to see people referencing something we've learned about in the text.
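As a reminder of what indicator coding with a reference category looks like, here is a minimal sketch. The rank labels and the function name `indicator_columns` are made up for illustration and are not taken from the article.

```python
def indicator_columns(values, reference):
    """Build one 0/1 indicator per level, omitting the reference level."""
    levels = sorted(set(values) - {reference})
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in levels}

# Hypothetical academic ranks; "assistant" serves as the reference category.
ranks = ["assistant", "associate", "full", "assistant", "full"]
dummies = indicator_columns(ranks, reference="assistant")

for level, column in dummies.items():
    print(level, column)
```

Each slope on an indicator then compares that level to the reference level, which has zeros in every column.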

They go on to say,

we constructed both a full model using all these covariates,

that's another name for predictors.

They essentially used the overall regression model that included

gender and all the things listed in that previous slide.

Then they also fit a smaller,

more parsimonious model, where they say,

we iteratively deleted variables from the model based on improvement in

the Akaike information criterion

using both forward stepwise and backward elimination approaches.

We have not talked about this Akaike information criterion, sometimes called the

AIC, but it's a very similar process to what I was

referencing in the previous section, where the researchers would go and pare things

down iteratively by taking out things that were not statistically significant,

one at a time: refitting the model, then removing, with some kind of ordering,

the next variable that is not statistically significant, until they

got down to a subset of predictors that were all statistically significant.

The forward approach would start with gender and then add in

successively more variables to see if they were statistically significant; if they weren't,

throw them out of the pool of

potential covariates, of potential predictors, and try another one.

So the goal is to pare the model down to

only those predictors that are adding information statistically.
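To make the backward elimination idea concrete, here is a rough sketch on simulated data. It is not the authors' procedure or data: the variable names, the simulated effects, and the helper functions are all assumptions for illustration, and the AIC is the usual Gaussian form n·log(RSS/n) + 2k, up to an additive constant.

```python
import math
import random

def fit_rss(cols, y):
    """Least-squares fit via the normal equations; returns the residual sum of squares."""
    k, n = len(cols), len(y)
    # Augmented matrix [X'X | X'y].
    A = [[sum(a * b for a, b in zip(cols[i], cols[j])) for j in range(k)]
         + [sum(a * b for a, b in zip(cols[i], y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        for r in range(k):
            if r != i:
                f = A[r][i] / A[i][i]
                A[r] = [a - f * b for a, b in zip(A[r], A[i])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    fitted = [sum(b * col[t] for b, col in zip(beta, cols)) for t in range(n)]
    return sum((yt - ft) ** 2 for yt, ft in zip(y, fitted))

def aic(cols, y):
    """Gaussian AIC up to a constant: n * log(RSS / n) + 2 * (number of coefficients)."""
    n = len(y)
    return n * math.log(fit_rss(cols, y) / n) + 2 * len(cols)

def backward_eliminate(predictors, y):
    """Drop one predictor at a time for as long as doing so lowers the AIC."""
    ones = [1.0] * len(y)  # the intercept is always kept
    keep = list(predictors)
    best = aic([ones] + [predictors[p] for p in keep], y)
    while keep:
        trials = [(aic([ones] + [predictors[p] for p in keep if p != d], y), d)
                  for d in keep]
        score, drop = min(trials)
        if score >= best:
            break
        best, keep = score, [p for p in keep if p != drop]
    return keep

# Simulated data: y depends on x1 and x2 but not on "junk".
random.seed(0)
n = 300
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
junk = [random.gauss(0, 1) for _ in range(n)]
y = [2 + 3 * a - 1.5 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

selected = backward_eliminate({"x1": x1, "x2": x2, "junk": junk}, y)
print(selected)
```

The two real predictors survive because removing either one makes the fit much worse. Whether the pure-noise variable is dropped depends on the particular sample, which is one reason automated selection deserves some caution.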

So they actually show the results of both these models.

I can only show you a piece of it because it spans two pages in the article.

But what they show here let's focus our eyes on

the prize in terms of the main question of interest, the gender comparison.

So they give the intercept from both models. The initial model is

the one that includes every possible confounder as a predictor.

So they have race here and they give

the adjusted differences in salary between

the race groups adjusting for all the other things in

the model but then they give an overall P-value testing whether

there are any differences in average salaries at the population level after adjustment.

And it's not statistically significant.

So ultimately they remove this from

their final model where they only included the statistically significant items.

They looked at age.

These are just some of the things they looked at whether or not

the researcher had children after adjusting for all these other things.

And while those who didn't have children had

lower average salaries in the sample after adjusting for

the other things this result was not statistically significant.

So in this model, the one where they adjusted for everything whether it was

statistically significant or not, the average difference in salary between males and

females was about $12,000 per year. Now, their final model:

this is the model where they only

included the other predictors that were statistically significant.

So the reason there are no entries for all of these things here is because none of

these were statistically significant when they were put in that overall model.

Like I say, I'm not going to show you

the whole table, but there's a whole other page of results, and for some of

these other predictors they remained in the model and

their adjusted estimates are shown in this column as well.

So the average difference for salaries between

males and females upon adjustment only for the things that

remained in the final model was a little bit larger but it was still statistically

significant and it was on the order of $13,400 per year.

Here's another example of a study that used

linear regression techniques, for sleep and BMI in subjects with insomnia.

So here's the abstract, and essentially the study

started by looking at polysomnographically determined sleep

monitored overnight, especially the amount of slow-wave sleep

(SWS) and body mass index in patients with insomnia.

Now initially they recruited patients with insomnia and

people without insomnia for their study and they measured

things like height and body weight at the time of the study, and they report that

no significant correlations were found between total sleep time and BMI among insomniacs.

However, compared with normal volunteers, those without insomnia,

insomnia patients exhibited longer sleep latency and shorter total sleep duration.

While the two groups had no significant differences in BMI,

insomniacs presented with more N1 but

less time spent in slow wave sleep and rapid eye movement sleep.

And based on their slow wave sleep time,

they divided the insomniacs (now they're going to focus in on the insomniacs)

into three groups, and they found differences

in the average BMI between several of these groups.

And then they go on, and this is really what they present in the paper, so it's

interesting they hold it until the end of the

abstract, but this is what we're going to focus on. They say,

"Further analyses with multiple linear regression showed

a significant negative correlation between the amount of

slow wave sleep and BMI scores in insomniacs,

whereas no such correlation was found in

healthy volunteers after controlling for potential confounds like age,

sex, and other indices of sleep."

They actually looked at the regression separately for

insomniacs and healthy volunteers

and they found correlations among the insomniacs with BMI.

And so they go on to say, "Our study suggests that lower amounts of

slow wave sleep may be associated with higher BMI in patients with insomnia."

So then let's just take a look at their data analysis section just to

show the presence of a lot of things

we've covered up till now in this two quarter sequence.

So they go on to say in the Data Analysis section,

Descriptive statistics examined were means and standard deviations.

Comparisons of the two groups were

performed using Student's t-test for continuous variables.

Comparisons across trisection of time spent in

slow wave sleep, that was the three groups they characterized by the

tertiles, were carried out using ANOVA for continuous variables with

normal distributions. Variables that were not distributed

normally were log transformed for statistical analyses.

That's not necessary, but some researchers cling to this idea

that data needs to be relatively normally distributed to do mean comparisons.

And then additional chi square analyses were used for

categorical data and they go on to talk about some other adjustments they made.

But then here's where we get to the part about regression.

Unadjusted linear regressions were used to assess the relationship

between time spent in slow wave sleep and REM sleep,

total sleep time and BMI.

The model was then adjusted for multiple potential confounding variables.

These potential confounding variables include age,

sex, education level, duration of illness, etc.

So I just want to point out,

so here's the first table they showed where they did

descriptive statistics and compared the means or percentages depending

on whether they were continuous or

categorical between the healthy controls and the insomnia patients.

And they report p-values ostensibly using the two sample T-test for

means and if they have proportions which they

don't in this table they would use the chi squared test.

But here's what I want to show you, just to get a sense of

how much slow wave sleep we can expect in one night of sleep.

In healthy controls, the average was

83 minutes, but there was a fair amount of person to person variability.

In insomnia patients, the average was on the order of slightly more than an hour,

62.8 minutes, with a fair amount of patient to patient variability.

So I just want to get a baseline sense of how

much on average we would expect in an insomnia patient.

So then they talk about the specific linear regression that was

sort of the crux of one of their findings in

the article, and we'll show the results next. They say,

because there was a significant inverse correlation

between time spent in slow wave sleep and BMI,

we further estimated the unadjusted association between

time spent in slow wave sleep and BMI using linear regression.

We then added multiple potential confounding variables.

In the second multivariate model,

the confounders or covariates

the additional predictors that were potential confounders, were age,

sex, education level and duration of illness.

In the third model they added

more potential confounders and then

the final model included more on top of the third model.

And we'll actually see now that they presented the results from each of

the models, focusing on the relationship between BMI and slow wave sleep time.

And they say here the significance level was established at P less than 0.05.

So here they're saying their alpha level for their hypothesis tests is five percent.

So here they show the results of several linear regression models.

They don't show all estimated slopes.

They focus on the slope of

the slow wave sleep time as it predicts mean BMI from each of these models.

The first model is the Unadjusted Association.

They just simply regressed BMI on slow wave sleep time.

And we can see that there's a negative correlation: every additional minute

of slow wave sleep time is associated with

an average difference in BMI of 0.02 units lower.

So think about this: if we were looking at

groups of patients who differ by half an hour in

their average nightly slow wave sleep, that would translate to

a lower average BMI for the group that got

more slow wave sleep, on the order of 0.6 units.
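The arithmetic behind that half-hour comparison is just the slope multiplied by the difference in the predictor. A small sketch, where the function name is only for illustration:

```python
def scaled_difference(slope_per_unit, unit_difference):
    """Expected difference in the mean outcome for a given difference in the predictor."""
    return slope_per_unit * unit_difference

# Unadjusted slope of -0.02 BMI units per minute of slow wave sleep,
# comparing groups that differ by 30 minutes.
diff_30_min = scaled_difference(-0.02, 30)
print(round(diff_30_min, 2))
```

This relies on the linearity assumption built into the model: the same per-minute difference is assumed to apply anywhere in the observed range of slow wave sleep.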

So that gives some sense of how large this association is.

They give a confidence interval and a P-value and it's statistically significant.

And then they go on to show that with these subsequent models they

start adjusting for things and then more things and then more things.

They show that while the estimate attenuates

a bit, it gets a little bit smaller in absolute magnitude,

it's still negative and still statistically significant.

And one thing to note here, and I'm not sure why, and I would have asked this if I were

reviewing the paper, is that the confidence interval they report for

the last adjusted association includes zero.

But the P-value is less than 0.05, and without knowing further what they did I

can't comment, but my first reaction was that there may be some kind of reporting mistake here.
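The reason that combination looks odd is that, for a standard symmetric interval of the form estimate ± 1.96 standard errors, the 95% confidence interval includes zero exactly when the two-sided p-value is above 0.05. Here is a sketch that backs out an approximate p-value from an interval. The interval endpoints below are hypothetical, not the values from the paper, and the normal approximation is an assumption.

```python
import math

def p_from_ci(lower, upper, z_crit=1.96):
    """Approximate two-sided p-value implied by a symmetric 95% confidence interval."""
    estimate = (lower + upper) / 2
    se = (upper - lower) / (2 * z_crit)
    z = estimate / se
    # Two-sided tail area under the standard normal curve.
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical intervals for a slope: one spans zero, one does not.
p_spanning_zero = p_from_ci(-0.031, 0.001)
p_excluding_zero = p_from_ci(-0.030, -0.002)
print(p_spanning_zero > 0.05, p_excluding_zero < 0.05)
```

If an article reports an interval that spans zero together with p < 0.05, either the interval was built some other way or one of the two numbers was misreported.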

But on the whole, what they're demonstrating is that there is, both to start and then after

adjustment, a statistically significant negative

association between average BMI and slow wave sleep.

And that persists even after adjustment for other factors.

They layered in other factors: they start with this set.

Then they add these additional indices, and then they finally add

REM sleep time and more indices,

and throughout those layers of

adjustment, once they start adjusting for the first layer, age,

sex, and education level, there's

a little attenuation from the unadjusted, the absolute value gets

smaller, and then it stays consistently around the same order of magnitude with more adjustment.

So they're demonstrating that the initial association they found

wasn't completely explained by the other factors they adjusted for.

One last one we'll look at, and it's another sleep-based study.

This is an earlier study from The Lancet in

1994, and it has to do with blood pressure and snoring.

And so in the summary in the abstract they say,

"The association between snoring and blood pressure is still a matter for debate."

And they go on to say this is partly because of

the uncertainty about the definition of snoring and partly

because confounding factors may affect systemic blood pressure, such as obesity,

sleep apnea, and nocturnal hypoxaemia.

So what they did is they took a relatively large sample, over 1,400

patients, the majority male, who were referred to a sleep disorder study center.

And they got a full history on their health with

particular attention to CVD and medications.

And they go on to summarize that the patients had nocturnal

polysomnography, or a sleep study, including

objective measurement of snoring, and blood pressure was measured in the morning.

18 percent of the non-snorers had hypertension, as did

20 percent of the heavy snorers; that proportion was not statistically significantly different.

However, they go on to say

multiple linear regression analysis showed that

snoring was not a significant determinant of blood pressure.

Only age, male sex,

apnoea/hypopnoea index,

and body mass index contributed significantly to the variability in blood pressure.

So I just want to hone in on something that I found interesting here.

There are summary statistics they give for

the sample, and let's look at this snoring index.

The average is

323 for all subjects in the sample, with a fair amount of variability, a standard deviation of 321.

But this ranges from zero to 1,846.

And the authors define the snoring index as the number of snores,

I'm taking this verbatim from the text,

in an hour of sleep.

It was measured as part of the sleep study,

so I am very curious and there's nothing in the article that shows this but I'm very

curious about what the distribution of

these measures looks like across the people in the sample.

I would love to have seen a histogram of that.

So then they go on to report the regression results. The first set

they call univariate,

which means that these are the unadjusted estimates; for example, this is the regression

of systolic blood pressure on sex and only sex.

And here, by putting male,

the implication is that this number compares males to

the female reference group. Age is in years and BMI in kg/m².

And then here down the table they have the snoring index.

So let's just look at, for example, sex,

even though snoring index is the main question of interest.

So they show that males have on

average systolic blood pressures

7.2 millimeters of mercury higher than females.

And the result is statistically significant.

And they give a standard error of 1.052.

So we could create a confidence interval if we wanted.
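For instance, an approximate 95% confidence interval is the estimate plus or minus 1.96 standard errors. The 1.96 multiplier is the large-sample normal value, which seems a reasonable assumption with over 1,400 patients; the estimate and standard error are the ones reported in the table.

```python
def wald_ci(estimate, se, z_crit=1.96):
    """Approximate 95% confidence interval: estimate +/- 1.96 standard errors."""
    margin = z_crit * se
    return estimate - margin, estimate + margin

# Unadjusted difference in mean systolic blood pressure, males vs. females (mmHg).
lower, upper = wald_ci(7.2, 1.052)
print(round(lower, 2), round(upper, 2))
```

That works out to roughly 5.1 to 9.3 millimeters of mercury, an interval that does not include zero, consistent with the statistically significant result.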

And this is the slope from

a simple regression where X is one for males and zero for females.

Then they show the adjusted association,

adjusted for the other factors in the model.

This is the slope of an X for sex in a model that includes age, BMI, AHI, etc.

It shows that after adjustment for these other factors males still

have a higher systolic blood pressure than comparable females.

But it's on the order of 5.4 and it is statistically significant.

So slightly smaller than the difference

when not adjusted for anything but still statistically significantly larger.

Let's go to snoring index now.

So this unadjusted slope is from a simple model that relates

the average systolic blood pressure to a slope times snoring index.

But snoring index is measured on a continuum.

So this compares the average difference in systolic blood pressures between

two groups who differ by one unit on the snoring index.

This assumes that the relationship between systolic blood pressure and snoring index is

well described by a line across that entire range in the sample, from zero to over 1,800.

And I would love to see a scatterplot of that relationship.

But in any case, under that assumption they estimated that

the average difference in systolic blood pressure per

one unit difference in snoring index is on the order of 0.007

millimeters of mercury, not a large amount, but remember we're

comparing people in the sample whose values range from zero to over 1,800.

So if we actually compared two groups whose snoring

indices differed by 100, then we would expect the group with

more snoring, 100 units more, to have a blood pressure on the order of

0.7 millimeters of mercury higher on average than those with the lower score.

This result is statistically significant in the non-adjusted comparison.

But notice when they go to adjust the estimate becomes

negative and is no longer statistically significant.

So it appears at first that there was a statistically significant association between

systolic blood pressure and snoring index.

But after adjusting for other factors that are potentially related to both snoring and blood pressure,

the association from a statistical perspective disappeared.
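Here is a small simulation of how that can happen. It is a sketch with made-up data, not the study's: a single standardized confounder (labeled `bmi` here only for illustration) drives both the snoring measure and blood pressure, and snoring has no effect of its own on blood pressure.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

random.seed(1)
n = 5000
bmi = [random.gauss(0, 1) for _ in range(n)]         # hypothetical standardized confounder
snore = [0.8 * z + random.gauss(0, 1) for z in bmi]  # snoring depends on the confounder
sbp = [5.0 * z + random.gauss(0, 1) for z in bmi]    # outcome depends only on the confounder

# Simple regression slope of sbp on snore.
unadjusted = cov(snore, sbp) / cov(snore, snore)

# Slope for snore from the two-predictor least-squares fit (snore and bmi).
s11, s22, s12 = cov(snore, snore), cov(bmi, bmi), cov(snore, bmi)
s1y, s2y = cov(snore, sbp), cov(bmi, sbp)
adjusted = (s1y * s22 - s2y * s12) / (s11 * s22 - s12 ** 2)

print(round(unadjusted, 2), round(adjusted, 2))
```

The unadjusted slope is clearly positive while the adjusted slope sits near zero, which is the same pattern the authors report for snoring index once the other factors enter the model.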

So this is how they describe it.

They say results of the regression analysis are shown in Table 3.

Although univariate analysis shows

statistically significant contributions for all variables when considered individually,

in other words all of the unadjusted associations

were significant, full model multivariate analysis showed that only male sex,

age, BMI and AHI, which is an apnea index, contribute significantly to a final model.

These variables in fact were the only ones selected by a stepwise multiple regression analysis.

In other words they let the computer

choose the results that were statistically significant

to form their ultimate model which they actually don't show.

They only show the results of a model that included all of

these predictors, in which the R² =

0.18 for diastolic blood pressure and

0.21 for systolic blood pressure. That's what we were looking at.

So they're essentially saying they estimate based on

their data that together sex, age, BMI,

and this apnea index explain an estimated 21 percent of

the variation in the systolic blood pressure measurements in their sample.
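As a reminder of where a number like 0.21 comes from, R² is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch with made-up blood pressure values, not data from the article:

```python
def r_squared(observed, fitted):
    """Proportion of outcome variability explained: 1 - SS_residual / SS_total."""
    mean_y = sum(observed) / len(observed)
    ss_total = sum((y - mean_y) ** 2 for y in observed)
    ss_resid = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    return 1 - ss_resid / ss_total

# Hypothetical systolic blood pressures (mmHg).
observed = [120.0, 135.0, 128.0, 150.0, 142.0]

# A model that predicts the overall mean for everyone explains nothing.
mean_only = [sum(observed) / len(observed)] * len(observed)
r2_mean_only = r_squared(observed, mean_only)

# A model that reproduces every value explains everything.
r2_perfect = r_squared(observed, observed)
print(r2_mean_only, r2_perfect)
```

An R² of 0.21 sits much closer to the first extreme than the second: most of the person-to-person variation in systolic blood pressure is left unexplained by those predictors.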

So hopefully this has been informative to show you at least some examples of the use of

linear regression in research and the reporting of both

how the researchers conceptualized modeling their results,

how they chose their final if you will multiple regression model and

then the presentation of the results and how they interpret them substantively.
