A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.

Course from Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

238 ratings

From this lesson

Module 4B: Making Group Comparisons: The Hypothesis Testing Approach

Module 4B extends the hypothesis tests for two-population comparisons to "omnibus" tests for comparing means, proportions, or incidence rates between more than two populations with one test.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

So in this lecture section, we're going to extend the ideas that we

developed in lectures nine and ten with regards to hypothesis testing.

And we're going to look at situations where we can compare parameters, whether

they be population level means, proportions or incidence rates, between

more than two populations using data from more than two samples in one test.

So, in this first section we're going to look

at the situation where we want to compare means

of a continuous outcome between more than two populations.

It's an extension of the two-sample unpaired t-test and is called Analysis of Variance, frequently referred to by its nickname, ANOVA.

So you may say, well, if you're comparing means, why is the algorithm called Analysis of Variance? Because variance refers to variability. Well, let's think about it for a minute.

Let's go back to the t-test, which is a specific

case of analysis of variance when we only have two groups.

If you think about what we do when we compute our distance metric, we look at the distance between the two group means in the numerator.

Which is a measure of how much the group

means vary between the two groups we're looking at.

And we divide it by the expected

variability of this difference estimate around zero,

under the null hypothesis.

So in some sense, we're comparing variability to variability.

And the analysis of variance for more than two groups just extends that

idea, and creates a similar distance

metric that unifies it across multiple groups.
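To make the "t-test is a special case of ANOVA" point concrete, here is a small numerical check (the data are made up purely for illustration): with exactly two groups, the one-way ANOVA F-statistic equals the square of the pooled two-sample t-statistic, and the two p-values agree.

```python
# Two-group special case: one-way ANOVA reduces to the unpaired,
# equal-variance two-sample t-test, with F = t^2.
from scipy import stats

group_a = [1.0, 2.0, 3.0]   # illustrative data only
group_b = [4.0, 5.0, 6.0]

t_stat, t_p = stats.ttest_ind(group_a, group_b)   # pooled-variance t-test
f_stat, f_p = stats.f_oneway(group_a, group_b)    # one-way ANOVA

print(f_stat, t_stat ** 2)  # both 13.5
print(f_p, t_p)             # identical p-values
```
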

So in this lecture section you will learn to interpret a p-value from this hypothesis test for any mean difference between more than two populations. The test is testing for any mean differences between more than two populations. And this method, for getting the p-value, is, as I said before, called the Analysis of Variance, or by its nickname, ANOVA.

So let me give you this first example from a study done in the late 1970s, where

researchers looked at the relationship between smoking and measures of pulmonary

health, including mid-expiratory flow.

And the researchers recruited study subjects and

classified them into one of six smoking categories.

They had non-smokers; passive smokers, those who were exposed to secondhand smoke; non-inhaling smokers; light smokers; moderate smokers; and heavy smokers.

And to start, the researchers were interested in whether

there were any statistically significant differences in pulmonary outcomes.

Such as FEV1, mid-expiratory flow, et cetera, between the six underlying groups. And they wanted to compare, for example, mid-expiratory flow, which is measured on a continuum, across these six groups.

If they only knew about the two-sample comparisons we have done thus far, they would need to do lots of two-sample t-tests, one for each possible two-group comparison.

And if you enumerate the number of possible unique two-group

comparisons from the six, there would be 15 unique comparisons.

Non-smokers to passive smokers.

Non-smokers to non-inhaling smokers and so on.
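The count of 15 is just "six choose two"; a one-liner confirms it:

```python
# Number of unique two-group comparisons among k = 6 smoking groups
from math import comb

n_pairs = comb(6, 2)  # k choose 2
print(n_pairs)        # 15
```
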

So, that would be labor intensive.

And it wouldn't give a unified picture of what was

going on.

So there's another method that we can use to extend the

two sample t-test to compare means between more than two populations.

And this is called Analysis of Variance. Sometimes called ANOVA, or one way ANOVA.

And the one way indicates that we only have one predictor or grouping factor.

In this case, it's smoking.

Sometimes you'll see something called two-way ANOVA, which allows for two grouping factors to compare across.

So for example, smoking and sex of the person.

And we'll look at some examples of two-way ANOVA in the second term of this course.

But for now, as we've done thus far, we're only looking at

one predictor or one grouping factor which in this example is smoking.

The general idea behind ANOVA for comparing means for k populations, where I'll just generically say k is a number greater than two, is this: the null hypothesis is that all of the population-level means for the k groups are equal. We could phrase this, though it would be a little harder to do succinctly, in terms of all possible unique mean differences. So the overall null is that the underlying population means are all equal, and any two-way difference is zero. But this is the standard way to state the overall null.

And then the alternative hypothesis is that at least one population mean is different from at least one other population mean.

So if we fail to reject this, we're making the conclusion that our results are not unlikely if these data came from populations with the same underlying means. But if we do reject, then we're only making the conclusion that at least one population mean is different from at least one other. And we don't actually get information about which means are statistically different from each other or what the magnitudes of the differences are.
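As a minimal sketch of this omnibus test (the three samples below are invented for illustration, not real study data), scipy's `f_oneway` returns a single F-statistic and a single p-value for the null that all underlying population means are equal:

```python
# One-way ANOVA across three hypothetical groups: one test, one p-value
from scipy import stats

group1 = [1.0, 2.0, 3.0]
group2 = [4.0, 5.0, 6.0]
group3 = [7.0, 8.0, 9.0]

f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f_stat)   # 27.0: between-group variability dwarfs within-group
print(p_value)  # ≈ 0.001: reject the overall null at the 5% level
```

A small p-value here only says that at least one mean differs from at least one other; it does not say which, or by how much.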

So let's go back to the smoking and mid-expiratory flow example. We're going to focus on the mid-expiratory flow, although they measured other pulmonary outcomes as well.

And so, from a pool of over 5200 potential participants,

a random sample of 200 men and 200 women was drawn from each smoking group.

So they enrolled a bunch of people to be potential participants, and then had these people self-classify into one of the six smoking groups based on their smoking habits.

And then they randomly sampled 200 men and 200 women from each smoking group,

except for the non-inhalers because there were so few amongst the 5,200.

So they took 50 men and 50 women from that group.

And then they took pulmonary measurements

on each of the subjects, including this mid-expiratory flow, or FEF 25 to 75%.

Here's

a table of some of their summary statistics.

So they had the groups labelled here.

And then they had characteristics on each of

them, including their age, their height, and then

their pulmonary measures, FVC, FEV1, and then what we're focusing on to illustrate, this mid-expiratory flow.

And I'll just re-present these data to make it a little easier to see, because we'll only focus on the FEF 25 to 75%, or the mid-expiratory flow.

Now you can see here, just exploratorily, there seems to be, at least in the samples, somewhat of a dose response.

The greater the degree of smoking, so as we go from non-smokers to heavy smokers,

as expected the lower the mid expiratory flow in liters per second.

So at least in these sample data, not only does it look like there are potentially differences between smoking groups, but it's almost of a dose-response nature.

So, if we wanted to test whether one or any of these differences were in fact statistically significant, and account for the uncertainty in the sample estimates before we make a strong conclusion about FEF being related to smoking level, we can do an analysis of variance.

So the null for the analysis of variance is that the mean FEF values for all six smoking groups are the same.

And the

alternative is that at least two of the six groups have different means.

So, if you do this test you get a p-value of less than 0.01.

So what is the conclusion here? Well, at our standard 5% level, this suggests that if these six samples had come from populations with the same mean FEF values, then the chances of getting these study results, or results even less likely, are very small: less than one in a hundred.

So, our conclusion at the 5% level would be to reject the null.

And conclude that at least some of the

smoking group means are statistically different than others.

So you might think, well, do I have to go back and actually find out where the differences are?

Well, that's one possibility.

Now we could actually go and do t-tests for each comparison to see where the statistically significant differences are.

My take on this is that the p-value, coupled with that decreasing mean as a function of increasing smoking level, that is, decreasing pulmonary function with increasing smoking level coupled with a statistically significant result, gives an overall picture that there is a statistically significant association between reduced pulmonary health and greater smoking.

What the authors did is actually use the ANOVA approach for each of their pulmonary outcome comparisons.

And then they did some more specific comparisons to look at where the biggest differences were, group to group.

So, here's what they say about this.

When we looked at the extent to which smoke exposure is related to graded abnormality, we found that non-smokers in smoke-free working environments have the highest scores in the spirometric tests.

So that includes FEF and some of the other measures they took.

Passive smokers, smokers who do not

inhale and light smokers score similarly and

significantly lower than non-smokers. And then heavy smokers scored the lowest.

So here they give a p-value from an ANOVA for each of their comparisons, for each of the tests they did: FEV1, FVC, the mid-expiratory flow, and the FEF 85% to 95%. The results were statistically significant, with p-values less than 0.005.

But they also did some post-ANOVA analyses to look at where the differences were most notable, statistically speaking. And they sort of found three clusters.

The non-smokers did the best, followed by the cluster that includes passive smokers, smokers who do not inhale, and light smokers. And then medium and heavy smokers scored the lowest.

But the overall picture and the overall conclusion

they give is that, we conclude that chronic exposure

to tobacco smoke in the work environment is deleterious

to the nonsmoker and significantly reduces small airways function.

So in general they found that smoking was bad, but they also highlighted the role of secondhand smoke and its impact on people exposed to it.

Here's another example of more than two populations being compared.

This again, is for pulmonary outcomes, but we

can use ANOVA for other outcomes as well.

So this was a study on pulmonary outcomes that included a total of 60 patients from three medical centers.

So there were 60 patients with coronary artery disease from Johns Hopkins,

Rancho Los Amigos Medical Center and the St. Louis University School of Medicine.

The purpose of the study was to investigate

the effects of carbon monoxide exposure on these patients.

And prior to analyzing the carbon monoxide effects data, researchers wanted to get

a sense of how the respiratory health of these patients compared across the three

medical centers.

Ostensibly their life would be easier if the respiratory health

was comparable prior to the exposure of the carbon monoxide.

So, I was able to get my hands on these data, and here are some box plots of the FEV1

measurements prior to exposure to carbon monoxide.

For the patients at these three medical centers.

So you can see there's only 60 patients to begin with, so these samples are small.

So here's Johns Hopkins with 21 patients; Rancho Los Amigos with 16 patients.

And then Saint Louis University Medical Center with 23 patients.

And our eyes can be playing tricks on us because of scaling but it does

seem that at least visually speaking there

are some differences in the distributions of these

FEV1 measures.

Namely between Johns Hopkins and Rancho Los Amigos.

But again, these are based on small samples of

data, so this could just be because of sample variation.

So, what the researchers did was an analysis of variance, testing the null that these mean baseline FEV1s are equivalent at the patient population level, versus the alternative that at least one mean was different from at least one other mean.

And their p-value for this is 0.052.

Wow, that's crazy, isn't it? It's so close to being statistically significant, but technically it's not: it is not less than 0.05. So this result is, at the 0.05 level, not statistically significant.

However, something we should think about here and we'll discuss in more

detail in the next set of lectures is the small sample sizes.

And one of the reasons we may not have found a statistically significant difference is because we didn't have the ability to detect one.

So it's sort of the jury's out, as to whether or not we fail to reject the

null, because there are really no differences at the

patient population level or because we couldn't see it.

So, suppose you're strictly applying a letter-of-the-law cutoff here in considering whether you need to factor in these baseline differences when actually looking at the results after exposing the patients to carbon monoxide.

You could play naive and say, well the p-value

was greater than 0.05, the results were not statistically significant.

So we don't have to adjust or account for

them when looking at our results after the exposure.

But I would advise, in a situation like this with such small sample sizes, that you would want to look at the resulting FEV1 measurements after exposure to carbon monoxide both on their own and then, as we learn how to do in the second term, adjusting for the starting FEV1 measurements.

So how does ANOVA work? Well, it's the same approach conceptually. We assume the null hypothesis is true, that all means are equal for the populations being compared by our samples. The method then goes on to compute a measure of discrepancy between what was observed in the samples compared to what was expected under the null.

So this is very much in line with what we've done with every hypothesis test thus far.

This measure of discrepancy is sometimes called the F-statistic.

And we won't show how to directly compute this, but you could think of it as

an extension of the two sample t-test statistic

that allows for comparing differences between multiple samples.

And then this measure of discrepancy that we get for our samples is compared to the distribution of such measures under random sampling variability when the null hypothesis is true.

So we basically again look at where our result falls, relative to what we could have expected to get just by sampling variability when the null is true, and then figure out whether we're part of the majority or an outlier.

And the thing that ultimately tells us where we fall in the pack is the p-value.

It tells us the chances of being as far or farther away than we are under the null.

So for this example with the FEV1 and medical centers, the F-statistic is 3.12.

I don't know what that means off the top of my head, and these F-statistics do not have an easily interpretable distance metric like number of standard errors.

So it's very hard to look at the F-statistic alone and make a decision about whether it's statistically significant or not; to get a p-value for these things, you would need to go to a computer or an F-distribution table.

F-distribution is a high maintenance distribution.

It has two sets of degrees of freedom.

The numerator and the denominator. And in these ANOVA comparisons with k groups, the numerator degrees of freedom is the number of groups we have minus 1, so for our example it's 3 minus 1, or 2, and the denominator is the total sample size minus the number of groups we have.

So we had a total of 60 individuals across

three groups, so the denominator degrees of freedom is 57.
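Given the reported F-statistic of 3.12 and these two degrees of freedom, the p-value is just the upper-tail area of the F(2, 57) distribution; a quick check with scipy's F-distribution recovers the study's reported value:

```python
# Convert the F-statistic to a p-value using its two degrees of freedom
from scipy import stats

k, n_total = 3, 60                          # three centers, 60 patients total
df_num = k - 1                              # numerator df: 3 - 1 = 2
df_den = n_total - k                        # denominator df: 60 - 3 = 57
p_value = stats.f.sf(3.12, df_num, df_den)  # upper-tail area beyond F = 3.12
print(round(p_value, 3))                    # 0.052, matching the reported p-value
```
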

The only reason I point this out is that a lot of times you'll see in papers that people report the value of the F-statistic and then tell you which distribution to look it up on.

That's actually, for my purposes, sort of ancillary information, because it doesn't tell me what I want to know, which is how unlikely my results, or their results, are under the null.

But this gets converted: these F-distributions, kind of like the chi-square, are skewed, and we can find where 3.12 occurs on the distribution, figure out what percentage of observations are as far or farther away, as likely or less likely, in either direction, and convert that to a p-value. And that's the p-value they got of 0.052.

So this is really the result you want to see, not the interim steps.

And again, the idea's exactly the same: this F-statistic is nothing more than a distance measure. It's just not easy to interpret in its own right.

So let's look at one more example of where we have ANOVA; frequently these are used just to give some understanding of characteristics that differ between the subgroups being compared in our population.

So this is the academic physician salary study we've looked at before, where the goal was to compare the mean salary between male and female academic physicians.

And one of the things they wanted to account for in this study was

other differences between males and females that

may have been related to salaries as well.

So in this table here.

And this is a very common type of table to see in this study.

They present what they call bivariable associations between salary and measured characteristics.

And they look at the average salary in different subgroups of the entire sample.

So for example.

They look at the salary differences on average between groups ranked by their National Institutes of Health funding into four levels.

And then they present the mean salary for each group, and they give confidence intervals, but then they present an overall p-value testing the null that the mean salaries are not different between any of those four groups.

And this comes

from ANOVA.

So, this is just to show you two examples from this table, but this is very common, and it's a nice presentation.

They give the means for each of the four groups they're comparing.

They give the confidence intervals so we can eyeball and see where

the largest differences are after accounting

for sampling variability and then they

summarize the comparison for this overall p-value.

And so this gives a heads-up that there is some association between the current institution's National Institutes of Health funding and the salaries paid by the institution. If this characteristic is also related to being female, i.e., if the distribution of females is different between institutions with different rankings, then the authors are going to want to adjust for that in their analysis when they compare males and females.

And we'll again get into adjustment in the second term.

They also looked at how salaries differed by institutional region; they gave means and confidence intervals, and you can see this comparison was not statistically significant.

So it's very common to see the results from ANOVA presented in this way, for looking at the relationship between an outcome of interest and multiple different predictor variables one at a time.

So in summary, ANOVA is just an extension of the two-sample t-test that allows us to compare multiple means from multiple populations with one test.

It only gives us a p-value; there's no confidence interval or measure of association we can give to present the overall differences between the multiple means.

But if we do find statistically significant differences and are interested in figuring out where they are, the researcher can go back and do the appropriate t-tests for the comparisons then.
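A sketch of that post hoc step, on invented data: after a significant omnibus ANOVA, run the pairwise two-sample t-tests to see where the differences lie. (As a common convention, not something this lecture has covered, the p-values are adjusted for the number of comparisons with a Bonferroni correction.)

```python
# Post hoc pairwise t-tests after a significant one-way ANOVA
from itertools import combinations
from math import comb
from scipy import stats

groups = {                      # hypothetical data for illustration
    "group1": [1.0, 2.0, 3.0],
    "group2": [4.0, 5.0, 6.0],
    "group3": [7.0, 8.0, 9.0],
}

n_tests = comb(len(groups), 2)  # 3 unique pairwise comparisons
adjusted = {}
for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
    _, p = stats.ttest_ind(a, b)                        # unpaired t-test
    adjusted[(name_a, name_b)] = min(p * n_tests, 1.0)  # Bonferroni adjustment

for pair, p in adjusted.items():
    print(pair, p)
```
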

In the next two sections, we'll look at extending the comparisons to multiple populations for binary outcomes and time-to-event outcomes.