A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.

A course from Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

235 ratings

From this lesson

Module 4B: Making Group Comparisons: The Hypothesis Testing Approach

Module 4B extends the hypothesis tests for two-population comparisons to "omnibus" tests for comparing means, proportions, or incidence rates between more than two populations with one test.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

So, in the next sections, we'll look at

the roles of sample size, power, and detectable

difference of interest, and the effects that these

have on each other when designing a study.

And I want you to understand the relationships

between these, and what affects what in which direction.

I'll be showing you the results from computer software I use, the statistical

package Stata, to estimate the sample sizes

and power under a variety of conditions.

And I don't expect

you to be able to replicate these results with Stata or by hand,

but I want you to appreciate the influence each factor has on each other.

However, I am going to point you in the direction of a free application

download that you can get, which is the sample size and power calculator.

And it may be of interest to you to

download this and play around with the inputs just

so you can get further reinforcement of the relationship

between these different quantities and the role they have

on each other.

So, in this last lecture set for Statistical

Reasoning 1, we're going to talk about designing studies

to have a desired power, and the ideas are similar to what we did in Lecture 12.

But this approach is more commonly used

for studies that are designed to compare populations.

So in this set of lectures, the relationship between sample sizes

and precision will be re-expressed through the window of study power.

It is more common to design a study to have a certain level of power, 80 or 90

percent, than for a desired margin of error, especially

when we're comparing populations, but the approach is analogous.

In these lecture sets, power and its influences will be explored, and some

examples of designing a study to achieve a certain power level will be given.

So, this has a very sexy title, Power and Its Influences.

But I'm only [LAUGH] allowed, or qualified perhaps, to talk

about statistical power and its influences, not power in general.

So let me give you an example of a study with low

power, and we'll refocus on the idea of power in this lecture.

So consider the following results from a small study done on

29 women, all between the ages of 35 and 39 years old.

So a random sample

of 29 women was taken from a clinical population, and then the women were

classified as to whether they were currently

using oral contraceptives or not at the time.

And so eight of the 29 women were using oral

contraceptives at the time as compared to 21 who were not.

And the researchers measured their blood pressures, and then wanted

to make a comparison between the blood pressures of those

on oral contraceptives, at the time, to

those who were not using oral contraceptives.

And so here are summary statistics on each of the two samples.

The average blood pressure among the oral

contraceptive users was 132.8, as compared to 127.4

in the other group, and there are estimates of

the standard deviation based on the sample results.

So, what the researchers were particularly interested in looking at

with the study is whether oral contraceptive

use is associated with higher blood pressure.

So, statistically speaking the researchers were interested in

testing the null hypothesis, that the underlying population level

blood pressures between the women using oral contraceptives and

not are equal, versus the alternative that they're not.

Or, as we like to express things in terms

of differences, the null is that the mean difference at

the population level is zero, versus that it's different than zero.

And so again here are the study results, and ultimately what came out of

this was the sample mean difference in

blood pressures is 5.4 millimeters of mercury.

5.4 millimeters of mercury higher, on average, for

the women who are on oral contraceptives.

That's a sizable difference, but, of course,

this is based on very small samples, and if

they actually did the 95% confidence interval, you see it

goes from negative 8.9 millimeters of mercury, all the

way up to 19.7 so that's very wide and inconclusive.

It includes the null value zero, and the corresponding P value is 0.43.

So the decision here, in hypothesis testing,

would be the ambiguous fail to reject

the null, and it's especially ambiguous because this study is small.

And it's not clear whether this is because

the null is true or because there

was so much uncertainty in the data that we

couldn't see differences based on such a small sample size.

So suppose you, as a researcher,

were concerned about detecting a population level

difference of this magnitude, on the order of five

millimeters of mercury on average, if it truly existed.

Well, this particular study of 29 women had

low power to detect a difference of such magnitude.

Its chances of detecting a difference of at least 5.4 millimeters of

mercury, were it really the truth at the population level, were low.

So just to remind us what power

is, and I just sort of alluded to it in the last slide.

Recall the table comparing the underlying truth that we can't observe to decisions

made by hypothesis testing. We're very familiar with this first situation.

If the null is true, but we end up

deciding to reject the null, we've done the wrong thing.

We've made a Type 1 error, and that's our alpha level of the test,

the level we are willing to tolerate for that.

However, if we reject the null, when the

alternative is true, that's a good thing, and the

chances of doing that for some alternative for a

given study is called the power of the study.

For a given study of a given sample size,

what are the chances of finding a significant difference for

some specified alternative difference value given the size

of the study? So power is a measure of doing the right

thing when the alternative hypothesis is the truth that generated our samples.

And certainly for a study higher power is better, but it comes at a cost.
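
The two cells of that table can be written compactly as conditional probabilities. This is a sketch in standard notation; the symbol $\beta$ for the Type 2 error rate is conventional and is not named in the lecture itself.

```latex
\alpha = P(\text{reject } H_0 \mid H_0 \text{ true}), \qquad
\text{power} = P(\text{reject } H_0 \mid H_A \text{ true}) = 1 - \beta,
\quad \text{where } \beta = P(\text{fail to reject } H_0 \mid H_A \text{ true}).
```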

So why is higher power better?

Well, with higher power studies, going into it,

it's been designed such that if there is a

real difference at the population level, the study

of that size has good opportunity to see it.

And if we end up failing to reject the

null with a high powered study, it's much clearer

to go with the idea of the null

being the underlying truth because there's not that ambiguity

about whether we could have found a difference if it really existed.

So when a study with low power

finds a non-statistically significant result, it is

hard to interpret this result. As I

said before, it's ambiguous: we don't know whether

we failed to reject the null because the null is true, or because we

just had so much noise or uncertainty in our data that we couldn't see differences.

When a study has high power,

a non-statistically significant result can be interpreted

more confidently as no association, which is an important finding in research.

So,

just to give you an example of the study power for the

one we just looked at, the oral contraceptive blood pressure study has a

power of 13% to detect a difference in blood pressure of 5.4 millimeters of mercury or

more between the oral contraceptive users and the non oral contraceptive users,

if the difference truly exists in the population of women that were sampled.

So, in other words, of studies based on only 29 women from this population,

only about one in ten,

a little more than one in ten, or 13%, would actually pick up a

difference at the population level of 5.4 or more if that's really the truth.
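
The "one in ten studies" reading can be checked with a small simulation. What follows is a minimal sketch under stated assumptions, not the Stata calculation behind the 13% figure: the estimate is taken to be normally distributed around a true difference of 5.4, the standard error of about 7.15 is a hypothetical value backed out of the reported interval (half-width 14.3 divided by 2; the lecture does not state it), and a study "rejects" when its estimate is more than two standard errors from zero. The function name and seed are illustrative.

```python
import random


def simulated_power(true_diff, se, n_studies=100_000, z_crit=2.0, seed=1):
    """Fraction of simulated studies whose estimate lands more than
    z_crit standard errors away from zero (i.e. rejects the null)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_studies):
        # One hypothetical study: its estimate varies around the true difference.
        estimate = rng.gauss(true_diff, se)
        if abs(estimate) > z_crit * se:
            rejections += 1
    return rejections / n_studies


# Assumed inputs: true difference 5.4 mmHg, standard error about 7.15 mmHg.
print(f"share of studies rejecting: {simulated_power(5.4, 7.15):.3f}")
# With no true difference, the share rejecting is the Type 1 error rate.
print(f"share rejecting under the null: {simulated_power(0.0, 7.15):.3f}")
```

Under these assumptions the share comes out near 11%, close to but not exactly the 13% quoted, since both the two-standard-error rule and the backed-out standard error are approximations.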

So this means that our study had very low opportunity to see

a difference of perhaps substantive interest, if it were really the case at

the population level. So where does power come from?

How do we actually compute it?

Well I'll give you the idea behind it.

So recall something we worked on a lot in this course.

The sampling behavior of estimates comparing two samples

(mean differences or risk differences), or the log

of estimates comparing two samples when they're ratios,

is normally distributed in large samples, with the sampling

distribution centered at the true difference of interest.

So whether it be a mean difference, a difference in proportions, et cetera.

So under the null, our null hypothesis with regards

to these differences is that the difference is zero.

If the null is the truth then this

sampling distribution curve is centered at zero.

So for designing a study

to have a certain power, or estimating the power of a completed study, we have to

be specific about the value of our alternative,

and this is where it gets a little trickier.

When we do hypothesis testing, our alternative is very vague,

it's just that the difference of interest is not zero.

But in order to actually compute power or design a study to have

power, we have to be more specific and talk about alternatives that we'd

be interested in seeing, and put a lower bound

on the difference that would be substantively interesting.

So if we're doing hypothesis testing comparing two

groups, as we know, the null and alternative are

no difference in the populations from which the two

samples came, versus the very vague there is a difference.

So in order to actually look at the power of

an existing study or design a study to have a certain

power, we have to actually get specific about a minimum

difference that we are interested in seeing as a researcher.

So for example in the blood pressure oral contraceptives study,

I may not be interested if the difference between oral contraceptive

users' blood pressure and that of non-users is on the order of half a

millimeter of mercury, or one millimeter of

mercury, because that isn't very clinically significant.

A small shift up.

I may only be interested in seeing differences if they are

at least on the order of four or five millimeters of mercury.

So I don't want to actually spend the resources to see smaller differences.

There's a minimum idea of what

would be scientifically interesting.

And I have to specify that in order to look at the power

of a study or design a new study to have a certain power.

So let's just talk, briefly.

I'm going to draw some cartoons, and then I'll animate them, so that

you don't have to suffer through

my drawing skills throughout this entire lecture.

So let's look at what we know about sampling distributions.

We know that for most

of the differences we can look at,

if we looked at them across multiple studies of the same size, like a

mean difference or difference in proportions or

the log of relative risks, et cetera,

the estimates, if we plotted them in a histogram,

would be normally distributed and centered at the truth.

And if our samples come from populations with equal measures

where the difference is zero, if the null is the truth

then this sampling distribution will be centered at zero.

So I'm just going to redraw that, so under the null our sampling distribution

of our estimates there'll be some variability, but it'll be around zero.

However, if there's another truth out there, the null is not

actually the truth, the alternative is true that there's some difference,

and now we're going to specify what some difference could mean.

We'll say there's some difference, and we'll call it d.

D could be one millimeter mercury, or 10%, or some specific number.

Then if that were the case, what we'd really have, behind the scenes, is the

alternative is true and then the sampling

behavior of our estimate would be normally distributed.

It'd be the same curve

here, but it would be centered,

at this alternative value of the difference, the

true value in the population that's not zero.

So, what does this mean for us?

We are going to make a decision, to reject

the null or not, based on this first curve.

We're assuming the null is true.

This black curve describes the sampling behavior of our estimates under

the null, and we're going to make a decision to reject

at the five percent level, if our estimate from our

study comparing the two samples, from the two populations is outside

of two standard errors from our null value of zero.

So that, if it's not more than two standard errors away we will not

reject, but if it is more than two standard errors away we will reject.

So, what is power? Well if in fact our data actually

comes from populations with a difference at least as large

as d, one we'd specify in advance (and we'll get to where that comes from),

then our power is the probability of rejecting based on

this black curve, the distribution under the null, when in fact

this curve describes our sampling behavior because the alternative is true.

So this area in here is the probability of getting

a result that's more than two standard errors away from zero

when the samples come from a population where the actual

difference in the quantities being compared is this alternative value d.
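
That area can be computed directly from the normal curve. The following is a minimal sketch, assuming the estimate is normally distributed around the alternative value d with a known standard error, and that we reject when the estimate is more than two standard errors from zero. The function name is illustrative, and the standard error of 7.15 used in the example is a hypothetical value backed out of the blood pressure study's interval (half-width 14.3 divided by 2), not a number given in the lecture.

```python
from statistics import NormalDist


def power_two_sided(d, se, z_crit=2.0):
    """Approximate power: the area under the alternative curve (centered at d)
    that falls in the rejection region |estimate| > z_crit * se."""
    std_normal = NormalDist()
    shift = d / se  # how many standard errors the alternative sits from zero
    upper_area = 1.0 - std_normal.cdf(z_crit - shift)  # rejecting on the high side
    lower_area = std_normal.cdf(-z_crit - shift)       # rejecting on the low side
    return upper_area + lower_area


# Assumed inputs for the oral contraceptive example: d = 5.4, se about 7.15.
print(f"approximate power: {power_two_sided(5.4, 7.15):.3f}")
```

With these assumed inputs the result is roughly 11%, in the neighborhood of the 13% reported from software. The gap is expected, because the exact multiplier for such small samples is somewhat larger than 2 and the standard error here is only backed out from the interval.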

So, let's think about pictorially what governs the power of a study.

Suppose we have designed a study, and looked

at what the power was under that design, and decided we want the power to be larger.

What could the researcher do to make the power larger?

In other words, what could the researcher do to increase this area

as a proportion under the blue curve?

Well, one thing they could do, and I'll just click this slide here to try

and animate it, is they could actually make

the expected difference larger, the alternative hypothesis value bigger.

Make it further from zero, which makes it more likely

to reject if these data come from the blue curve.

Another thing

the researcher could do, if they didn't want to mess with the difference and

make it larger, they already had it about as large as it could be,

and making it any larger would risk missing some differences of

interest, they could actually increase the sample size in each group.

And what effect would that have?

Well that would reduce the uncertainty in the estimates, and

it would make those curves tighter, and so they'd be

easier to distinguish between, and the blue area or the

power would increase, because of the decreased standard error around

our estimates.

The last thing that a researcher could do is make it easier to reject.

Increase the alpha-level of the hypothesis test,

functionally speaking, make it easier to reject.

So here is our picture, with the five percent rejection level.

But if we increase that, what we're going to do is, increase

the region under the black curve where we would reject the null.

And that's going to, in turn increase the proportion,

or chances of doing so under this blue curve.
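
The three levers just described (a larger difference, larger samples, a larger alpha) can be compared side by side. This is a sketch under a normal approximation with two equal-sized groups and a common standard deviation. The standard deviation of 17 and the group sizes are hypothetical values chosen only for illustration; they do not come from the lecture.

```python
from math import sqrt
from statistics import NormalDist


def power(d, sigma, n_per_group, alpha=0.05):
    """Approximate two-sided power for comparing two means with equal
    group sizes and a common standard deviation (normal approximation)."""
    z = NormalDist()
    se = sigma * sqrt(2.0 / n_per_group)  # standard error of the mean difference
    z_crit = z.inv_cdf(1 - alpha / 2)     # rejection cutoff in standard errors
    shift = abs(d) / se
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


# Hypothetical baseline design, then one lever changed at a time.
baseline = power(d=5, sigma=17, n_per_group=50)
bigger_difference = power(d=10, sigma=17, n_per_group=50)
bigger_samples = power(d=5, sigma=17, n_per_group=200)
looser_alpha = power(d=5, sigma=17, n_per_group=50, alpha=0.10)

for label, value in [("baseline", baseline),
                     ("larger difference", bigger_difference),
                     ("larger samples", bigger_samples),
                     ("larger alpha", looser_alpha)]:
    print(f"{label:>18}: {value:.2f}")
```

Each change raises the power above the baseline, which matches the pictures: the curves move apart, get tighter, or the rejection region grows.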

Now practically speaking, what do you think is

acceptable versus not, in the world of research?

It's okay to toy around with the difference.

It's okay

to play around with the sample sizes, but changing

the alpha level is not practical, because, most funding

agencies would not accept a study designed to have

power with a rejection level of greater than five percent.

And consequently, most journals will not be happy with papers that are submitted

under that design, based on a rejection level of greater than five percent.

So here's the deal, what we've done thus far

is just laid out a conceptual idea of power.

Power can be computed after a study is completed.

So for example, this is done sometimes with smaller studies to try

and understand why a non-statistically significant

difference was found, and to see

whether low power is an issue which might open up an opportunity

for someone to build on the research and do a larger study.

But power can only be computed for specific alternative hypotheses.

For example, with population mean differences: this study had X

percent power to detect a difference in population means of Y or greater.

So in order to compute the power of a

study that's been done, one would have to specify

the minimum difference, in the measure of interest between

the two populations that the study was trying to detect.

And so you'll sometimes

see this presented as an excuse for non-statistically significant findings

if the power is low.

So the lack of a statistically significant association between A

and B could be because of low power:

maybe less than 15% power to detect a mean difference of Y or

greater, or a difference in proportions, or whatever measure is used.

It can also be presented to

corroborate a non-statistically significant result.

In other words, to try and understand what the reasons for that may be.

The industry standard for power going forward in designing a

study is 80% or greater, so sometimes if a smaller

study is published with low power another researcher will say,

well, the results look interesting in the small sample study.

I'd like to design a bigger study with power

of 80% or 90% to look at the same comparison and answer the question.

So what we're going to explore in the next couple of lecture sets

is that many times in study design, a required sample size is

computed to achieve a certain preset power level to detect a clinically or

scientifically minimal important difference in means,

proportions, or incidence rates, or ratios.

And again the industry standard for power is 80% or greater.

And we'll see that going

into this, this is a little bit of a game,

because the power of the study to detect

the difference between populations on the appropriate measure of

interest, is a function of the size of the

study samples and the minimal detectable difference of interest.

So when designing a study in advance, researchers need to incorporate these

elements into the design while recognizing practical

considerations such as budget and personnel.

So if the first attempt at designing a study to have power

to find a certain difference yields really large necessary sample

sizes that are out of the funding range, the researcher needs

to go back to the drawing board, perhaps consider making the

minimal detectable difference of interest larger, to increase the power without

having to increase the sample sizes so greatly.
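
This trade-off between the minimal detectable difference and the required sample size can be sketched with the usual normal-approximation formula for comparing two means. It assumes equal group sizes and a common standard deviation. The standard deviation of 17 is a hypothetical value chosen for illustration, not one given in the lecture, and the function is not the Stata routine used in the course.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, sigma, alpha=0.05, power=0.80):
    """Approximate sample size per group needed to detect a mean difference d
    with the given power in a two-sided test at level alpha."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # cutoff for the rejection region
    z_power = z.inv_cdf(power)          # extra distance needed for the target power
    return ceil(2 * ((z_alpha + z_power) * sigma / d) ** 2)


# Hypothetical standard deviation of 17 mmHg; vary the minimal difference of interest.
for d in (2, 5, 10):
    print(f"difference {d:>2} mmHg: "
          f"{n_per_group(d, 17)} per group for 80% power, "
          f"{n_per_group(d, 17, power=0.90)} per group for 90% power")
```

Because the difference enters the formula squared, doubling the minimal detectable difference cuts the required sample size to roughly a quarter, which is why revisiting that difference is often the first move when the numbers exceed the budget.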

So, we'll look at some examples of this in the next two sections.