A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.

A course from Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

From this lesson

Module 4B: Making Group Comparisons: The Hypothesis Testing Approach

Module 4B extends the hypothesis tests for two-population comparisons to "omnibus" tests for comparing means, proportions or incidence rates between more than two populations with one test.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

All right, in this section we're going to look at the issue of sample size

computations for studies comparing two or more means.

We want the study to have a certain level of power

to detect a difference of interest.

So upon completion of this lecture section, you will be able to

describe the relationship between power and sample size with regards to the size

of the minimum detectable difference in means between 2 groups.

Describe the relationship between power and sample size with regards to

the standard deviations of individual values in the groups being compared.

And understand the impact of designing studies to have equal versus

unequal group sizes on the total sample size necessary to achieve a certain power.

So let's look at our example that we used in section eight to motivate

the idea of power.

So suppose we have the data on oral contraceptives and

blood pressure in a sample of women aged 35 to 39.

So recall the data.

We had 29 women, eight of whom were currently using oral contraceptives at

the time of the study, versus 21 who were not.

And we had the sample mean blood pressures and the sample standard deviations.

So we think this research shows evidence, potentially,

of an interesting association.

But of course, the issue was that the small sample sizes led

to a large margin of error and low power to detect a difference of interest.

So we want to build on this and design a bigger study, but

we want this larger study to have ample power to detect an association of interest

should it really exist in the population

of 35 to 39 year old women with regards to oral contraceptive use and blood pressure.

So what we want to do going forward is design a study.

And we want to determine the sample sizes needed to detect

about a five millimeter increase in blood pressure

in oral contraceptive users relative to those women not using oral contraceptives.

And we want to have this with 80% power at our standard rejection level 0.05.

And using that pilot data, we estimate that the standard deviations of blood

pressures are 15.3 millimeters of mercury and 18.2 millimeters of

mercury in the oral contraceptive and non-oral contraceptive users respectively.

So here, we have a desired power in mind, and we want to find the sample size

necessary to achieve a power of 80% to detect a population difference in

blood pressure of five or more millimeters of mercury between the two groups.

So we can find the necessary sample sizes of this study

if we know in advance our alpha level of the test.

Which is easy, it's going to be 0.05.

We also need specific values for the true underlying means in the two groups being

compared, although, really, the important thing is to know

the difference of interest between these two means.

And this usually represents the minimum scientific difference of interest.

We also have to estimate the standard deviations of the blood pressure

measurements in both groups being compared.

And then we have to know our desired level of power.

And to start, we'll use a power of 80%.

So where does this idea of a minimal detectable difference come from,

and how do we estimate the population SDs?

Well, the minimal detectable difference is something that the researcher

would consider the minimum difference to be scientifically interesting.

For example, in this blood pressure study, it could be the case that the average

blood pressure difference in the population between oral contraceptive

users and non-users is on the order of 1 or 2 millimeters of mercury.

But as researchers, we don't see that as being clinically useful or relevant, or

a very strong finding.

So this minimal detectable difference has to come from our knowledge of

what would make for an interesting difference at the population level.

And then where do these estimated population-level SDs come from?

Well, again, researcher knowledge and experience make for

good educated guesses, but hopefully there's another study out there,

maybe a pilot study, for example, that we

are privy to, as in this case, and we can use that as the starting point.

So let's fill in the blanks from the pilot study data we have on blood

pressure and oral contraceptives.

We know we want the alpha level of the test to be 0.05, but

that's not a function of the pilot study.

If we're shooting for a minimal detectable difference of 5

millimeters of mercury, we can estimate means for the two groups that are similar

to the means we saw in our pilot study but have a difference of 5 mm of mercury.

So I'm just going to say 132 for the oral contraceptive users, 127 for

the non-users.

What's really important here is not the value of the individual means, but

the difference we want to see.

Then we have to have the estimates of the standard deviations,

which we have from the study of 15.3 for the oral contraceptive users and

18.2 for those women not using oral contraceptives.

And then we have to know the power that we desire

to detect the difference of at least 5 millimeters of mercury.

And we'll start with 80%, and then we'll look at raising it to 90%.

So given this information, how can necessary sample size be computed?

Well, you can certainly use statistical software such as Stata, SPSS, or SAS.

There are some free online sample size calculators.

If you just do a Google search, you'll get some hits.

A favorite of mine, is one that you can actually download and

put on your computer, and it's pretty intuitive and

user friendly, from Dupont and Plummer, statisticians at Vanderbilt University.

Theoretically, we could also do this by hand.

It's a little cumbersome.

Just for those of you who are interested, in the last lecture section of this set,

Section 13, I'll show you an example of doing it by hand.

But only for those who are interested.

It's optional.

So let's start.

We have to think about our study design.

And for the first approach,

let's assume we want equal numbers of women in each group.

In the original clinical sample, only about a third of the women

were using oral contraceptives, a little less than a third.

So let's suppose that for our study design, instead of taking one random sample from

the clinical population of 35 to 39-year-old women, and

then classifying each woman as currently taking oral contraceptives or

not currently taking oral contraceptives,

our approach would actually require taking two samples of women separately.

We'd first classify the women in the clinic by whether they're on

oral contraceptives or not.

And then take equal numbers of women from each of the two groups.

So if we do this and run the inputs we collected on the previous slide through

statistical software, it turns out we would need 178 women in each of the two groups

for a total of 356 women to have 80% power

to detect a difference as large or larger than 5 millimeters of mercury.
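
As a rough check on the software output, the per-group sample size can be sketched with the usual normal-approximation formula, n = (z for alpha/2 + z for power)^2 × (SD1^2 + SD2^2) / difference^2. This is an assumption about what the software approximates, not necessarily the exact method it uses, and the function name `n_per_group` is made up for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sd1, sd2, alpha=0.05, power=0.80):
    """Per-group n for a two-sided test with equal group sizes (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    # Round up: a fraction of a participant cannot be recruited.
    return ceil((z_alpha + z_power) ** 2 * (sd1 ** 2 + sd2 ** 2) / delta ** 2)


# Inputs from the pilot study: 5 mm Hg difference, SDs of 15.3 and 18.2 mm Hg.
n = n_per_group(delta=5, sd1=15.3, sd2=18.2)
print(n, 2 * n)  # per-group and total sample size: 178 356
```

Under these assumptions the sketch gives 178 women per group and 356 in total, in line with the figures quoted from the software. Because the formula uses alpha/2 in each tail, it corresponds to a two-sided test.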

Even though our hypothesis,

based on the sample results and what we're thinking clinically,

is that oral contraceptive users will have higher blood pressures,

this actually also covers us in the situation where

the oral contraceptive users had lower blood

pressures, by 5 mm Hg or more, than the non-users.

So this is actually covering us in both directions.

So the margin of error, if we had 178 women in each group, just to give us

some sense of what the related precision of our estimated mean difference is,

is plus or minus 3.6 millimeters of mercury.
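
The margin of error quoted here can be reproduced with a short sketch. It assumes the margin is taken as roughly 2 standard errors of the difference in means (a rule-of-thumb multiplier that I am assuming, rather than 1.96) and uses the 178 women per group from the sample size computation; the variable names are illustrative.

```python
from math import sqrt

sd_oc, sd_non = 15.3, 18.2  # pilot-study standard deviations (mm Hg)
n = 178                     # women per group from the 80% power computation

# Standard error of the difference in sample means for two independent groups.
se = sqrt(sd_oc ** 2 / n + sd_non ** 2 / n)

# Assumed rule of thumb: margin of error is about 2 standard errors.
margin = 2 * se
print(round(margin, 1))  # 3.6
```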

Suppose we decided, well, you know what,

clinically speaking, a difference of five is larger than necessary.

We might be interested as researchers if the difference is on the order of only four

millimeters of mercury.

So if we rerun the numbers but make our minimum detectable difference smaller,

what do you think's going to happen to the sample size?

We're making it harder to see in some sense.

Well, as you may suspect, when we do this, the number we'll need in each

group is larger than when our difference was five millimeters of mercury.

And we'll need 278 women in each of the two groups.

So that's 100 more in each group than

when we had the minimal detectable difference of five millimeters of mercury.

Suppose we said, well, let's just play around and

see what the impact is on changing the minimal detectable difference in the other

direction, let's make it larger and therefore easier to see, if you will.

How many women will we need in each group?

And if we run the numbers, we need 124 women in each group.

Substantially fewer.

More than 50 fewer in each group than for

the minimal detectable difference of five millimeters of mercury.

So, playing around with this minimal detectable difference will influence

the numbers we need in each group.

So, if a researcher was writing up a grant proposal, he or

she may include a table like the following.

Usually it is not acceptable to just come up with one computation;

you want to actually vary some of the parameters or inputs and

show what happens to the necessary sample size.

So it might be something like this.

You might make a table where you look at the relationship between necessary sample

size to get 80% power for various detectable differences.

Here I'll do four, five and six millimeters of mercury.

And then, because our standard deviation estimates are just estimates from, in this

case, a small study, there's going to be some variability in those.

So we may play around with situations where the true standard deviation is lower

in each group than what we observed in the sample, and

where it is larger than what we observed from the sample.

And this gives a sort of robust analysis that shows that we've thought

about potential scenarios and are not just wedded to one exact scenario.

And it gives the funding agency a sense of the sample sizes needed for each, and

then they can make a decision about whether the study is robust enough and

how much they're willing to fund.

So when you look at this table,

I want you to look at the impact in two directions.

First, for any given set of standard deviation assumptions, look at what happens

across the table as we increase the minimal detectable difference of interest.

And as we showed before, the larger that becomes, the easier it is to see,

and the necessary numbers in each group decrease.

If you go down each column for a fixed minimal detectable difference,

what you can see is that as the variability estimates

of the blood pressure measurements in each of the two groups increase,

there's more person-to-person variability, which will increase our standard error, and

the necessary sample sizes increase with that.
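
A table of this kind can be sketched in a few lines. The version below uses the normal-approximation formula (an assumption about what the software approximates) and three standard deviation scenarios: the pilot estimates plus one lower and one higher pair. The lower and higher pairs are hypothetical values chosen only for illustration, not the ones shown on the slide, and the helper names are made up.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sd1, sd2, alpha=0.05, power=0.80):
    """Per-group n, equal group sizes, two-sided test (normal approximation)."""
    z = NormalDist().inv_cdf
    return ceil((z(1 - alpha / 2) + z(power)) ** 2 * (sd1 ** 2 + sd2 ** 2) / delta ** 2)


deltas = (4, 5, 6)  # detectable differences in mm Hg
sd_scenarios = [
    (13.0, 16.0),   # hypothetical: less variability than the pilot study
    (15.3, 18.2),   # pilot-study estimates
    (17.0, 20.0),   # hypothetical: more variability than the pilot study
]

# One row per SD scenario, one column per detectable difference.
table = {sds: [n_per_group(d, *sds) for d in deltas] for sds in sd_scenarios}

print("SDs (OC, non-OC)   " + "  ".join(f"diff={d}" for d in deltas))
for sds, row in table.items():
    print(f"{sds!s:<18} " + "  ".join(f"{n:>6}" for n in row))
```

Reading across a row, the required n falls as the detectable difference grows; reading down a column, it rises with the assumed variability. The pilot-estimate row gives 278, 178 and 124 per group.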

So if I were writing to a funding agency, I would present a table like this and

then ask for something like funding for 300 women in each group.

And that would pretty much cover all of the situations.

We might be a little low in this one scenario with

the smallest detectable difference and the greatest variability.

But this table would show that with 300 women in each group,

I could pretty much find any differences with 80% power, at least,

under the scenarios that I've played out here.

Suppose the funding agency reviewed the grant application but

said, we're interested in you doing the study for

90% power and we'd like to see the same computations redone for 90% power.

And we could do that, and if you compare all the values in this table, where we

have higher power because we want to be more sure of rejecting when we should,

you'll see that all the corresponding values for

the combinations of detectable difference and

variability estimates increase relative to what they were with 80% power.

And that makes sense because we're actually putting

a higher onus on our ability to pick up a difference.

And we're going to need more information to get smaller standard errors needed to

increase our likelihood of detecting the difference

under these alternative hypothesis scenarios.
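
One way to see why every cell grows is to look at the multiplier that power contributes in the normal-approximation formula, (z for alpha/2 + z for power)^2. The sketch below is an illustration under that assumed formula, with a made-up function name; it is not the software's exact computation.

```python
from statistics import NormalDist


def power_multiplier(power, alpha=0.05):
    """The (z_alpha/2 + z_power)^2 factor in the normal-approximation sample size formula."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) ** 2


m80 = power_multiplier(0.80)
m90 = power_multiplier(0.90)

# The ratio is the approximate inflation in n when moving from 80% to 90% power,
# holding the detectable difference and standard deviations fixed.
print(round(m80, 2), round(m90, 2), round(m90 / m80, 2))  # 7.85 10.51 1.34
```

So under this approximation, going from 80% to 90% power costs roughly a third more subjects in every scenario.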

But you might say,

well, I would prefer to do the study the way the original pilot study was done.

I would rather sample from my clinical population and

then classify each woman by her current OC usage.

So in other words, instead of taking separate samples

from women who are currently using oral contraceptives and those who aren't, and

assuming the sizes will be equal like we did in the last scenario,

this approach would involve taking one sample and

then classifying the women after they've been selected to be in the study.

So in the original small study,

8 of the 29 women were currently using oral contraceptives.

That was 28% of the sample.

So for purposes of designing this study, let's use 30% and assume

that when we take this overall sample, 30% of the women will

be using oral contraceptives, and the other 70% will not.

And so we will have to design a study that recognizes the uneven

sample sizes we expect.

And this can be easily done with statistical software,

you can add in a piece about the relative frequency

of participants in each of the two groups being compared.

So, if we run the numbers on this, and we wanted to go with our

original goal of detecting a mean difference as large as, or

larger than, five millimeters of mercury in either direction,

we need 119 women in the oral contraceptive group

and 274 in the non-oral contraceptive group, for a total of 393 women.
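
The unequal-allocation computation can be sketched by extending the normal-approximation formula with an allocation ratio, the number of non-users per user. Two assumptions are worth flagging: the formula itself is only an approximation to whatever the software does, and the ratio of 2.3 (roughly 70/30) is my guess at how the allocation was entered, chosen because it reproduces the quoted figures. The function name is illustrative.

```python
from math import ceil
from statistics import NormalDist


def n_unequal(delta, sd1, sd2, ratio, alpha=0.05, power=0.80):
    """Return (n1, n2) with n2 about ratio * n1, two-sided test (normal approximation)."""
    z = NormalDist().inv_cdf
    multiplier = (z(1 - alpha / 2) + z(power)) ** 2
    # Group 2 is `ratio` times larger, so its variance contribution shrinks by that factor.
    n1 = ceil(multiplier * (sd1 ** 2 + sd2 ** 2 / ratio) / delta ** 2)
    n2 = ceil(ratio * n1)
    return n1, n2


# Group 1 = oral contraceptive users, group 2 = non-users.
n_oc, n_non = n_unequal(delta=5, sd1=15.3, sd2=18.2, ratio=2.3)
print(n_oc, n_non, n_oc + n_non)  # 119 274 393
```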

So notice that this total sample size here is larger than

when the study was designed to have equal sample size.

Why is that?

Why do you think that is?

Well, what these computations are doing, given the information necessary,

is really what we did before: solving for the margin of

error, through the standard error, necessary to have the power of interest.

So, if you remember, the standard error for

a difference in means comparing two groups

is the square root of the standard deviation of the values in the first group squared over

the sample size of the first group, plus the standard deviation of the values

in the second group squared over the sample size in the second group.

When one of these samples is smaller than the other,

the rate limiting factor in how big the standard error's going to

be is going to be a function of the smaller sample size.

And so we're going to need more women, or

more subjects total, to overcome the fact that

one of the sample sizes is smaller than the other.

If the sample sizes were equally split,

we would need fewer subjects to get the same standard error.

So that's why when we go for an imbalanced sample size design, we

need more total subjects than if we were to assume equal samples in each group.
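
To make this concrete, the sketch below holds the total at 356 women and compares the standard error of the difference in means under an equal split versus a roughly 30/70 split, using the pilot standard deviations. The function and variable names are illustrative.

```python
from math import sqrt

SD_OC, SD_NON = 15.3, 18.2  # pilot-study standard deviations (mm Hg)


def se_diff(n_oc, n_non):
    """Standard error of the difference in sample means for two independent groups."""
    return sqrt(SD_OC ** 2 / n_oc + SD_NON ** 2 / n_non)


total = 356

# Equal split: 178 women in each group.
se_equal = se_diff(total // 2, total // 2)

# Roughly 30% oral contraceptive users, 70% non-users, same total.
n_oc = round(0.30 * total)  # 107
se_split = se_diff(n_oc, total - n_oc)

print(round(se_equal, 2), round(se_split, 2))  # 1.78 1.88
```

With the same 356 women, the imbalanced split gives a larger standard error, so more total subjects are needed to bring it back down to the level the equal split achieves.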

So, something to think about in study design is whether

you are able to control how the sampling is done: whether you can set up a

situation where you purposely choose equal numbers in each

sample, or you're constrained to do it such that you have to classify

subjects by their group membership after you sample the entire group, and

there may be an imbalance in the numbers in each group.

So in this situation, if we had changed the minimal detectable difference to 4

millimeters of mercury, and this is the situation where we expect 30%

of the subjects to be using oral contraceptives and the remaining 70% not

to, we'd need 186 women in the oral contraceptive group and

428 in the group not using oral contraceptives, for a total of 614.

So this is notably larger, again,

than when we actually did this assuming equal sample sizes.

And of course,

because our minimal detectable difference is smaller in this scenario,

four versus five, we're going to need more subjects than in the situation

where we had a detectable difference of five.

If we actually did this computation for a larger minimal detectable difference, six millimeters of mercury, again assuming that 30% of

the women sampled would be in the oral contraceptive group and

the remainder would be in the non-oral contraceptive group,

we'd need 83 women in the oral contraceptive group and

191 in the non-oral contraceptive group, for a total of 274 women in our study.

And you can create a similar table to what we had done before for

this imbalanced sample size design, showing the number needed in

one of the groups, and then adding a footnote that the number

needed in the other group would be a proportion of the first group.

So you might say, well, this is great.

But suppose we are interested in designing a study to compare means between more than

two groups, but these computations only allow us to do two groups at a time.

So, for example, suppose you wish to compare the average length of stay for

preventable diabetes hospitalizations across three insurance groups,

government, private and uninsured, for diabetes patients in the State of

Maryland in 2013, and you plan to sample equal numbers from each of these groups.

And based on the data from another state that's done a similar study,

you have the following estimates, and you want to do this study to see how, in

part, the results in Maryland compare to other states.

So, you have an estimate of the mean length of stay for

diabetes patients on government insurance of 4.2 days.

For private insurance, your estimated mean length of stay is 3.1 days.

And in the uninsured group, it's 2.5 days.

And the estimated standard deviations of

the individual length of stay values in these three groups are similar, at four days for

each, so we can start with that.

So, how can a study be designed with 80% power to detect

differences between the three groups?

Well, one possibility is to do the sample size computations for

each unique two groups comparison and

then take the maximum number necessary across the three computations.

In that way, you'll be covered with a minimum of 80% power for

all three combinations.

So, let me just show you this based on the software.

The sample size needed to have 80% power in the government versus private comparison

is 208 in each group.

For the government versus uninsured comparison,

this is the greatest anticipated difference, so it's the easiest to see.

We only need 87 subjects in each group.

But for the private versus no insurance comparison, the difference in anticipated mean

length of stay is much smaller than in the other two comparisons.

And so

this is going to be our driving factor in the overall sample size computation. To see

the difference between those two groups based on the mean estimates we have,

we would actually need 698 persons in each group,

substantially larger than for the other two comparisons.

But if we're really interested in being able to detect differences,

if they exist at this magnitude, with 80% power for

each of these three comparisons, then the conservative thing to do would

be to actually do a study where we take 698 people from each of the 3 groups.

Now that certainly means that for some of the comparisons,

the first two, our power will be greater than 80%.

Because we have more than we need to see the difference for those at 80%, but

we'll be covered at a minimum of 80% for each of the three group comparisons.
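
The pairwise strategy can be sketched as follows: compute the per-group sample size for each of the three two-group comparisons and take the maximum. As before, this uses the normal-approximation formula as an assumed stand-in for the software, and the names are illustrative.

```python
from itertools import combinations
from math import ceil
from statistics import NormalDist

# Estimated mean length of stay (days) by insurance group, and the common SD.
means = {"government": 4.2, "private": 3.1, "uninsured": 2.5}
sd = 4.0


def n_per_group(delta, sd1, sd2, alpha=0.05, power=0.80):
    """Per-group n, equal group sizes, two-sided test (normal approximation)."""
    z = NormalDist().inv_cdf
    return ceil((z(1 - alpha / 2) + z(power)) ** 2 * (sd1 ** 2 + sd2 ** 2) / delta ** 2)


# One computation per unique pair of groups.
pairwise = {
    (a, b): n_per_group(abs(means[a] - means[b]), sd, sd)
    for a, b in combinations(means, 2)
}
for pair, n in pairwise.items():
    print(pair, n)

# The smallest anticipated difference drives the overall design.
n_needed = max(pairwise.values())
print(n_needed)  # 698
```

Under these assumptions the three comparisons need 208, 87 and 698 per group, so sampling 698 from each group covers all three at 80% power or better.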

So if that was important to us,

this is what we have to do.

So in summary, when designing a study to compare means from two or

more populations, a researcher must have some estimate of the mean and

standard deviation of the values in each population.

The sample size necessary to achieve a desired power to detect a minimum

detectable difference is a function of that difference,

the variability of the individual values in each group (the standard deviations), and

the desired power.

As you can see, I've just laid out an example of how

to present the sample size computation portion of a methods section for a grant.

It's prudent to actually show the necessary computations under a couple

different scenarios both for the anticipated minimal detectable difference

of interest and for the estimated standard deviations in both groups being compared.

And then you can use that to come up with an idea for

a sample size in each of the two groups being compared, or three or

more groups, that would cover all of the scenarios played out in the table.

And the funding agency can make a decision about whether it wants to

cover all the scenarios you've suggested or only some of them based on what it

thinks is important in terms of minimum detectable difference.

In the next section, we'll show how to do the same thing,

but for comparing binary outcomes or proportions between two or

more populations.

So what I hope you take home from this lecture section, as well as the next one, is

the role that the detectable difference of interest,

the variability of the data for continuous data, and the desired power have in

determining the necessary sample size for a study with the desired level of power,

and what each of these can do to either increase or

decrease the necessary sample size.