A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.

A course from Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

238 ratings

From this lesson

Module 3B: Sampling Variability and Confidence Intervals

The concepts from the previous module (3A) will be extended to create 95% CIs for group comparison measures (mean differences, risk differences, etc.) based on the results from a single study.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

In this set of lectures, we're going

to look at estimating confidence intervals that

compare means of continuous outcomes between two

populations using samples from the two populations.

So in this set of lectures, you will learn how to estimate and interpret 95%

confidence intervals for a mean difference between

two populations under two types of study designs.

One called paired, when the two samples

drawn from the populations under study are linked systematically

and hence the populations under study are linked systematically.

And then unpaired, when the two samples

are drawn from two independent unlinked populations.

And we'll give examples of both of these.

The general idea is exactly the same. The only thing that's

different in terms of computing confidence intervals is how we estimate

the standard error of the difference in the sample means.

So let's look at it. First, we'll look at

a couple of paired comparisons to get the ball rolling here.

So this is an interesting study about clinical reproducibility.

This was back in the eighties, when AIDS first became an

issue in medicine and public health and they were trying to figure out

ways to diagnose patients. And so this study was done where two

different physicians assessed the palpable lymph nodes in 65 randomly

selected male sexual contacts of men with AIDS or an AIDS-related condition.

And so, what they had here were 65 men who met the criteria,

ostensibly a sample from the population of such men and they were shown or

seen by each doctor.

And each doctor tracked the number of palpable lymph

nodes that he or she could find on the patient.

And so you can see here that the average number of lymph nodes that Doctor 1 found

on these 65 patients was nearly eight, 7.91 compared to 5.16 for Doctor 2.

So if we compare Doctor 1 to Doctor 2, we can arbitrarily make the difference

in either direction.

But for here, we chose to make it Doctor 2 minus Doctor 1.

Doctor 2 found on average 2.75 fewer nodes than Doctor 1.

So let's look at this data structure and figure out what this standard deviation

of 2.83 for the differences refers to. So what we had in the study is Doctor

1 and Doctor 2 and a bunch of patients.

So when each patient came to the clinic, he was seen separately by both doctors.

And Doctor 1 would count the number of palpable nodes

that he or she found on the patient.

So for this first patient, maybe Doctor 1 found seven nodes,

and Doctor 2, going against the trend in the means we saw, found ten.

So if we took the difference, between Doctor 2

and Doctor 1 for this patient, three more nodes

were discovered by Doctor 2 than Doctor 1 on the same patient.

Next patient comes in and perhaps Doctor 1

finds seven nodes and Doctor 2 finds five nodes.

And the difference in the number of nodes

found, Doctor 2 minus Doctor 1, is negative two.

And when all is said and done, even though there are two samples

of measurements, those done by Doctor 1 and those done by Doctor

2, since they're all done on the same patients,

we can reduce those two columns of data to a single sample

of differences, Doctor 2 minus Doctor 1, one for each patient.
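The reduction from two columns to one column of differences can be sketched in a few lines of Python. This is only an illustration: the lists hold the two example patients described above, not the study's 65 patients, and the variable names are made up for this sketch.

```python
# Illustrative counts for the two example patients described above
# (not the full study data, which is not reproduced in the lecture).
doctor1 = [7, 7]
doctor2 = [10, 5]

# Pair each patient's two counts and take Doctor 2 minus Doctor 1.
differences = [d2 - d1 for d1, d2 in zip(doctor1, doctor2)]
print(differences)  # [3, -2]
```

Once the data are in this form, the paired problem is a one-sample problem on the differences.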

And so if we took the mean of

these differences, we'd get the negative 2.75 we saw before.

If we took the standard deviation of these

65 differences, we'd get the 2.83 that was reported before.

So now we've got a simple situation that we know how to handle.

We've got one sample of data where each data point is the difference.

And we can estimate a confidence interval for the mean difference in the population

of all men who are sexual partners of men with AIDS or AIDS-related conditions,

were they seen by these two doctors. And so it's business as usual.

We'll take the observed mean difference of negative

2.75 and add and subtract two standard errors.

The standard error would be found by taking the standard deviation of the

individual differences for the 65 patients and dividing it by the square root of 65.

And if you do this out, check my math, you get a confidence interval of negative

3.45 nodes to negative 2.05 nodes. So

Doctor 2, on average, found a lower number of nodes than Doctor 1.

And even after accounting for the uncertainty in this estimate, because they only

looked at 65 men from an entire population of such men,

we get an estimate for the mean difference in the number of nodes

they would've gotten had they looked at all such men in the population.

And it shows systematically lower

measurements on average by Doctor 2, even after accounting for the uncertainty.
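For anyone who wants to check the math, here is a minimal Python sketch of the calculation just described. It uses only the summary statistics quoted in the lecture (mean difference, standard deviation of the differences, number of pairs) and the "two standard errors" rule; the variable names are illustrative.

```python
import math

# Summary statistics quoted in the lecture (Doctor 2 minus Doctor 1).
n = 65             # number of patients, i.e. number of paired differences
mean_diff = -2.75  # mean of the 65 differences
sd_diff = 2.83     # standard deviation of the 65 differences

# Standard error of the mean difference: SD of differences over sqrt(n).
se = sd_diff / math.sqrt(n)

# Approximate 95% CI: estimate plus or minus two standard errors.
lower = mean_diff - 2 * se
upper = mean_diff + 2 * se
print(f"SE = {se:.3f}, 95% CI: {lower:.2f} to {upper:.2f} nodes")
# SE = 0.351, 95% CI: -3.45 to -2.05 nodes
```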

So how can we interpret this confidence interval?

Just to reiterate, had all such men in the population

from which the 65 were taken been examined by these two physicians,

the average difference in the number of lymph nodes discovered by

the two physicians would be between negative 3.45 and negative 2.05.

And notice all possibilities for this true mean difference show

a lower number on average, discovered by Doctor 2 compared to Doctor 1.

They're all negative and zero is not included in the interval.

Just FYI, the direction I chose to present the comparison with is arbitrary.

If we had instead done the comparison in the direction of Doctor

1 minus Doctor 2, then our observed difference would be the opposite.

Instead of negative 2.75 it would be 2.75 because, whereas before

we were saying Doctor 2 found 2.75 fewer nodes on average than Doctor 1,

that's the same as saying Doctor 1 found 2.75 more nodes on average than Doctor 2.

And the confidence interval would have the same values, just with the opposite signs.

So, think about the

finding here.

What we found is a difference that couldn't be explained by sampling

variability, given the fact that zero was not included in the interval.

All possibilities showed a mean difference,

a real mean difference between Doctors 2 and 1.

So this result is not about these two doctors per se. What

it shows is that this diagnostic approach is not reproducible

across different examiners.

We only looked at two, and they showed a difference in their results.

And it was outside the bounds of what we'd expect from random sampling variability.

So this had implications for coming up

with better diagnostic procedures for diagnosing AIDS.

Here's

another example, a small study done about cereal and cholesterol.

Here 14 males with high cholesterol levels were given oat

bran cereal as part of their diet for two weeks,

and then corn flakes cereal as part of their diet for two weeks.

And here are the results.

At the end of the two week period, the average cholesterol level in these

14 men after having two weeks of corn flakes was 171.2 milligrams per deciliter.

At the end of the two weeks on oat bran, the resulting

average cholesterol level was 157.8 milligrams per deciliter.

And this difference, if I did it in the direction of corn flakes minus oat bran,

was 13.4 milligrams per deciliter. But there was a lot of variation

in the individual differences, or changes in cholesterol level, across the 14 men.

So again, this is a paired study design. We've got

the same 14 men,

selected from a population of patients with high cholesterol levels.

And each man is given two weeks on each of the cereal diets, and

his cholesterol level is measured at the end of the two-week period on each diet.

So, for example, male number one, perhaps his cholesterol level

at the end of the corn flakes period was 165.

At the end of

the oat bran period it was 143. So his difference,

corn flakes period minus oat bran period, was 22.

And hopefully you get the picture, but what

we'd end up with ultimately to analyze is 14 differences.

And we could get a mean difference, which

we showed, and a standard deviation of these differences.

So the mean difference was that 13.4 milligrams per deciliter and the standard

deviation in the 14 individual differences was 15.5.

So what we do here is, it's business as usual.

The only thing we have to pay attention to, and

we really wouldn't because hopefully we'd be using the computer,

is that because this is a small sample, there are only 14 men involved,

we have to look to the t

distribution to assess how many standard errors we

need to add and subtract to get 95% coverage.

So if we did this, we'd take the observed 13.4 plus or minus that t-based multiplier

times our estimated standard error, which would be that

15.5, the individual variation, over the square root of 14.

If I did the math correctly, you get a confidence interval of 4.5

to 22.3, indicating that on average, were we to give corn flakes for two weeks

to all men with high cholesterol levels and then give them oat bran for two weeks,

the average difference in

end-of-two-week cholesterol levels would be

such that the corn flakes period would be

higher, anywhere from 4.5 milligrams per deciliter

to 22.3 milligrams per deciliter higher on average.

So, there's no zero in this interval.

So even after accounting for the uncertainty, in our small sample

estimates we see evidence of a real difference at the average level.

But there's some uncertainty in our understanding of

that difference which is why this interval is wide.
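The cholesterol interval can be checked with a short Python sketch. One assumption is labeled in the code: the multiplier 2.160 is the standard t-table value for 95% coverage with 13 degrees of freedom (14 pairs minus 1). The lecture does not state this number out loud, and the helper name `paired_ci` is made up for this example.

```python
import math

def paired_ci(mean_diff, sd_diff, n_pairs, multiplier):
    """CI for a paired mean difference: estimate +/- multiplier * SE."""
    se = sd_diff / math.sqrt(n_pairs)
    margin = multiplier * se
    return mean_diff - margin, mean_diff + margin

# Assumption: standard t-table value for 95% coverage with 13 degrees
# of freedom (14 pairs - 1); not quoted explicitly in the lecture.
T_MULTIPLIER_DF13 = 2.160

# Summary statistics from the lecture (corn flakes minus oat bran, mg/dL).
lower, upper = paired_ci(13.4, 15.5, 14, T_MULTIPLIER_DF13)
print(f"{lower:.1f} to {upper:.1f} mg/dL")  # 4.5 to 22.3 mg/dL
```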

Let's look at another classic kind of study that is usually done as a pilot.

It doesn't hold up well,

not because of the statistics but because of

other reasons that we'll discuss in a minute.

This is a before-versus-after study.

Sometimes this is done to get some data on the table, so to speak,

for doing a larger, better designed study.

This study looked at ten non-pregnant, pre-menopausal women

16-49 years old who were beginning a regimen of oral contraceptive use.

And they had their blood pressures measured prior to

starting oral contraceptive use and after three months of consistent use.

The goal of this small study was to see what if any changes

in average blood pressure were associated with oral contraceptive use in such women.

The data on the following slide show

the resulting pre- and post-oral contraceptive use

systolic blood pressure measurements for the ten women in the study.

So here's what the data looks like.

So, for example, this first woman, when she came in, before

she started using oral contraceptives, her systolic blood pressure was measured.

It was

115 millimeters of mercury.

After three months of oral contraceptive use, it

was measured again at 128 millimeters of mercury.

And so this woman experienced an increase of 13 millimeters of

mercury after her three months on oral contraceptives compared to before.

So again, this data is paired because each woman is

her own comparison.

Before and after using the intervention of oral contraceptives.

And so we could reduce this data to a sample of ten differences.

The after blood pressure measurements minus the before for each woman.

And if we did so and took the mean of these differences,

you can see that most of the differences are positive.

So the average here was 4.8 millimeters of mercury, an average increase

of 4.8 millimeters of mercury after being on oral contraceptives.

And the standard deviation of these

individual differences was 4.6 millimeters of mercury.

So if we actually did the work and created a confidence

interval, using a computer now, the idea is exactly the same.

We take our observed mean difference of 4.8 and add and subtract not two,

but slightly more than two standard errors because we

have a smaller sample. We get a confidence interval

from 1.5 millimeters of mercury to 8.1 millimeters of mercury.

So again, in this situation we do not include zero in our interval.

And it suggests, that even though we have a small sample, that

the increase we saw was not a fluke of random sampling error.

It's consistent with a real increase associated with oral contraceptive use

in the population of women that were sampled for this study.
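To see what "slightly more than two standard errors" does in a sample this small, the sketch below compares the plain two-standard-error interval with the t-based one. The value 2.262 is the standard t-table multiplier for 9 degrees of freedom (10 women minus 1), which is an assumption added here, since the lecture only says "slightly more than two".

```python
import math

# Summary statistics from the lecture (after minus before, mmHg).
mean_diff, sd_diff, n = 4.8, 4.6, 10
se = sd_diff / math.sqrt(n)

# Assumption: 2.262 is the t-table value for 95% coverage with 9 df.
multipliers = {"2 standard errors": 2.0, "t with 9 df": 2.262}

intervals = {}
for label, mult in multipliers.items():
    intervals[label] = (mean_diff - mult * se, mean_diff + mult * se)
    lo, hi = intervals[label]
    print(f"{label}: {lo:.1f} to {hi:.1f} mmHg")
# 2 standard errors: 1.9 to 7.7 mmHg
# t with 9 df: 1.5 to 8.1 mmHg
```

The t-based interval is wider, and it matches the 1.5 to 8.1 reported in the lecture.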

Now from a substantive perspective, it might be harder to interpret

because perhaps a clinician would say an average shift of 1.5 millimeters

of mercury isn't that interesting.

However on the other hand, if the average were to go up by

over 8 millimeters of mercury, that would be something to really be concerned about.

So from a substantive perspective, there's still maybe some ambiguity about how

much association there is between oral contraceptive use and blood pressure.

But from a statistical perspective, there's no ambiguity

in the sense that we've ruled out the possibility that there's no association.

Because zero is not in that interval.

Are there any interpretation issues?

Suppose somebody said, well this clearly

proves that oral contraceptives increase blood pressure.

Is that a conclusion that can be made with this type of study?

Well let's think about it.

We only looked at women who were

on oral contraceptives, and there's a potential confounder

here, which is time.

We don't know what else happened in that time period

that could have affected the change in blood pressure.

For example, maybe the weather got cold

as the seasons changed

and people stopped exercising outside.

And they were getting less exercise and their blood pressure went up.

Maybe there was what we might call temporal confounding,

that the exposure here was associated with other things associated with the

time gap in which the measurements were taken.

So we can't necessarily conclude strongly that oral contraceptives are the driver.

What would have made this a stronger study?

Well, if they had had a control group of similar women, maybe who were randomized

to not go on oral contraceptives. Now that may be tricky to

operationalize, because women generally go on oral contraceptives

for specific reasons.

And it may be unethical to deny oral contraceptives to those who need them

for the purpose of the study.

But it's just something you think about in the limited conclusions we can make here.

So let's segue into the unpaired situation.

And here it's exactly the same idea.

We're going to look at the mean differences we see

in our data and try to put uncertainty bounds on them.

But the data structure

is slightly different and that's going to affect how

we estimate our uncertainty in our difference in means,

how we estimate the standard error.

So there's this Heritage Health data. Remember we looked at

the association between the age at first claim and length of stay in the hospital.

And we saw that those who were greater than

40 years old when they made their first claim in

the plan had an average length of stay of 4.9 days versus 2.7 among those who were younger.

And about a quarter of the sample, 3,770 of the over 12,000,

were greater than 40 when they had their first claim versus the

remaining three-quarters who were less than or equal to 40.

And

the question you might have is this. We see an increase in

length of stay for those who were older when they submitted their first claim.

But does that hold up after we account

for the uncertainty in these sample mean estimates.

Well let's just think about the data structure we have here.

Here we have two groups that, for the most part, are not systematically linked.

One is much larger than the other, which is the first

clue that these things aren't paired.

Because in order to be paired, you'd have to have the same sample size.

That's a necessary but not sufficient criterion for pairing.

But in this case we have very different sample

sizes. And ostensibly, for the people who are less than or equal to 40,

there may be a few family members that slip in here,

but there's no systematic connection between each person in this group

and a person in the group greater than 40.

So these functionally are two independent groups that are not connected.

But we're still going to quantify the association the same way.

Take the mean difference between the two groups.

And we

see in this data that those who are over 40, have an average length of

stay 2.2 days greater than those who are less than or are equal to 40.

Now we just need to be able to put uncertainty bounds

on this and get a confidence interval.

So we need to estimate the standard error of this mean difference.

And the data is not paired, so we can't take these two

samples of information and reduce them to a single sample of differences.

How are we going to estimate the standard error?

Well, it turns out there is a pretty useful formula that will help us do it.

We alluded to this in the first

section, but here's the estimated version that we can get from our data.

The standard error, or uncertainty, associated with the difference

in these means is an additive function of the standard error of each mean.

I'll write out the formula and then we'll play with it to show that.

So, what we do is we take the standard deviation

of the individual values in the first group, square it,

divide by the sample size in the first group, and add

that to the squared standard deviation of the values in the second group

divided by the number of values in the second group. Then we take the square root of that sum.

Another way to write this, just for fun, is as the square root of

s1 over the square root of n1, squared,

plus s2 over the square root of

n2, squared. This can also be represented

as the square root of the standard error of

the sample mean for the first group

squared plus the standard error of the sample mean of the second group squared.

So, in other words, the standard error of the difference in these sample

means is an additive function of the standard error of each sample mean.
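Written out, the formula described in words above looks like the following. The notation is an assumption for this sketch: s_1, s_2 are the sample standard deviations and n_1, n_2 the sample sizes of the two groups. Note that it is the squared standard errors that add, with a square root taken at the end.

```latex
\widehat{SE}(\bar{x}_1 - \bar{x}_2)
  = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}
  = \sqrt{\left(\frac{s_1}{\sqrt{n_1}}\right)^{2} + \left(\frac{s_2}{\sqrt{n_2}}\right)^{2}}
  = \sqrt{\widehat{SE}(\bar{x}_1)^{2} + \widehat{SE}(\bar{x}_2)^{2}}
```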

Something to think about that we can talk about in bulletin board

and live talk, is, why, if we're taking a difference in sample means,

Why is the standard error additive?

So if we were to do this with these data,

and put this in.

What we get is our estimated standard error.

So I'll write it for x bar greater than 40,

sorry for that awful greater than sign, minus x bar less than or

equal to 40. It equals the square root of the standard deviation of

the values in the greater than 40 group, squared, divided

by the number of persons in that group,

plus the same thing, the standard deviation squared of

the values in the less than or equal to

40 group divided by the number of persons in that group.

And if you do this out, and do the math, it's equal to about 0.075 days.

So it's small.

Less than a tenth of a day and it's small mostly because our sample

sizes are so large.

So if we go ahead and put this in

and do a confidence interval, we take that observed difference of 2.2 days.

This is again comparing greater than 40 to less than or equal to 40.

We could do it in the opposite direction,

and the results would just be the opposite sign.

Then we add and subtract 2 times 0.075.

And if you do this out, we have a confidence interval that

goes from 2.05 to 2.35 days. So, we observe an average

length of stay of 2.2 days greater for those who had their first claim after 40.

Compared to those who had their first claim at 40 or less.

But that was an estimate based on an admittedly large sample.

Now we've added in the uncertainty, and we

have a pretty tight confidence interval that

shows that there is a real difference on average at the

population level, and it's on the order of more than two days.
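Here is a minimal Python sketch of the unpaired calculation. The function names are made up for this illustration. The lecture does not quote the two group standard deviations, so `se_difference` is shown for completeness but not called with real data; the interval is computed from the reported standard error of 0.075 days.

```python
import math

def se_difference(s1, n1, s2, n2):
    """Estimated SE of a difference in means from two independent samples."""
    return math.sqrt(s1**2 / n1 + s2**2 / n2)

def ci_from_se(diff, se, multiplier=2.0):
    """CI as estimate +/- multiplier * SE (multiplier of 2 for large samples)."""
    return diff - multiplier * se, diff + multiplier * se

# Values reported in the lecture: mean difference in length of stay
# (over 40 minus 40 or under) and its estimated standard error, in days.
lower, upper = ci_from_se(2.2, 0.075)
print(f"{lower:.2f} to {upper:.2f} days")  # 2.05 to 2.35 days
```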

Here's another example we looked at.

Menu labeling and calorie intake. You may recall

this study was a randomized study with three different groups.

Everybody was given a restaurant meal,

and they were allowed to order from a menu.

But the first group was randomized to receive a

menu without any information on it, calorie labels or otherwise.

The second group was given a menu that had the calorie labels of

all the food items.

And the third group was given a menu with calorie labels

and a label stating

the recommended daily intake for an average adult.

So this group was called calorie labels plus information.

And I'm just showing you that table we looked at before from

the article, and I'm pulling some

information from additional tables in the article.

And so

I'm going to focus on the total calories consumed during and after

the meal. For the no calorie labels group, that

was an average of 1,630 calories consumed among the 95 people in that group.

There was a lot of variation in the

amount of calories consumed between these 95 people.

810 calories was the standard deviation.

In the group that got the calorie labels and nothing

else, there was a very similar amount consumed on average.

And the group that got calorie labels

plus info, the average amount consumed was 1,380.

Again with a fair amount of variation in the individual calories consumed.

There were 106 people

randomized to this group.

So on this table here, I'm just going to show you the results

that we could've done by hand. We have all the information to do it.

And if you're anxious to try out these methods when I

get to them, you can certainly verify the results I give you here.

But more interesting, perhaps, is to interpret the results.

So here are the resulting mean differences and 95% confidence intervals.

If you want to verify my math, you can go ahead and do so.

You have all the summary statistics you need to do this.

But let's focus

on the results and the interpretation.

So we've got three groups here.

Obviously our method only compares two groups at a time.

But we can construct the difference and confidence

interval for each two-way combination of the three groups.

So if we looked at the no calorie labels

group less the calorie label, that difference was five calories.

Only five more calories consumed on average

in the no calorie labels group compared

to the group that had calorie labels.

95% confidence interval on that is negative 216.7 to 226.7.

Lot of variation, but centered around zero.

Basically indicative of no evidence of an association,

good or bad, between the no-calorie label group relative to the calorie

label group.

The end result is statistically equivalent, because

we've got 0 in that confidence interval.

If we were to compare the no calorie label group

to the calorie label plus information group, it's a different story.

That difference that we observed in the study was 250 calories more on

average consumed by the group with no labels compared to the group that got

the labels plus info. And then we do the confidence interval.

And you can see this confidence interval shows that, were we

to give this study, if you will, to the population,

randomizing all such people from the population from which

the participants were selected to one of these two groups,

we would see a consistent increase in calories consumed in

the group that didn't get calorie labels compared to those who got labels plus info.

There's some uncertainty, but this confidence interval doesn't include zero.

So we can say that, in this study, they found a positive association between not

getting calorie labels and information and increased consumption of anywhere

from 45.3 more calories on average to over 400.
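Because these intervals are built as estimate plus or minus a margin, they are symmetric around the estimate. That gives a quick consistency check and a way to recover the upper limit that is only described as "over 400". The sketch below rests on that symmetry assumption; the recovered upper limit is implied by the quoted numbers, not stated in the lecture, and the function names are illustrative.

```python
def upper_from_lower(estimate, lower):
    """Recover the upper limit, assuming the CI is symmetric about the estimate."""
    margin = estimate - lower
    return estimate + margin

def contains_zero(lower, upper):
    """True when zero is among the plausible values for the true difference."""
    return lower <= 0 <= upper

# No labels minus labels: reported difference 5, CI -216.7 to 226.7.
midpoint = (-216.7 + 226.7) / 2
print(round(midpoint, 1), contains_zero(-216.7, 226.7))  # 5.0 True

# No labels minus labels-plus-info: reported difference 250, lower limit 45.3.
upper = upper_from_lower(250, 45.3)
print(round(upper, 1), contains_zero(45.3, upper))  # 454.7 False
```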

So from a substantive perspective it's a little harder to digest.

Certainly in the world of public health, we'd be

more interested the larger this mean difference was in reality.

But there's too much uncertainty to home in on it with a lot of precision.

But on the whole, we can say that having calorie labels plus info is effective

in reducing average calorie intake.

If we compare the group that got calorie

labels to the group that got calorie labels plus info,

the difference here is 245 calories, almost exactly the same as the previous

one, because the no calorie labels and calorie labels groups were so similar.

And this confidence interval is similarly wide and doesn't include zero.

So, what can we deduce from this?

Well, from this study we can pretty much deduce that calorie labels alone have no

impact on calorie consumption in the restaurant-going population.

However, if the calorie labels are embedded with information about

how much an average adult should consume in a day,

that seems to have a strong impact

on reducing the average amount of calories consumed.

So let's just talk about unpaired studies and results for a minute.

This last example we looked at is from a randomized trial.

As such we can sort of cleanly conclude that

the resulting differences are because

of the investigator-allocated intervention,

the differences we saw between the no-calorie label

group and the group that got calorie labels with info.

Because people were randomized to those

groups, there shouldn't be other factors associated with those

groups that could explain that difference in calories consumed.

And conversely, the fact that we show

no difference between the no calorie label group

and the calorie label alone group is indicative

of the calorie labels alone having no effect.

Because subjects were randomized to those groups,

it's not likely explained by systematic differences between

those groups besides the no label versus labels issue.

But in non-randomized comparisons, like the one we did with the Heritage Health plan,

and we'll get into this more with some more examples in the exercises and homeworks,

the interpretations will have to be done with the knowledge that other

factors may be the reason for

an association or difference, or no association.

At this

point, that's the best we'll be able to do.

In the second term, we'll show how to re-estimate the

differences accounting for other factors that could influence that difference.

One more thing, and I don't want to highlight this any more than I did when

I pointed this out initially when we were dealing with single sample means.

This is a computer issue; this is not one to worry about computationally.

But for smaller samples, slight corrections need

to be made to the number of

standard errors added and subtracted to get

95% coverage with the, unpaired comparisons as well.

And the drill is something like this.

For smaller samples, which we can think of as when the

sum of the sample sizes is less than 60, you would need to go to a t-distribution

with n1 plus n2 minus two degrees of freedom

and pick out the value that gives 95% coverage.

But this is something that any computer package will handle.
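As a rough illustration of what a computer package does behind the scenes, the sketch below finds the t multiplier numerically using only the Python standard library (Simpson's rule for the t cumulative distribution, then bisection). The function names are made up for this example, and the "sum of sample sizes less than 60" cutoff is the lecture's rule of thumb; in practice a statistics package would be used instead.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0: 0.5 plus Simpson's rule on [0, x] (steps is even)."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_multiplier(df, coverage=0.95):
    """Number of standard errors giving the requested two-sided coverage."""
    target = 0.5 + coverage / 2
    lo, hi = 0.0, 50.0
    for _ in range(60):  # bisection on the CDF
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def multiplier_for_unpaired(n1, n2):
    """Lecture's rule of thumb: t with n1 + n2 - 2 df when n1 + n2 < 60, else 2."""
    if n1 + n2 < 60:
        return t_multiplier(n1 + n2 - 2)
    return 2.0

print(round(multiplier_for_unpaired(10, 12), 2))   # small samples: above 2
print(multiplier_for_unpaired(95, 106))            # large samples: 2.0
```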

The important thing is the general idea,

that we have to add and subtract uncertainty from our estimate to

get an interval that reflects our uncertainty,

and how we interpret that interval.

So, in summary,

95% confidence

intervals can relatively easily be estimated

for mean differences between two populations

for both paired and unpaired study designs.

The resulting 95% confidence intervals are interpretable

as a range of plausible values for the

true difference in population means for the populations from

which the samples were taken.

The confidence intervals allow for one to

ascertain whether there is a real, non-zero difference

between the populations being compared after accounting

for sampling variability in our sample mean estimates.

As with everything in research, the statistical results

have to be translated into scientific or substantive terms.

And this includes considering aspects of the study design.

So we'll continue to talk about that.

But just one final note: the

operational difference between these two approaches is very minimal.

In either case, we're looking at a mean difference between the two groups,

the mean of one group minus the other, plus or minus

two

estimated standard errors of this mean difference.

It's only how we compute that standard error that differs

between the study designs. For paired study designs,

we just take the standard deviation of the differences across the pairs

and divide by the square root of the number of pairs.

For the unpaired study design we have to be a little fancier.

We have to use the information from each sample

separately to combine it into the standard error estimate.