A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.

A course from Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

238 ratings

From this lesson

Module 4A: Making Group Comparisons: The Hypothesis Testing Approach

Module 4A shows a complementary approach to confidence intervals when comparing a summary measure between two populations via two samples: statistical hypothesis testing. This module will cover some of the most used statistical tests, including the t-test for means, the chi-squared test for proportions, and the log-rank test for time-to-event outcomes.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

So in this section, we're going to look at the

hypothesis testing approach to comparing means between two unpaired populations,

or two independent populations, using data

from two independent samples from these populations.

So in this lecture set, we will learn how to estimate and interpret a

p-value or hypothesis test of a mean

difference between two populations for the unpaired or

two independent groups study design.

And the method for getting this p-value is called

the unpaired t-test or sometimes called the two-sample t-test.

And even though the word unpaired is not

mentioned in that second name, the two-sample t-test,

the unpaired is implied in that title.

So the unpaired part of the name unpaired t-test

comes from the study design, where we're comparing

independent samples from independent populations that are not paired.

And again, the t-test name comes from the fact that sometimes the sampling distribution

of the mean differences across multiple

random studies is described by a t-distribution.

So to kick this off, let's look at some familiar examples.

And we'll look at first the hospital length

of stay data by age of first claim from the

Heritage Health data that we've looked at in previous sections.

And remember this compares the distribution of the length of stay in

2011 for patients whose first diagnosis was made when they were over 40.

Compared to patients whose first diagnosis was made when they were 40 or younger.

And we've already seen that the mean difference in

hospital length of stay for the older age of diagnosis

compared to the younger age of diagnosis was 2.2 days,

with a confidence interval of 2.05 days to 2.35 days.

So, those who were older at their first diagnosis had higher mean length of stay.

And that result ruled out zero as a possibility.

All possibilities

for the true mean difference pointed to a higher number of

days on average for those who were older at first diagnosis.

So let's take the hypothesis testing approach now to get

a p-value to go along with what we've done thus far.

This is going to look almost identical to what we did with the paired situation.

The only difference is how we estimate the standard error of the mean difference.

And we've already covered that really.

So, hopefully this will look somewhat familiar conceptually.

What we're going to do is specify our two competing hypotheses,

which again are for comparing distributions of continuous data between two populations.

The null is that the mean

of what we're measuring for the two populations is

the same, versus the alternative that the means are different.

And of course this can be rephrased

or represented in terms of the mean difference.

Which is convenient because we estimate

the mean difference to quantify the association.

So the null hypothesis, expressed in terms of the

mean difference, is that the difference in mean hospital length

of stay for the older-than-40 group compared

to the 40-or-younger group is 0.

Versus the alternative that this difference is not 0.
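
Written out symbolically, the two competing hypotheses look like the following. The notation is an assumption added here for illustration (the lecture states the hypotheses only in words): the two symbols stand for the population mean lengths of stay for those first diagnosed when over 40 and when 40 or younger.

```latex
H_0:\ \mu_{>40} - \mu_{\le 40} = 0
\qquad \text{versus} \qquad
H_A:\ \mu_{>40} - \mu_{\le 40} \neq 0
```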

So, the way we start is exactly the same as with all

hypothesis tests that we'll show. We assume the null is true.

Then we figure out how far our estimated mean difference is from

that truth of zero under the null.

And even if the truth were zero, we wouldn't expect our sample mean difference

to be exactly equal to zero, but we'd expect it to be relatively close.

So we measure that distance.

And the way we do that is we take what we observed, a mean difference

of 2.2 days, and then divide it

by the estimated standard error of the difference.

Because this is an unpaired situation,

we can't reduce our two samples to a single sample of differences

and base our standard error estimate on the variability

in that sample of differences and the sample size.

We actually have to add together the

uncertainty in both our estimated means separately here.

But when we do the math, the estimated

standard error of this mean difference is 0.075 days.

And so our result,

in terms of standard errors, is 2.2 divided

by 0.075, which is actually very, very large: 29.3.

It's ridiculous really, it's off the charts, isn't it?

Our result is very far from what we'd expect

to have happened just by chance under the null hypothesis.
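
As a sketch of the arithmetic just described, the standard error for the unequal-variance version of the test adds the squared uncertainty of each sample mean, and the distance is the observed difference divided by that standard error. The symbols (sample means, standard deviations s, and sample sizes n for groups 1 and 2) are notation assumed here; the group standard deviations and sizes are not given in this section, so only the resulting 0.075 is plugged in.

```latex
\widehat{SE}(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}},
\qquad
\frac{(\bar{x}_1 - \bar{x}_2) - 0}{\widehat{SE}(\bar{x}_1 - \bar{x}_2)}
  = \frac{2.2}{0.075} \approx 29.3
```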

So what we're going to do now, to quantify how far it is, is translate this

distance into a p-value, again by comparing it to the

distribution of such differences arising from sampling variability

when the samples we're looking at come from populations

with the same mean, or a mean difference of zero.

And then we'll compare this p-value to our preset rejection

level, again of 0.05, which is what the research world uses.

So we have a result that is 29.3 standard errors

above the expected mean difference of zero under the null hypothesis.

How likely is this to occur just by chance

if the null is the truth?

Well, of course the computer would do this for us, but

the idea is we would go and look at the

expected behavior of our estimates around an assumed truth of zero,

which is what we'd assume under the null hypothesis. And we know that

the sampling behavior of the mean

difference estimates is normally distributed around zero.

And we have a result

that I might not even be able to fit on the screen.

This is not drawn to scale.

This is 29.3 standard errors above what we'd expect.

So, what our p-value would measure is the proportion of results we would get.

Or in other words, the probability of getting a result that

was 29.3 or more standard errors away from zero in either direction.

And we know this is way

off in the tails of the normal distribution.

And if we converted this to a p-value, it comes in very, very, very small:

less than 0.0001.

So our results are not very likely to have occurred just

by chance if the null hypothesis is the underlying truth.

If the populations have a true mean difference

in hospital length of stay of zero days.
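
To make the conversion from distance to p-value concrete, here is a minimal Python sketch under the normal sampling distribution described above. The function name `two_sided_p_value` is hypothetical, and a statistics package would typically use a t-distribution instead, which makes little difference with samples this large.

```python
import math

def two_sided_p_value(z: float) -> float:
    """Probability of landing |z| or more standard errors from zero,
    in either direction, under a standard normal distribution."""
    # For Z ~ N(0, 1), P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

mean_diff = 2.2      # days, observed mean difference from the lecture
std_error = 0.075    # days, estimated standard error from the lecture

z = mean_diff / std_error          # distance in standard errors
p = two_sided_p_value(z)

print(f"distance: {z:.1f} standard errors")
print(f"p-value below 0.0001? {p < 0.0001}")
```

Because the tail area is counted in both directions, the same function returns the same value for a distance of 29.3 above or 29.3 below zero.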

So the p-value interpretation is: the probability of getting

a sample mean difference estimate of 2.2 days,

or something even more extreme or less likely than that,

is less than 0.0001 if these two samples came from populations

with the same mean length of stay. In other words, if the null is true.

So we need to make a decision.

Well, I think it's pretty clear, and

of course we've already seen it from the confidence interval.

Our decision would be to reject the null.

The p-value here is way less than our threshold of 0.05,

and so we would reject the null in favor of the alternative.

That is to say, our results are not consistent with the null being

the truth, so we believe the truth is that the means

are not equal.

And how does this jibe with

the decision we make from the 95% confidence interval?

Well, we saw that the confidence interval did not include zero.

So we ruled out 0, or no difference, as a possibility

from the confidence interval as well.

So these both concur in terms of their decision about a

generic version of the underlying truth: whether it's zero or not.

And we'd expect them to concur because they're

using the same exact information to make this decision.

And again with this unpaired situation, just like with

the paired, the p-value is invariant to the direction of the comparison.

If we instead presented the estimate and the hypotheses in terms of the 40-or-under

group compared to the over-40 group, the null hypothesis would be the same:

the mean lengths of stay at the population level

are the same, or the difference is zero.

But we measure it, in our sample, in the opposite

direction, so we get a mean difference of negative 2.2.

That is, those 40 or under had an average length of stay 2.2 days

less than those who were over 40 at the age of first diagnosis.

And so instead of being 29.3

standard errors above what we expected, it's 29.3

standard errors below.

And because with all our p-value computations we're considering the

likelihood of being as far or farther than our sample result

in either direction from the mean, it doesn't matter whether we

measure it as above or change the direction and measure it as below.

We'll get the same p-value.

So now let's look at another example.

A low-carbohydrate as compared with a low-fat diet in severe obesity.

And this is the study where they took 132 severely

obese subjects and randomized them to one of two diet groups.

Either a low carb or a low fat diet group.

And the subjects were followed for a six month period.

And what the researchers actually say directly in

their abstract is, subjects on the low carbohydrate

diet lost more weight than those on the low fat diet.

95% confidence interval for the difference in weight loss

between the groups of negative 1.6 to negative 6.2 kilograms.

And the corresponding p-value less than 0.01.

So

let's look at how they got that result. So here are the data from the two groups.

Take the low-carb group.

There were 64 individuals in the low-carb group.

What they did for each of the 64 people is compare

their weight after the diet to their weight before the diet.

They took the difference and then averaged those across

the 64 people, and on average, people lost weight.

The average difference was negative 5.7 kilograms.

They did the same thing for the low-fat group.

Took the after-diet weight, compared it to the before-diet weight.

So, took that difference for the 68 people, and took the average.

And this group also showed, at least in this study, some weight

loss, on the order of 1.8 kilograms, since that difference was negative.

And we have the standard deviation of the individual changes

for the 64 people in the low-carb group and the

68 people in the low-fat group.

So let's just do some quick computations here.

I'm not even going to write it out, we'll just have it pre-typed here.

So the mean difference in weight change for

the low-carb compared to the low-fat group was the

negative 5.7 we saw on the low-carb minus the negative 1.8 we saw on the low fat.

So the difference in the weight changes was negative 3.9 kilograms

for the low-carb compared to the low-fat group,

which means that on average the low-carb group

lost 3.9 kilograms more than the low-fat group.

And for the estimated standard error of this difference, we just plug into the formula,

and I'll let you take a look at that to remind you what it is.

But it turns out the standard error of the difference is 1.17 kilograms.

So if we were to create a confidence interval

for this, we'd take our observed difference in mean weight changes of negative 3.9

and then add and subtract two estimated standard errors.

When all the dust settles, with a little rounding, we get

the confidence interval that runs from negative 6.2 kilograms to negative 1.6.

So all possibilities for the true mean

difference in weight change show a lower value.

Or more weight loss in the low-carb group

on average compared to the low fat group.
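
The interval arithmetic for the diet example can be checked with a few lines of Python. This is only a sketch of the "estimate plus or minus two standard errors" rule described above; the variable names are chosen here for illustration.

```python
mean_diff = -3.9     # kg, low-carb minus low-fat mean weight change (from the lecture)
std_error = 1.17     # kg, estimated standard error of the difference (from the lecture)
multiplier = 2       # "two estimated standard errors", as in the lecture

lower = mean_diff - multiplier * std_error
upper = mean_diff + multiplier * std_error

# The interval excludes zero when both endpoints fall on the same side of it.
excludes_zero = not (lower <= 0 <= upper)

print(f"95% CI: {lower:.1f} kg to {upper:.1f} kg")
print(f"excludes zero: {excludes_zero}")
```

With a little rounding, the endpoints match the interval quoted from the abstract, and both sit below zero.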

But let's suppose we wanted to get a p-value.

Now we know the p-value will come in less

than 0.05 since our confidence interval did not include zero.

But let's go ahead and get a p-value.

So we set up our two competing hypotheses.

The null is that the mean weight change for the two diet groups

is the same at the population level, or

that the difference in weight change is zero.

And what we do is measure the distance. Our observed weight change difference

was negative 3.9 kilograms on average for the low-carb minus the low fat group.

The standard error is 1.17 kilograms.

And so we had a result that was 3.3 standard errors

below what we'd expect the difference to be under the null hypothesis.

So now we're going to translate the distance into a p-value, by comparing

it to the distribution of such

estimated differences because of sampling variability.

When the population level mean difference is zero.

And then we compare the p-value to the

preset rejection level or alpha level, which, again is 0.05.

So to get the p-value, we say we have a result

that is 3.3 standard errors below the expected mean difference of zero.

How likely is this to occur just by chance

if the null hypothesis is true?

So again, to get the p-value we'll lean on a computer to do it.

But the idea is to

take a look at the theoretical sampling distribution.

We can estimate characteristics of it.

We know if the null is assumed to be true, then the true mean difference

is assumed to be zero.

But we'd expect the estimates to vary somewhat around that, and we

want to see whether ours is amongst the majority or outside of them.

And so we got a result that was 3.3 standard errors below zero.

We want to figure out what proportion of results we could've

gotten that are that far or farther away in either direction from zero,

to get an estimated probability of getting our study results

or something even less likely if the null hypothesis is true,

if in fact these samples came from

populations with the same mean weight change.

And the resulting p-value is very small, less than 0.01.
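
For the diet example, the distance and an approximate p-value can be reproduced as follows. This sketch uses the standard normal distribution from Python's `statistics` module as an approximation; software running the actual t-test would use a t-distribution and report a slightly different, but still small, value.

```python
from statistics import NormalDist

mean_diff = -3.9     # kg, observed difference in mean weight change (from the lecture)
std_error = 1.17     # kg, estimated standard error (from the lecture)
alpha = 0.05         # preset rejection level

z = mean_diff / std_error                     # about 3.3 standard errors below zero
p_value = 2 * NormalDist().cdf(-abs(z))       # area in both tails

print(f"distance: {z:.1f} standard errors")
print(f"p-value below 0.01? {p_value < 0.01}")
print(f"reject the null at alpha = {alpha}? {p_value < alpha}")
```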

So, that says that if these samples came

from populations with the same average weight change,

then the likelihood of getting the sample mean difference of negative 3.9 kilograms

or something even less likely is less than one in a hundred.

So the chances of getting these results are small

if these data came from populations with equal means,

if the null is true.

And so for the decision we make, we compare 0.01 to our preset rejection level for

calling things likely or unlikely.

It's less than 0.05, and we'd reject the null in favor of the alternative.

And of course, we knew this was coming, because our 95% confidence interval

for the difference in population means did not include that null value of zero.

Let's take a look at one last example, this is the

Menu Labeling and Calorie Intake study we've looked at several times before,

where participants were randomized to three different groups

at a restaurant:

Those that got a menu without calorie labels,

those who got a menu with calorie labels.

And those that got a menu with calorie labels plus extra

information stating the recommended daily caloric intake for an average adult.

And this is a summary, visually and numerically, of the results.

We've looked at this previously. So, here were the resulting mean

differences and 95% confidence intervals for each two-way combination of groups.

And we saw that essentially there was no real

or statistical difference in the calories consumed

between those who got no calorie labels and those who got calorie labels.

The average difference was five calories with a

confidence interval that was almost symmetric about zero.

But when we started comparing each of those groups

to the group that got calorie labels plus nutrition info.

There were larger differences that were statistically significant.

So our confidence intervals for those differences didn't include zero.

So now, if we went back and calculated

the p-value, we can get a sense of how each of these would compare to 0.05.

Here are the actual p-values.

So, the p-value here is for testing the null

that the true mean difference in calories consumed between the no-calorie-label

population and the population that would get calorie labels is zero.

The p-value testing that is 0.96.

So it indicates

that if our sample

data came from populations with the same mean

calories consumed, then the likelihood of getting

our sample mean difference, or something more extreme, is very high.

And of course this is way greater than 0.05.

And the decision we would make is we would fail to reject the null.

Let's stick with that language for now.

We failed to rule it out.

And we already knew that would happen because

our confidence interval for the difference includes zero.

For these other two comparisons, recall that the no-calorie-labels

and the calorie-labels groups were so similar here.

And when we compare each of them to the group that got calorie labels plus info,

the differences are both statistically significant.

The p-values are

both less than 0.05,

prompting us to reject the null of no difference at the population level.

Again, we knew that was coming because

the confidence intervals for these differences did not include zero.

So, just a couple more pieces to

talk about here with unpaired studies and results.

For smaller samples, slight corrections need to be made to the number

of estimated standard errors added and subtracted to get 95%

coverage, and for getting a p-value.

But this is what a computer will handle.

I want you to just understand, big picture, how to interpret this, so

we won't worry about having to look up these situations with small sample sizes.

The end result, in terms of interpreting the confidence interval

and the p-value, is the same regardless of sample size.

There are some alternatives to the two-sample

t-test that may be appropriate with very small samples,

where both samples have, say, 10 or fewer observations in them.

The idea is that with really small samples, everything breaks down,

including potentially that t-distribution assumption if we have really skewed data,

because our sample mean can vary wildly, since it

can be heavily influenced by an outlier in the sample.

So what these alternative approaches do is they don't

use the data as measured, but they pool the

data from both samples,

order the values across both samples from smallest to largest, and assign ranks.

For example, the lowest value in the data set gets a rank of one.

The second lowest value gets a rank of two.

Once the ranks are assigned then the data are put

back into the original group and the mean ranks are compared.

And so this approach is robust

to the effects of outliers, because outlying

values won't change the relative rank of an observation.

But it can really affect the summary measure like the mean.
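
The pooling-and-ranking idea can be sketched in Python as below. The data are hypothetical values made up for illustration, the function names are not from any library, and the sketch only shows the ranking step and the comparison of mean ranks, not the full test or its p-value.

```python
def pooled_ranks(group_a, group_b):
    """Rank all values across both groups (1 = smallest) and return the
    ranks belonging to each group. Tied values share their average rank."""
    pooled = [(v, "a") for v in group_a] + [(v, "b") for v in group_b]
    pooled.sort(key=lambda pair: pair[0])
    ranks = {"a": [], "b": []}
    i = 0
    while i < len(pooled):
        j = i
        # Extend j over any run of values tied with position i.
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = ((i + 1) + (j + 1)) / 2   # positions are 1-based
        for k in range(i, j + 1):
            ranks[pooled[k][1]].append(avg_rank)
        i = j + 1
    return ranks["a"], ranks["b"]

def mean(values):
    return sum(values) / len(values)

# Hypothetical small samples; the second version of group A swaps 9 for an outlier.
group_a = [3, 5, 6, 9]
group_a_outlier = [3, 5, 6, 90]
group_b = [1, 2, 4, 7]

ranks_a, ranks_b = pooled_ranks(group_a, group_b)
ranks_a_out, ranks_b_out = pooled_ranks(group_a_outlier, group_b)

print("mean ranks:", mean(ranks_a), mean(ranks_b))
print("mean ranks with outlier:", mean(ranks_a_out), mean(ranks_b_out))
print("raw means of group A:", mean(group_a), mean(group_a_outlier))
```

In this made-up example the outlier leaves every rank unchanged, while the raw mean of group A moves a great deal.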

So the names of the tests that you'll sometimes see come

up that use this approach include, for unpaired comparisons, the Wilcoxon rank-sum

test, and for paired comparisons,

something called the Wilcoxon signed-rank test.

I find these tests somewhat unsatisfying however because they

don't have a confidence interval component that goes along

with them.

So, you only get a p-value that doesn't give you any substantive

insight as to how big or important the differences in your data are.

Another thing I want to point out to you, just FYI:

the test I am showing you, with the mechanics for computing the standard error

as we do, is formally called the two-sample t-test assuming unequal

population variances. Remember, variance is just the standard

deviation of individual values in the population, or a sample, squared.

So another way of phrasing this is

the two-sample t-test assuming unequal population standard deviations.

The traditional t-test, sometimes referred to as the two-sample

t-test assuming equal population variances, assumes

equal variances, and hence standard deviations,

in the two populations being compared by the two samples.

In traditional statistics classes, what you'd be taught to do first is do a

hypothesis test to test the null

that the two populations have equal standard deviations.

And you might say, well, why can't I just

compare the observed values of the sample variances?

Well, again, those are just estimates of underlying truths,

and there are errors in those estimates as well.

So you could do a formal hypothesis test to make a decision about whether

your two samples came from populations with

the same or different individual levels of variation.

But the test that does this

is known to be very non-robust and to not work well.

Secondly, it's kind of crazy to have to do a hypothesis test

to choose which hypothesis test you'll do, so we're going to reject that idea.

[LAUGH]

There's a slight modification to allow for unequal variances or standard deviations.

This modification adjusts the degrees of freedom for

the test and uses a slightly different standard error computation.

So in smaller samples, you'll get slightly different

t distributions depending on which test you use.

And in general, their standard error estimates will be slightly different.

Here's the deal, though. If you want to be truly safe,

if you could only use one t-test

for the rest of your life, it's more conservative and proper

and consistently correct to use the test that allows for unequal variances.

And just to formalize what I said before:

if the underlying population standard deviations are equal,

both approaches, the one assuming equal standard deviations

and the one assuming unequal, give valid confidence intervals.

But the intervals from the approach

assuming unequal standard deviations are slightly wider,

and the p-values slightly larger.

However, if the underlying population level

standard deviations are not equal, then the

approach that we didn't show, assuming equal

standard deviations, does not work that well.

So, the one that works well all the

time is the one that assumes unequal standard deviations.
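
To see how the two versions of the test differ mechanically, here is a small Python sketch of the two standard error and degrees-of-freedom computations, using the standard textbook formulas (the unequal-variance degrees of freedom is the Welch-Satterthwaite approximation). The function names and the summary numbers are hypothetical and chosen only to illustrate the contrast.

```python
import math

def unequal_variance_se_df(s1, n1, s2, n2):
    """Standard error and approximate degrees of freedom when the two
    population standard deviations are allowed to differ."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    se = math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return se, df

def equal_variance_se_df(s1, n1, s2, n2):
    """Standard error and degrees of freedom when a common population
    standard deviation is assumed (pooled variance)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return se, df

# Hypothetical summary data: a small group with a large spread
# compared with a larger group with a small spread.
s1, n1 = 10.0, 8
s2, n2 = 2.0, 30

welch_se, welch_df = unequal_variance_se_df(s1, n1, s2, n2)
pooled_se, pooled_df = equal_variance_se_df(s1, n1, s2, n2)
print(f"unequal variances: SE = {welch_se:.2f}, df = {welch_df:.1f}")
print(f"equal variances:   SE = {pooled_se:.2f}, df = {pooled_df}")

# With equal spreads and equal sample sizes the two computations coincide.
same_a = unequal_variance_se_df(5.0, 20, 5.0, 20)
same_b = equal_variance_se_df(5.0, 20, 5.0, 20)
print("equal spreads:", same_a, same_b)
```

In this made-up unequal-spread case, the pooled computation reports a much smaller standard error and more degrees of freedom than the data support, which is the sense in which the equal-variance approach "does not work that well" when its assumption fails.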

So in summary, the two-sample, unpaired t-test is the method for getting

a p-value for testing these competing hypotheses

about equal or unequal means,

using data from unpaired samples from two independent populations.

The mechanics, except for the standard error computation,

are exactly the same as for the paired t-test.

And the resulting decision we make, just like we saw before with the

paired t-test will concur with the results from

the 95% confidence interval for the difference in means.

If zero is in the interval, then the p-value will be greater than 0.05.

If zero is not in the interval, the p-value will be less than 0.05.

So how do we do this?

We set up the two competing hypotheses, then we assume,

for working purposes, that the null of equal means is true.

We figure out how far our observed result, our sample mean difference,

is from the expected value of zero, in terms of standard errors,

which we can estimate from the samples.

Then we translate this distance into a p-value and make a decision.

Traditionally and by standard practice, if p is less

than 0.05, we reject the null in favor of the alternative.

If p is greater than or equal to 0.05, we fail to reject the null.

And what the p-value measures is the

chance of getting the study results, or something

even less likely, when the samples are assumed

to have come from populations with the same means:

the chances of getting our results if the null were true.
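
The whole procedure summarized above, including the agreement between the p-value decision and the 95% confidence interval, can be sketched as one small Python function. This is an illustration under a normal approximation, not the exact t-based computation a statistics package would run; the function name is hypothetical, and the third example uses a made-up standard error.

```python
import math
from statistics import NormalDist

def unpaired_test_summary(mean_diff, std_error, alpha=0.05):
    """Distance from zero, two-sided p-value, confidence interval, and decision,
    all based on a normal approximation to the sampling distribution."""
    z = mean_diff / std_error
    p_value = math.erfc(abs(z) / math.sqrt(2))          # both tails
    multiplier = NormalDist().inv_cdf(1 - alpha / 2)    # about 1.96 for alpha = 0.05
    ci = (mean_diff - multiplier * std_error, mean_diff + multiplier * std_error)
    return {
        "z": z,
        "p_value": p_value,
        "ci": ci,
        "reject_null": p_value < alpha,
        "ci_excludes_zero": not (ci[0] <= 0 <= ci[1]),
    }

examples = {
    "length of stay (lecture numbers)": (2.2, 0.075),
    "diet study (lecture numbers)": (-3.9, 1.17),
    "hypothetical small difference": (5.0, 100.0),   # made-up standard error
}

results = {name: unpaired_test_summary(d, se) for name, (d, se) in examples.items()}
for name, r in results.items():
    print(name, "-> reject null:", r["reject_null"],
          "| CI excludes zero:", r["ci_excludes_zero"])
```

In each case the two decisions agree, as expected, since the interval and the p-value are built from the same estimate and standard error.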