Learn fundamental concepts in data analysis and statistical inference, focusing on one and two independent samples.

A course from Johns Hopkins University

Mathematical Biostatistics Boot Camp 2

41 ratings

From this lesson

Techniques

This module is a bit of a hodge podge of important techniques. It includes methods for discrete matched pairs data as well as some classical non-parametric methods.

- Brian Caffo, PhD, Professor, Biostatistics

Bloomberg School of Public Health

Okay, so let's go through our example.

Our estimate d of the difference in

the marginal proportions works out to be 0.04.

You can plug in to the sigma hat d squared formula here.

It works out to be about 0.0095 for sigma hat d,

not sigma hat d squared, and then the confidence interval you can

do right here. It's the difference plus or minus

two standard errors, and we get about 0.02 to 0.06.

And notice what happens, though, if you ignore the dependence

and just chop off this covariance term here and

forget about it.

Then you wind up with a significantly inflated standard error.

Sigma d works out to be about 0.0175.
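The numbers in this example can be checked with a short sketch. Only the off-diagonal counts, 150 and 86, are stated in this segment; the diagonal counts 794 and 570 are an assumption taken from the usual version of this approval-rating example. The sketch also assumes that the sigma hat d squared formula on the slide is the standard matched-pairs variance, (p12 + p21 - (p12 - p21)^2) / n.

```python
import math

# Rows = first survey (approve, disapprove), columns = second survey.
# n12 = 150 and n21 = 86 come from the lecture; n11 and n22 are assumed.
n11, n12, n21, n22 = 794, 150, 86, 570
n = n11 + n12 + n21 + n22

p1_plus = (n11 + n12) / n   # marginal approval at time 1
p_plus1 = (n11 + n21) / n   # marginal approval at time 2
d = p1_plus - p_plus1       # difference in the marginal proportions

# Standard error that accounts for the dependence between the two surveys.
p12, p21 = n12 / n, n21 / n
se_matched = math.sqrt((p12 + p21 - (p12 - p21) ** 2) / n)

# Standard error if the covariance term is dropped (surveys treated as independent).
se_naive = math.sqrt(p1_plus * (1 - p1_plus) / n + p_plus1 * (1 - p_plus1) / n)

ci = (d - 2 * se_matched, d + 2 * se_matched)
print(round(d, 3), round(se_matched, 4), round(se_naive, 4))
print(tuple(round(x, 3) for x in ci))
```

Under these assumed counts the matched standard error is close to 0.0095 and the one that ignores the dependence is close to 0.0175, which agrees with the values quoted in the lecture.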

So there is kind of an interesting relationship between the

Cochran–Mantel–Haenszel test and the matched 2 by 2 table.

So imagine if you took each pair and represented their time, first

and second, and gave their responses, yes or no.

And so we can really think of this as an extremely stratified setting, right?

Where every stratum just has the two measurements, first and second.

And if you do that, then there are only

four possible tables of time, first and second, by response, yes and no.

Listing the yes counts at first and second and then the no counts, we get 1 1 0 0, 1 0 0 1, 0 1 1 0, and then 0 0 1 1, like that.

Okay?

So imagine if you represented all the tables like

this. I hope you can agree with me that this would exactly reproduce the

two by two table.

If you knew all these tables, you would exactly reproduce the two by two table.

So here's kind of a famous old result:

McNemar's test is equivalent to the Cochran–Mantel–Haenszel test,

where the subject is the stratifying variable and each 2 by 2

table is the observed 2 by 2 table from the previous slide.

And again, I put

here that this representation is only interesting for conceptual purposes.

But I think it is interesting to note that,

if you're doing matched pairs, you can really view the subject

as an incredibly stratified circumstance where

there are only two counts per table.

If you then analyze the data that way with the Cochran–Mantel–Haenszel test,

you wind up with exactly the same test as McNemar's test.

It's just kind of

a conceptually neat idea,

kind of a fun little fact.
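The equivalence can be checked numerically with a small sketch. It builds one 2 by 2 table per subject (time by response) and computes the Cochran–Mantel–Haenszel statistic without a continuity correction, then compares it with the McNemar statistic (n12 - n21)^2 / (n12 + n21). The diagonal counts 794 and 570 are assumed, not stated in this segment; they do not affect either statistic, because concordant subjects contribute zero to both sums.

```python
# Counts from the matched 2 by 2 table; n11 and n22 are assumed values.
n11, n12, n21, n22 = 794, 150, 86, 570

# One (first_response, second_response) pair per subject, 1 = yes, 0 = no.
subjects = [(1, 1)] * n11 + [(1, 0)] * n12 + [(0, 1)] * n21 + [(0, 0)] * n22


def cmh_statistic(pairs):
    """CMH chi-squared statistic (no continuity correction), one stratum per subject."""
    diff_sum = 0.0  # sum over strata of observed minus expected in the (first, yes) cell
    var_sum = 0.0   # sum over strata of the hypergeometric variance of that cell
    for first, second in pairs:
        row1, row2 = 1, 1          # exactly one response at each time
        col_yes = first + second   # number of yes responses for this subject
        col_no = 2 - col_yes
        total = 2
        expected = row1 * col_yes / total
        variance = row1 * row2 * col_yes * col_no / (total ** 2 * (total - 1))
        diff_sum += first - expected
        var_sum += variance
    return diff_sum ** 2 / var_sum


mcnemar = (n12 - n21) ** 2 / (n12 + n21)
cmh = cmh_statistic(subjects)
print(mcnemar, cmh)
```

The two printed values should agree, since each discordant subject adds plus or minus one half to the numerator sum and one quarter to the variance sum.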

Another fun little fact is that McNemar's test has an exact version.

And so consider only the off-diagonal cells, the discordant cells.

And then under the null hypothesis,

pi 1 2 over pi 1 2 plus pi 2 1 is 0.5, right?

Just look back at the null hypothesis. If the two are equal,

then either one over the sum would be 0.5. Okay.

Then it turns out, also under H 0, that

n 2 1 or n 1 2, given the

sum, is binomial with success probability 0.5 and n 2 1 plus

n 1 2 trials.

So you can use this to come up

with an exact p-value for matched pairs data.

Basically, what we're saying is that, under the null hypothesis

that the two off-diagonal probabilities are identical,

whether you landed in the upper right hand cell or the

lower left hand cell is a coin flip for every matched pair.

And we would have evidence against the null if a

lot more wind up in one of those two cells.

Okay, and so this is actually highly related

to, and we'll cover this as well, the so-called non-parametric sign test.

And what you're saying is that, under the null hypothesis,

things should be exchangeable: whether they disagree

in terms of approving then disapproving or disapproving then approving,

related to our approval example.

And so let's actually work out an example.

Okay, so here we want to test

H 0 that pi 2 1 equals pi 1 2 versus H a, pi 2 1 less than pi 1 2.

And I put in parentheses that this is

pi plus 1 less than pi 1 plus. So pi plus 1 is the approval at time

2 and pi 1 plus is the approval at time 1, disregarding time 2.

Okay.

So this is testing whether or not the approval at

time 2 is lower than the approval at time 1.

Okay, so that's the direction that the margin is looking at.

So we saw 86 people

that disapproved on the first survey

and approved on the second survey, the n 2 1 cell.

And we want to test whether or not that's

smaller than what we would have expected by chance.

So we want the probability of getting data as or more extreme in favor

of the alternative, so the probability that X is less than or equal to 86.

And then,

because we're doing the exact version, we'll condition

on the total number of off-diagonal counts, the 86 plus 150,

and we'll use

a binomial with a success probability of 0.5.

And that probability is about 0.

So we reject the null hypothesis.

For a two-sided test, just double the smaller of the one-sided p-values.

That's fine for the purposes of this class. If you do it in R, it'll maybe

do a slightly better procedure.
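The exact calculation described here can be sketched with nothing more than binomial coefficients. The counts 86 and 150 are the ones quoted in the lecture; the helper name `binom_cdf_half` is made up for this illustration, and the two-sided value uses the simple doubling rule mentioned above rather than whatever refinement a statistics package might apply.

```python
from math import comb

n21, n12 = 86, 150     # discordant counts from the lecture
trials = n21 + n12     # condition on the number of discordant pairs


def binom_cdf_half(k, n):
    """P(X <= k) for X ~ Binomial(n, 0.5), summed with exact integers."""
    return sum(comb(n, j) for j in range(k + 1)) / 2 ** n


p_one_sided = binom_cdf_half(n21, trials)   # P(X <= 86) under H0
p_two_sided = min(1.0, 2 * p_one_sided)     # doubling rule for a two-sided test
print(p_one_sided, p_two_sided)
```

The one-sided probability comes out far below any conventional significance level, which is what "about 0" refers to.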

Okay, I want to cover another thing

that's often omitted when discussing these things.

The marginal odds ratio would be the comparison

of the odds of approval at time 1 relative to the odds of approval at time 2.

So here I put time 1 in the numerator of the

odds ratio, and time 2 in the denominator of the odds ratio.

So what I have are

the odds of approval at time 1 at the top,

divided by the odds of approval at time 2 in the denominator.

So it's a marginal

odds ratio, because these are all marginal probabilities.

Right.

And that is of interest in exactly the same way

the difference in the marginal probabilities is of interest.

But it's a different setting, right?

It's a different setting than if we just sampled some people

at time 1 and a different set of people at time 2,

where we could assume they're independent.

These are exactly the same people sampled

twice, so we need to account for that.

At any rate, just like the ordinary odds ratio,

the way that we construct the

marginal odds ratio confidence interval is that we first

calculate directly the marginal log odds ratio.

It's given by theta hat here.

And then the variance of that estimate,

which you square root to get the standard error,

is given by this guy right here, where you put

hats over everything and estimate them with the relevant sample proportions

in order to get the estimated standard error.

And so you can use that to create a confidence interval

for the marginal log odds ratio when you have matched pairs

data, matched 2 by 2 data.

Okay.

So in the approval rating example, the marginal odds ratio compares the odds

of approval at time 1 to the odds of approval at time 2.

The log odds ratio works out to be 0.16. The standard error works out to be 0.039.

And then the confidence interval for the log odds

ratio will be 0.16 plus or minus two standard errors.

It gives you this right here, about 0.084 to 0.236.

You want to compare these to

0, because it's all on the log scale.

And then exponentiate it if you want the confidence interval for

the marginal odds ratio rather than the marginal log odds ratio.
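A sketch of this calculation follows. The variance formula is shown on the slide but not spelled out in the transcript, so the expression below is an assumption: it is the usual delta-method variance for the difference of two dependent logits. The diagonal counts 794 and 570 are likewise assumed from the usual version of this example.

```python
import math

n11, n12, n21, n22 = 794, 150, 86, 570   # n11 and n22 are assumed values
n = n11 + n12 + n21 + n22

p11, p12, p21, p22 = (c / n for c in (n11, n12, n21, n22))
p1_plus, p2_plus = p11 + p12, p21 + p22   # time 1: approve, disapprove
p_plus1, p_plus2 = p11 + p21, p12 + p22   # time 2: approve, disapprove

# Marginal log odds ratio: odds of approval at time 1 over odds at time 2.
theta_hat = math.log((p1_plus / p2_plus) / (p_plus1 / p_plus2))

# Assumed delta-method variance; the last term reflects the within-person dependence.
var_hat = (
    1 / (p1_plus * p2_plus)
    + 1 / (p_plus1 * p_plus2)
    - 2 * (p11 * p22 - p12 * p21) / (p1_plus * p2_plus * p_plus1 * p_plus2)
) / n
se_hat = math.sqrt(var_hat)

z = 1.96
ci_log = (theta_hat - z * se_hat, theta_hat + z * se_hat)   # log scale, compare to 0
ci_or = tuple(math.exp(x) for x in ci_log)                  # odds ratio scale, compare to 1
print(round(theta_hat, 3), round(se_hat, 3), ci_log, ci_or)
```

With these assumed counts the estimate and standard error round to the 0.16 and 0.039 quoted above. The interval endpoints differ slightly in the third decimal from the lecture's, which appear to start from the rounded 0.16.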

Okay, I want to cover something that always

comes up when I teach this class in person,

because several people will have seen a different formula

for the odds ratio for 2 by 2 tables.

And so I want to cover the one that they see.

And there's

a difference.

One of them is a conditional odds ratio, and the other's a marginal odds ratio.

So imagine if we created a logit model for our approval rating data.

Where we say the logit of the probability that person i

says yes at time 1 is alpha plus U i.

And the logit of the probability that person i says

yes at time 2 is alpha plus gamma plus U i.

So U i is this person-specific effect.

Alpha is common across both times. And gamma is the log odds ratio comparing

the approval rating for a given person at time 2 to time 1, right?

So notice you have to compare the same person,

because otherwise these U i's would not cancel out

when you took the difference in these two logits.

So each U i contains a person-specific effect.

So a person with large

U i is likely to answer yes at both occasions.

A person with small or negative U i is likely to answer no at both occasions.

So then gamma here is the log odds ratio comparing a response of yes at time 2 to a response

of yes at time 1. And in this case, gamma is a subject-specific effect.

You can only interpret gamma if in fact these U i's cancel out.

And that's where you

get this so-called conditional formula for the odds ratio.

So one way to eliminate U i is to use a so-called conditional estimator,

where you condition on the total number of yes responses for each person.

So what you wind up with is only looking at the discordant cells again.

And then the conditional ML estimator for this log odds ratio

turns out to be the log of the ratio of the off-diagonal counts.

And the standard error turns out to be

the square root of the sum of 1 over each of the off-diagonal counts.

So I think people prefer this because it's a

simpler formula, but notice it has a very different interpretation.
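As a minimal sketch, the conditional estimate uses only the two discordant counts from the lecture, 86 and 150. The direction follows the model above, where gamma compares time 2 to time 1, so the ratio is n 2 1 over n 1 2. The interval uses plus or minus two standard errors, as elsewhere in this lecture.

```python
import math

n21, n12 = 86, 150   # discordant counts from the lecture

# Conditional ML estimate of gamma (time 2 versus time 1): discordant pairs only.
gamma_hat = math.log(n21 / n12)
se_gamma = math.sqrt(1 / n21 + 1 / n12)

ci_log = (gamma_hat - 2 * se_gamma, gamma_hat + 2 * se_gamma)   # log scale
ci_or = tuple(math.exp(x) for x in ci_log)                      # odds ratio scale
print(round(gamma_hat, 3), round(se_gamma, 3), ci_or)
```

Note the direction when comparing with the marginal odds ratio: that one put time 1 in the numerator, so the signs on the log scale are opposite, and the subject-specific estimate is also noticeably larger in magnitude.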

In one case we were comparing

the marginal probabilities.

In the other case we had this formulation where we had these person-

specific random effects that had to cancel out. In other words, one of them

averaged across people, and the other one conditioned on people.

So they have different interpretations. One is called

a marginal odds ratio,

with its confidence interval, and this one is called a subject-specific odds ratio,

with its confidence interval.

The difference in interpretation is extremely subtle,

but it still exists, and that's why you get different answers.

So let me just summarize here. The marginal ML estimate has

a marginal interpretation. And the effect

is averaged over all these U i values, if you want to put it back in the same model.

Okay.

The conditional ML estimate has a subject-specific interpretation.

And so, you know, if you ask me when would

you want to use one versus the other,

I kind of think if you were talking about policy-type questions,

then you would want marginal statements.

And then if you have clinical-

type questions, then you probably want subject-specific statements.

But, you know, it's not perfectly clear.

Nonetheless, that's where the difference comes from.

It's the fact that, basically, the logit

is not a linear function.

And so you get a difference between averaging over people and then

creating odds ratios, or creating odds ratios and then averaging odds ratios.

You just get different answers.

And so that's the difference between those two.
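A tiny numerical illustration may help here. The values of alpha, gamma and the U i below are hypothetical, chosen only to show the effect of the logit being nonlinear: every person has the same subject-specific log odds ratio gamma, yet the log odds ratio formed from the averaged probabilities is smaller.

```python
import math


def expit(x):
    return 1 / (1 + math.exp(-x))


def logit(p):
    return math.log(p / (1 - p))


alpha, gamma = 0.0, 1.0               # hypothetical model parameters
u_values = [-3, -2, -1, 0, 1, 2, 3]   # hypothetical person-specific effects U_i

# Subject-specific: for every person, logit(time 2) minus logit(time 1) equals gamma.
subject_specific = [
    logit(expit(alpha + gamma + u)) - logit(expit(alpha + u)) for u in u_values
]

# Marginal: average the probabilities over people first, then form the log odds ratio.
p_time1 = sum(expit(alpha + u) for u in u_values) / len(u_values)
p_time2 = sum(expit(alpha + gamma + u) for u in u_values) / len(u_values)
marginal = logit(p_time2) - logit(p_time1)

print(subject_specific, marginal)
```

In this made-up setting the marginal log odds ratio is roughly half of gamma, even though nothing about any individual person differs between the two calculations.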

I think it's a very subtle thing, and I

think for the purposes of this class, you can ignore it.

I just wanted to present it in case you were among

the subset of people that happened to see this formula, log

n 2 1 over n 1 2, and its standard error.

The reason that it's different is

because we're taking a different approach.

And the reason I do the marginal approach,

especially in this class, is because

everything that we discuss related to matched

2 by 2 tables is marginal.

So we talk about McNemar's test, the exact version of McNemar's test,

and then

the marginal odds ratio.

Everything is related to the marginal probabilities.

So if you're okay with that, then just leave it.

But if you're not okay with that,

and you need to know why this is different from the formula that

you saw before, perhaps in an epi class or in another biostat class,

that's the reason: it's a different formulation.
