Learn fundamental concepts in data analysis and statistical inference, focusing on one and two independent samples.

A course from Johns Hopkins University

Mathematical Biostatistics Boot Camp 2

44 ratings

From this lesson

Techniques

This module is a bit of a hodge podge of important techniques. It includes methods for discrete matched pairs data as well as some classical non-parametric methods.

- Brian Caffo, PhD, Professor, Biostatistics

Bloomberg School of Public Health

Okay, so let's talk about exact inference for odds ratios.

This is the last thing I'll talk about in this lecture.

Let's let X be the number of smokers among the cases, and Y be the number of smokers among the controls.

And remember, in this case X and Y are the random numbers, because we're thinking of case-referent sampling.

The 709 margins are fixed, and we're going to assume that both of them are, say, binomial.

And we want to calculate an exact confidence interval for the odds ratio, not an approximate one. The formula based on the square root of the sum of one over the cell counts is an approximate one.

And I'll show you that you have to eliminate a

nuisance parameter, and I'll show you how to do that.

So let's define the logit function as the log of the odds.

So logit p is log p over 1 minus p, that's the logit function.

So notice that differences in logits are log odds ratios: if you take logit p1 minus logit p2, that's the log odds ratio comparing p1 to p2.
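As a quick numerical check of that statement, here is a minimal Python sketch. The probabilities 0.6 and 0.25 are made-up illustrative values, not numbers from the lecture, and the function name `logit` is simply chosen to match the spoken notation.

```python
import math

def logit(p):
    """Log odds of a probability p strictly between 0 and 1."""
    return math.log(p / (1 - p))

# Illustrative probabilities only; these are not from the lecture.
p1, p2 = 0.6, 0.25

# Log odds ratio computed directly from the two odds.
log_odds_ratio = math.log((p1 / (1 - p1)) / (p2 / (1 - p2)))

# The difference in logits should agree with it up to rounding.
logit_difference = logit(p1) - logit(p2)
print(logit_difference, log_odds_ratio)
```

The two printed numbers should match up to floating-point rounding, which is all the identity claims.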

So as an example, let's define the logit of the probability of being a smoker given that you're a case as delta.

And by the way, this implies that the probability that you're a smoker given that you're a case is e to the delta over one plus e to the delta; if you invert the logit function, you get that.

The logit of the probability of being a smoker given that you're a control, let's call that delta plus theta.

So it's a different number; we're just describing it relative to delta. But because we're not constraining theta, it can still be any number, okay?

Then the probability of being a smoker given that you're a control works out to be e to the delta plus theta divided by one plus e to the delta plus theta.

In this case theta works out to be the log-odds ratio.

Of course it's the log-odds ratio, because if we subtract the two logits, the delta cancels out and we get theta.

So theta's the log-odds ratio.
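Written out, the parameterization described above looks roughly as follows. The shorthand symbols for the two smoking probabilities are notation introduced here for illustration; they are not taken from the slides.

```latex
\begin{align*}
\operatorname{logit}(p) &= \log\frac{p}{1-p} \\
\operatorname{logit}(p_{\text{case}}) = \delta
  \quad &\Longrightarrow \quad
  p_{\text{case}} = \frac{e^{\delta}}{1+e^{\delta}} \\
\operatorname{logit}(p_{\text{control}}) = \delta + \theta
  \quad &\Longrightarrow \quad
  p_{\text{control}} = \frac{e^{\delta+\theta}}{1+e^{\delta+\theta}} \\
\operatorname{logit}(p_{\text{control}}) - \operatorname{logit}(p_{\text{case}})
  &= (\delta + \theta) - \delta = \theta
\end{align*}
```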

And in the way that we've parameterized this, delta, this other parameter, is the so-called nuisance parameter.

We don't care about that.

What we care about is the log-odds ratio relating smoking to case status.

Okay, so let's keep working on the model here.

So here we're going to assume that X is binomial with n1 trials, and we already stipulated that the logit of its success probability is delta, so the probability is e to the delta over 1 plus e to the delta.

Then Y is binomial with n2 trials and success probability e to the delta plus theta divided by 1 plus e to the delta plus theta.

So then the probability that capital X takes on the realized value little x is going to be this binomial probability, and you can work with it to get this formula right here: n1 choose x, et cetera.

So this is just carrying this over from the previous slide: this is the probability that X takes on the realized value little x.

And then I'm going to look at the probability that Y takes on the realized value z minus x, and you'll hopefully see why in a minute.

That's just plugging directly into the binomial formula; again, I have z minus x right here instead of a particular value, say little y.
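The slide formulas are not part of the transcript, so the following is a reconstruction from the binomial model as stated, not a copy of the slide. Plugging the two success probabilities into the binomial mass function and simplifying gives:

```latex
\begin{align*}
P(X = x)
  &= \binom{n_1}{x}
     \left(\frac{e^{\delta}}{1+e^{\delta}}\right)^{x}
     \left(\frac{1}{1+e^{\delta}}\right)^{n_1 - x}
   = \binom{n_1}{x} \frac{e^{x\delta}}{\left(1+e^{\delta}\right)^{n_1}} \\
P(Y = z - x)
  &= \binom{n_2}{z-x}
     \frac{e^{(z-x)(\delta+\theta)}}{\left(1+e^{\delta+\theta}\right)^{n_2}}
\end{align*}
```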

Okay, now, the probability that the random variable X plus Y takes on the realized value z is a little bit harder to calculate, because X and Y are not identically distributed.

If they were identically distributed, then X would be the sum of a bunch of Bernoulli trials, Y would be the sum of a bunch of Bernoulli trials, so X plus Y would be the sum of a bunch of identically distributed Bernoulli trials.

Here they're still both sums of Bernoulli trials, but not with the same success probability, okay.

So here's what we can do. Let's suppose that we decompose z into u and z minus u, with the u part going into X and the z minus u part going into Y.

Because X and Y are independent, the probability of that would be this product right here: the probability that X is u times the probability that Y is z minus u.

And so the probability that X plus Y takes on the value z is going to be the sum over all the possible values of u; in other words, all the different ways we could allocate some of the elements of z to X and then allocate whatever remains to Y.

Okay, so that's a quick little formula you can do.
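The sum over allocations described above can be sketched in a few lines of Python. This is only an illustration: the function names `binom_pmf` and `sum_pmf` and the numbers used (5 and 7 trials, success probabilities 0.3 and 0.6) are assumptions made here, not values from the lecture.

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def sum_pmf(z, n1, p1, n2, p2):
    """P(X + Y = z) for independent X ~ Bin(n1, p1) and Y ~ Bin(n2, p2).

    Sums P(X = u) * P(Y = z - u) over every feasible split of z.
    """
    lo, hi = max(0, z - n2), min(n1, z)  # u must leave a valid count for Y
    return sum(binom_pmf(u, n1, p1) * binom_pmf(z - u, n2, p2)
               for u in range(lo, hi + 1))

# Illustrative sizes and probabilities only.
n1, p1, n2, p2 = 5, 0.3, 7, 0.6
probs = [sum_pmf(z, n1, p1, n2, p2) for z in range(n1 + n2 + 1)]
print(sum(probs))  # the probabilities over all z should add up to about 1
```

When the two success probabilities are equal, the same function reduces to an ordinary binomial with n1 plus n2 trials, which is the identically distributed case mentioned in the lecture.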

Okay, now we're going to get to the point.

So now let's look at the probability that X takes on a particular value x given that the sum X plus Y takes on a particular value z.

And I'm just going to plug in these three lines up here, right?

So the numerator is going to be the probability that X equals x times the probability that Y equals z minus x.

Just to elaborate on that point: the numerator is the probability that X takes on value x and X plus Y takes on value z, the "and" probability.

Because we're stipulating that X is x, that's the same thing as the probability that X takes on the value x and Y takes on the value z minus little x, and then we can factor those probabilities into the product.

So that's the numerator right here, and then for the denominator I'm just plugging in directly the probability that X plus Y takes on the value z.

Okay, so then you put it all in. If you can follow the mathematics, hopefully you follow the mathematics; if you're having trouble with this, I realize it's a little bit in depth.

But if you plug it all in, you wind up with this formula right here.
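The formula itself is on the slide rather than in the transcript. Under the parameterization used in this lecture (logit delta for cases, delta plus theta for controls), carrying out the substitution gives the following reconstruction; the slide may write an equivalent form, for example with every term multiplied by the same constant factor.

```latex
\begin{align*}
P(X = x \mid X + Y = z)
  &= \frac{P(X = x)\, P(Y = z - x)}{\sum_{u} P(X = u)\, P(Y = z - u)} \\
  &= \frac{\binom{n_1}{x}\binom{n_2}{z-x}\, e^{(z-x)\theta}}
          {\sum_{u} \binom{n_1}{u}\binom{n_2}{z-u}\, e^{(z-u)\theta}}
\end{align*}
```

The factor involving e to the z delta and both binomial normalizing denominators appear identically in the numerator and in every term of the sum, so they cancel.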

And again, this is very similar to our development of Fisher's exact test.

The only difference is that now we haven't assumed the null hypothesis to be true.

And so what we have here depends on theta, this log odds ratio.

Okay.

But notice it doesn't depend on delta, right?

So we've gotten rid of delta.

And that's this idea of conditioning away the nuisance parameter.

Here, conditioning on X plus Y conditions away the nuisance parameter.
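To see numerically that delta drops out, the sketch below computes the conditional probability directly from the two binomial models for two very different values of delta with the same theta. The function names and all the numbers (8 and 9 trials, a total of 7, theta of 0.7) are illustrative assumptions, not lecture data.

```python
import math

def expit(a):
    """Inverse logit: turns a log odds into a probability."""
    return 1.0 / (1.0 + math.exp(-a))

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def cond_prob(x, z, n1, n2, delta, theta):
    """P(X = x | X + Y = z) computed from the unconditional binomials."""
    p1, p2 = expit(delta), expit(delta + theta)
    lo, hi = max(0, z - n2), min(n1, z)
    numerator = binom_pmf(x, n1, p1) * binom_pmf(z - x, n2, p2)
    denominator = sum(binom_pmf(u, n1, p1) * binom_pmf(z - u, n2, p2)
                      for u in range(lo, hi + 1))
    return numerator / denominator

# Same theta, two different nuisance values (all numbers illustrative).
a = cond_prob(3, 7, 8, 9, delta=-1.0, theta=0.7)
b = cond_prob(3, 7, 8, 9, delta=2.0, theta=0.7)
print(a, b)  # the two values should agree up to rounding
```

Changing theta instead of delta does change the answer, which is why the conditional distribution can still be used for inference about the log odds ratio.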

But nonetheless, we still have a distribution for our two variables.

Because, remember, I don't need to talk about both X and Y once I've conditioned on X plus Y: if I know X, then I know Y, given that I know X plus Y.

Okay.

So you can use this distribution to calculate an exact hypothesis test for theta equal to theta nought, for values of theta nought other than 0.

The specific case theta nought equal to 0 results in Fisher's exact test, the ordinary hypergeometric distribution.

And then you could invert these tests to yield exact confidence intervals for the odds ratio.

And that is exactly what R does: if you run fisher.test, it'll give you a confidence interval for the odds ratio.

It is doing exactly this procedure right here, inverting tests based on this distribution, which is called the non-central hypergeometric distribution.
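For readers who want to see what inverting the test can look like, here is a rough Python sketch using only the standard library. It finds the values of theta at which each one-sided tail probability of the conditional distribution equals alpha over 2. This is a sketch of the general idea under stated assumptions, not a reproduction of the numerical routine inside R's fisher.test. All function names are hypothetical, the counts (7 smokers of 10 cases, 3 of 12 controls) are made up, and theta follows the lecture's convention (logit for controls minus logit for cases).

```python
import math

def log_choose(n, k):
    """Log of the binomial coefficient, via lgamma to avoid overflow."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def cond_pmf(n1, n2, z, theta):
    """P(X = u | X + Y = z; theta) for every feasible u, as a dict."""
    lo, hi = max(0, z - n2), min(n1, z)
    logw = {u: log_choose(n1, u) + log_choose(n2, z - u) + (z - u) * theta
            for u in range(lo, hi + 1)}
    top = max(logw.values())  # subtract the max before exponentiating
    w = {u: math.exp(v - top) for u, v in logw.items()}
    total = sum(w.values())
    return {u: wu / total for u, wu in w.items()}

def lower_tail(x, n1, n2, z, theta):
    """P(X <= x | X + Y = z); increasing in theta under this convention."""
    return sum(p for u, p in cond_pmf(n1, n2, z, theta).items() if u <= x)

def upper_tail(x, n1, n2, z, theta):
    """P(X >= x | X + Y = z); decreasing in theta under this convention."""
    return sum(p for u, p in cond_pmf(n1, n2, z, theta).items() if u >= x)

def solve_monotone(f, target, increasing, a=-30.0, b=30.0):
    """Bisection for f(t) = target, assuming f is monotone on [a, b]."""
    for _ in range(200):
        mid = (a + b) / 2
        if (f(mid) < target) == increasing:
            a = mid  # the root lies to the right of mid
        else:
            b = mid  # the root lies to the left of mid
    return (a + b) / 2

def exact_ci(x, n1, y, n2, alpha=0.05):
    """Tail-inversion interval for theta, the log odds ratio."""
    z = x + y
    lo, hi = max(0, z - n2), min(n1, z)
    # At the edge of the support one tail is always 1, so that limit is infinite.
    theta_lo = -math.inf if x == hi else solve_monotone(
        lambda t: lower_tail(x, n1, n2, z, t), alpha / 2, increasing=True)
    theta_hi = math.inf if x == lo else solve_monotone(
        lambda t: upper_tail(x, n1, n2, z, t), alpha / 2, increasing=False)
    return theta_lo, theta_hi

# Made-up counts: 7 smokers among 10 cases, 3 smokers among 12 controls.
x, n1, y, n2 = 7, 10, 3, 12
theta_lo, theta_hi = exact_ci(x, n1, y, n2)
print("log odds ratio interval:", theta_lo, theta_hi)
print("odds ratio interval:", math.exp(theta_lo), math.exp(theta_hi))
```

Working with log weights keeps the computation stable when theta is far from zero, and exponentiating the limits for theta gives the corresponding interval on the odds ratio scale.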

So we're not going to go through any calculations with this because, as you can tell, at this point it's gotten rather involved.

But I did just want to show everyone where these exact odds ratio calculations come from.

They basically come from this formulation of the problem as a non-central hypergeometric distribution.

So what I'm hoping you got from today's lecture was a little bit of information about the odds ratio and some of its more general-purpose uses, for example in case-control studies.

And then also a little bit about where some of the more complex formulas for performing inference on the odds ratio come from.