Learn fundamental concepts in data analysis and statistical inference, focusing on one and two independent samples.


A course from Johns Hopkins University

Mathematical Biostatistics Boot Camp 2


From this lesson

Discrete Data Settings

In this module, we'll discuss testing in discrete data settings. This includes the famous Fisher's exact test, as well as the many forms of tests for contingency table data. You'll learn the observed-minus-expected-squared-over-the-expected formula, which is broadly applicable.

- Brian Caffo, PhD, Professor, Biostatistics

Bloomberg School of Public Health

Okay, here's another example we're going to do. This is from Rice's book, Mathematical Statistics and Data Analysis. First of all, I have no affiliation with Rice and have never met him, so it's easier for me to say this: I love this book. I think it's wonderful, this Mathematical Statistics and Data Analysis book. So if you are looking for a book recommendation, I like that one. In addition, I really like Agresti's book as well, and I'm willing to stipulate my conflict of interest in recommending it. But I do really like it; I read it all the time.

At any rate, there's an interesting example in this book, involving counts of words taken from some novels. Two of them were known to be Jane Austen novels, and one was in question as to whether it was written by that author; let's say it was found later. There may be other ways you would want to analyze this data, but we want to use it as an example for the chi-squared test. So don't think too hard about specifically how you would analyze this data, because I doubt this is what you would arrive at immediately. But it's not unreasonable, by the way.

So imagine that book three is the one where you don't know whether it's from the same author — in this case Jane Austen. I'm just spitballing here. You want to test whether the distribution of these words is equivalent across the three books, having sampled so many words from each of the books. Okay?

So that's the setting; let's see if we can figure out some expected cell counts to do a chi-squared test. Our null hypothesis is that the distribution of these words is the same for every book; the alternative is that the distributions of at least two books are different. So in this case we have a multinomial for every column, and we want to test equivalence of those multinomial probabilities across the columns.

Okay. So if the book were irrelevant, we would look at these row margins and say the word "a" should appear roughly 434 out of every 1,017 words. Our (1,1) cell was 147, but 375 words were sampled from that book, so we multiply 375 by what we would expect to see of that word if the book were irrelevant — that's 434 over 1,017 — and that's how you get the expected cell count.

Follow through and take the observed (1,2) cell: that's how many times "a" appeared in book two, and it came up 186 times. How many would we expect out of 440 words? From our estimate, viewing the book as irrelevant, that would be 434 over 1,017 times 440, and you would compare that to 186 to see how much the observed count deviates from the expected count.

Then you would move on to the next word, "an". Right? Disregarding books, "an" was seen 62 times out of 1,017 words. So we would take 375, multiply it by 62, divide by 1,017, and compare that to 25, to see if the number of "an"s in book one was different from what we'd expect — and so on.
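The expected-count arithmetic just described can be sketched in a few lines (in Python here rather than R; the numbers are the ones quoted above — the full table isn't reproduced in the transcript):

```python
# Expected count under the null: (row total) * (column total) / (grand total).

def expected(row_total, col_total, n):
    """Expected cell count if book (column) is irrelevant to word (row)."""
    return row_total * col_total / n

n = 1017  # total words sampled across all three books

# Word "a" (row total 434) in book one (375 words sampled), observed 147:
e_a_book1 = expected(434, 375, n)   # about 160.0, versus observed 147

# Word "a" in book two (440 words sampled), observed 186:
e_a_book2 = expected(434, 440, n)   # about 187.8, versus observed 186

# Word "an" (row total 62) in book one, observed 25:
e_an_book1 = expected(62, 375, n)   # about 22.9, versus observed 25

print(round(e_a_book1, 2), round(e_a_book2, 2), round(e_an_book1, 2))
```

Each (observed − expected)² / expected term from these cells then contributes to the overall statistic.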

You can go through all the calculations, and the sum of observed minus expected squared over expected is 12.27. The degrees of freedom are (6 − 1) × (3 − 1), which in this case works out to be 10. At this point I would ask you, as an exercise, to figure out the chi-squared p-value based on 10 degrees of freedom using R.
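As a sketch of that exercise (in Python rather than R), the chi-squared upper-tail probability can be built with only the standard library, using the recurrence Q(x; k+2) = Q(x; k) + (x/2)^(k/2) · e^(−x/2) / Γ(k/2 + 1), started from the closed forms at 1 and 2 degrees of freedom:

```python
import math

def chisq_sf(x, df):
    """P(X > x) for X ~ chi-squared with df degrees of freedom.

    Builds up from df = 1 (erfc form) or df = 2 (exponential form)
    using the standard two-step recurrence on the survival function.
    """
    if df % 2 == 1:
        k, q = 1, math.erfc(math.sqrt(x / 2))
    else:
        k, q = 2, math.exp(-x / 2)
    while k < df:
        q += (x / 2) ** (k / 2) * math.exp(-x / 2) / math.gamma(k / 2 + 1)
        k += 2
    return q

# The statistic from the word-count example was 12.27 on 10 degrees of freedom:
print(round(chisq_sf(12.27, 10), 3))  # about 0.267
```

With a p-value around 0.27, this test gives no evidence against equal word distributions across the books.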

Now, I should make some comments about sampling assumptions and the reasonableness of modeling the words this way. This goes without saying in all of these problems: we're making assumptions about how the experiment was conducted and that the model we're applying is relevant. In this case you could probably say, yeah, the sampling assumptions aren't so bad — there are lots of words in a book, and we're only sampling a few hundred of these very generic words, so modeling that process as multinomial doesn't seem so bad, at least to me, assuming that's how the experiment was conducted. But you always want to think about that.

How well do the assumptions of what I'm doing — the chi-squared test — reflect the way in which the experiment was conducted? Anyone can, of course, just take a table, click some buttons, and get a p-value. But that p-value is supposed to represent a probability; that probability is supposed to quantify the randomness in the experiment; and that randomness is only being quantified through this statistical model. So our results are only meaningful insofar as the model is a meaningful reflection of reality.

Okay, so here is kind of a funny one. I got this from Agresti's book — both of these books have lots of wonderful discussions of contingency table material. In this case, they were rating couples: a husband's and a wife's ratings of sexual fun. N was never, F was fairly often, V was very often, and A was almost always. Let's say they sampled 91 couples and cross-classified them.

A logical first question to ask would be: are the ratings independent of one another? Is the wife's rating independent of the husband's rating, and so on? And you can make all the relevant jokes about the experiment at home while we move on.

Okay, so at this point we're going to test the null hypothesis that the row variable, the husband's rating, and the column variable, the wife's rating, are independent, versus the alternative that they're not independent. So let's talk about what happens under independence. The probability that a husband rated N and the wife rated A would factor into the probability that the husband rated N times the probability that the wife rated A. Okay?

Again, we don't have these probabilities; we're going to have to estimate them. For the chi-squared test, we estimate them under the null hypothesis and compare them to the observed cell counts. Let me just do one of them. The first row probability is 19 over 91, because if you disregard the wife's ratings, the husband rated N 19 times out of 91. And the wife rated N 12 times out of 91, if you disregard the husband's ratings. So our estimated probability for that specific cell under independence would be 19/91 times 12/91 — we can multiply these because we made the assumption of independence under the null hypothesis. Then we multiply that by 91, the total number of couples, to get the expected count 2.51, which we compare to the observed count of 7.

When you do that throughout, the expected counts are given by this formula right here: the row total times the column total, divided by n rather than n squared, because you also multiply by n. So e_ij is computed that way for every cell, and the logic of where the formula comes from is clear. And the degrees of freedom are again (rows − 1) × (columns − 1).
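A minimal sketch of that formula in Python — the full 4 × 4 couples table isn't reproduced in the transcript, so the demonstration table below is hypothetical, but the one cell worked above (row total 19, column total 12, n = 91, expected 2.51) is checked directly:

```python
def expected_counts(table):
    """e_ij = (row total) * (column total) / n for every cell."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return [[r * c / n for c in col_totals] for r in row_totals]

def chisq_stat(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

# The single cell worked in the lecture: 19 * 12 / 91 is about 2.51.
print(round(19 * 12 / 91, 2))

# Hypothetical 2 x 3 table, just to show the mechanics:
toy = [[10, 20, 30],
       [20, 20, 20]]
stat = chisq_stat(toy)
df = (len(toy) - 1) * (len(toy[0]) - 1)   # (rows - 1) * (columns - 1) = 2
print(round(stat, 3), df)
```

The same two functions work unchanged for the equal-distribution setting, which is part of the point the lecture makes below.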

This time I'm going to do even less: I'm going to let you calculate the statistic and compare it to the chi-squared distribution. I would note — I find this fascinating, so I keep repeating it — that in all of the cases we covered, you could just execute the chi-squared test by using this formula, e_ij = n_i+ × n_+j / n, and you wind up in every case with the identical chi-squared statistic. I suggest you try it: go back to some of the other examples and calculate the chi-squared statistic that way.

At any rate, that's why in textbooks the chi-squared statistic is often presented only as this formula right here. But again, even though the statistic stays the same, the interpretation of the results depends on the design of the experiment and on which margins are fixed by the design, and so on. I think that's an important caveat. Nonetheless, in terms of actually doing this stuff — if you have to program it, for example — you can always just use this simple formula and not spend too much time thinking about what's fixed and what's not while you're doing the calculations. Of course you want to think about that while you're doing the interpretation, but for the calculations you just use this formula, which is extremely convenient.

Oh, and apparently I lied when I said I'd let you do this on your own, because here on the next slide I do it. I define a 4 × 4 matrix x, and chisq.test(x) will give it to you. Again, remember the continuity correction, so if you do it by hand you probably won't get exactly the same numbers. If you apply this formula — observed minus expected squared over expected — you get around 17. The degrees of freedom are 3 squared, which is 9, and the p-value works out to be just under 5%.
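To check that "just under 5%" claim without R (a sketch, taking the statistic as the 17 quoted above), one can numerically integrate the chi-squared density with 9 degrees of freedom over the upper tail:

```python
import math

def chisq_density(x, df):
    """Density of the chi-squared distribution with df degrees of freedom."""
    return x ** (df / 2 - 1) * math.exp(-x / 2) / (2 ** (df / 2) * math.gamma(df / 2))

def upper_tail(stat, df, upper=200.0, steps=20000):
    """P(X > stat) by Simpson's rule; the density is negligible past `upper`."""
    h = (upper - stat) / steps
    total = chisq_density(stat, df) + chisq_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * chisq_density(stat + i * h, df)
    return total * h / 3

# Statistic around 17 on (4 - 1) * (4 - 1) = 9 degrees of freedom:
p = upper_tail(17.0, 9)
print(round(p, 4))  # just under 0.05
```

This agrees with the slide: a statistic of about 17 on 9 degrees of freedom sits just past the 5% critical value of roughly 16.9.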

I should say a couple of caveats. One is that the chi-squared approximation is an asymptotic approximation, using the central limit theorem. It may not be clear how the central limit theorem is kicking in here, but it is. So you have to worry about whether or not the central limit theorem gives a good approximation. But fortunately for you, in a slide or two we're going to talk about how you can do exact Monte Carlo finite-sample approximations — so you should get excited about that now.

Often in textbooks you'll see discussion of whether the cell counts are large enough to use a large-sample approximation. My recommendation is to just always use the small-sample one. In chisq.test, you can set the simulate.p.value = TRUE argument and get the exact small-sample test, which gets rid of the need for that. And for the test of independence, you can do it without too much computing — you'd have to have a pretty big table not to be able to do it.

For more elaborate chi-squared tests, you can't always do the exact ones. I have an R package called exactLoglinTest, which handles crazier distributions, though it's maybe not the most trivial R package to use. And then there's software called StatXact, which does quite a few exact versions of contingency table tests.

Anyway, all of the chi-squared tests — in terms of comparing the statistic to the chi-squared distribution — are asymptotic tests. They rely on the central limit theorem. You can use rules that say the cell counts have to be so large, but in reality, as with all asymptotic approximations, you're putting your faith in the idea that the asymptotics have kicked in on your behalf. Checking that the cell counts are large is a way to give yourself some hope that that's true.

I had one other point I wanted to make. I think I made the point that if we'd used this last chi-squared independence formula for all the tests, we would have gotten the same chi-squared statistic in every case, and in every case the degrees of freedom are always (rows − 1) × (columns − 1).

Oh, last thing — now I remember what I wanted to say. Where in the world does this observed-minus-expected-squared-over-expected statistic come from? It actually comes from the Poisson distribution. The e's are the expected cell counts, and it turns out that for a Poisson random variable the expected value is also the variance. So you can think of each term as a Poisson count, minus its mean, divided by its standard deviation, all squared — which gives you (o − e)² / e. You might think of each element of the sum as a little z statistic squared. A z statistic squared is a chi-squared with 1 degree of freedom, so when you add them up, you get a chi-squared. Now, because we're estimating components of the expected counts, we lose degrees of freedom that way, and if you want to do a careful accounting of how the asymptotics work, you have to account for that. But that's where this formula comes from — the Poisson distribution, with the statistic as a bunch of squared z statistics. Of course the asymptotics are a little more delicate than that, but honestly not that much more delicate.
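The fact doing the work there is that a Poisson random variable's variance equals its mean, so (O − E)/√E is roughly a standard normal. A quick simulation can illustrate this — a sketch using a simple stdlib Poisson sampler, where the rate λ = 20 is an arbitrary choice:

```python
import math
import random

def poisson_sample(lam, rng):
    """Knuth's algorithm: count uniform draws until their product drops below e^-lam."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while p > limit:
        p *= rng.random()
        k += 1
    return k - 1

rng = random.Random(42)
lam = 20.0
draws = [poisson_sample(lam, rng) for _ in range(100_000)]

mean = sum(draws) / len(draws)
var = sum((x - mean) ** 2 for x in draws) / len(draws)

# Mean and variance should both be close to lambda...
print(round(mean, 2), round(var, 2))

# ...so the average of (O - E)^2 / E, a squared z statistic, should be close to 1,
# the mean of a chi-squared with 1 degree of freedom.
z2 = sum((x - lam) ** 2 / lam for x in draws) / len(draws)
print(round(z2, 2))
```

That each (o − e)²/e term behaves like a chi-squared-1 on average is exactly why the sum of them is compared to a chi-squared distribution.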

Okay, so just to rehash: these equal-distribution tests, like the one we did for the word counts, yield the same thing as the independence tests — they're all the same test, under similar modeling assumptions. Whether your model is binomial or multinomial with the row totals fixed; binomial or multinomial with the column totals fixed; the total sample size fixed, where you assume a multinomial and test independence; or nothing fixed, where you assume a Poisson model and test some form of row-versus-column additivity — all of these wind up with exactly the same chi-squared statistic, the same p-value, and the same reject-or-not result.

If this bothers you — and I've tried to address it on numerous occasions throughout the lecture — but if it's still gnawing at you, what I would say is that this is really common in statistics: mathematically equivalent results are applied in different settings. They have different interpretations, but the actual statistics, the mathematical results, are equivalent. That's how I like to rationalize to myself that all of these things are coincidentally the same.

So, some final comments on the asymptotic test. The chi-squared result is asymptotic, so it requires that something be going to infinity. In the multinomial case, the overall sample size has to go to infinity; if you have multinomial columns, then all of the column totals have to go to infinity. There are various strategies for checking whether you're close enough for the asymptotics to be reasonable, but we'll talk about an exact test that gets rid of the need to think about that. The degrees of freedom are always (rows − 1) × (columns − 1). What we'll talk about now is how generalizations of Fisher's exact test can be used — or how continuity corrections can be used — to make the asymptotic approximations accurate even for relatively small-sample problems.

So let me show you how you can actually use Monte Carlo to calculate an exact p-value for contingency tables. Imagine we got the individual data points, not just the contingency table. For the first couple it was NN; for the second couple it was NN; for the third couple it was NN; for the fourth couple it was NN; and so on. Then here's a couple that was FN, and so on. I've clearly sorted them in some way, but this is the raw data. If you were to take this data and create counts of the number of NN's, FN's, VN's, AN's, FF's, AF's, and so on, you would get exactly the contingency table from a couple of slides ago. And here's the interesting fact.

This is the husband's data and the wife's data. So think about it: if they were independent, then the matching of the husband–wife pairs would be irrelevant. Whether you're talking about a specific couple, or just any husband and any wife, it shouldn't matter whether you line them up with their correct spouse or not.

So what you can do is take either the wife row here, or the husband row — either one; doing both would be unnecessary — and just permute it. What you get is a realization of the contingency table under the assumption that the particular pairing of the couples is irrelevant; in other words, that husbands and wives are independent.

But notice that if we were to permute that data and reconstruct the contingency table, we would still have the same number of husbands answering N, husbands answering F, wives answering N, wives answering F, wives answering V, wives answering A, and so on. In other words, this procedure constrains the margins but permutes the interior of the table, which is exactly what Fisher's exact test did.

Right? This is exactly the Monte Carlo version of Fisher's exact test — just generalized, with more possible outcome values for the row and column variables. So you do this, recalculate the contingency table, and calculate the chi-squared statistic for each permutation; the percentage of times it's larger than the observed value is a so-called exact p-value. And in R it's pretty easy to do this: chisq.test(x, simulate.p.value = TRUE) does this exact Monte Carlo version of the test, which is really, really neat.
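In Python, the same permutation idea might look like the following sketch. The tiny paired dataset is made up purely for illustration (in the real example you would use the 91 couples):

```python
import random
from collections import Counter

def chisq_stat(husbands, wives):
    """Chi-squared statistic computed from raw paired ratings."""
    counts = Counter(zip(husbands, wives))
    row_tot = Counter(husbands)
    col_tot = Counter(wives)
    n = len(husbands)
    return sum((counts[(r, c)] - row_tot[r] * col_tot[c] / n) ** 2
               / (row_tot[r] * col_tot[c] / n)
               for r in row_tot for c in col_tot)

def permutation_pvalue(husbands, wives, nsim=2000, seed=1):
    """Permute one margin; the margins stay fixed while the table interior shuffles."""
    rng = random.Random(seed)
    observed = chisq_stat(husbands, wives)
    wives = list(wives)                 # copy so the caller's data is untouched
    hits = 0
    for _ in range(nsim):
        rng.shuffle(wives)              # break any husband-wife pairing
        if chisq_stat(husbands, wives) >= observed:
            hits += 1
    return observed, hits / nsim

# Made-up paired ratings (N = never, F = fairly often):
husbands = ["N"] * 6 + ["F"] * 6
wives    = ["N"] * 5 + ["F"] + ["N"] + ["F"] * 5   # table [[5, 1], [1, 5]]

stat, p = permutation_pvalue(husbands, wives)
print(round(stat, 2), p)
```

Because only one margin is shuffled, every permuted dataset has exactly the original row and column totals — the same constraint Fisher's exact test imposes.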

So this is just a generalization of Fisher's exact test. It's a very nifty little result, and it has a lot of intuition to it. Right? It makes sense that, under the null hypothesis of independence, we should be able to permute which specific pair it was, recalculate the contingency table, and get roughly the same discrepancy between the observed and expected counts.

Notice that I said to use the chi-squared statistic. We are using the chi-squared statistic, but we have an exact small-sample p-value. This p-value is valid regardless of the size of the data — though of course it then tends to be a little conservative. So it uses the chi-squared statistic, but it does not use the central limit theorem to compare the statistic to the chi-squared distribution.

I would also say there are other choices for the test statistic: because we're calculating the null distribution through this permutation process, we could use whatever statistic we want right here. That said, the chi-squared statistic is not necessarily a bad choice. So anyway, this is an interesting way to get an exact p-value for contingency table tests, where you're interested in things like independence between the rows and the columns.