An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

A course from Johns Hopkins University

Statistics for Genomic Data Science

From this lesson

Module 3

This week we will cover modeling non-continuous outcomes (like binary or count data), hypothesis testing, and multiple hypothesis testing.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

I'm going to pick up where I left off in the last calculations.

So I've got these p values that I've calculated

from doing my multiple hypothesis test.

And then I want to correct them for multiple testing.

So the first one that we might want to consider is the Bonferroni correction, so

that's controlling the familywise error rate,

the probability of even one false positive.

So one way that we can calculate these sort of things is to

use the p.adjust function in R.

So I pass it the p values that I originally calculated.

And then I tell it to use the Bonferroni method.

And what it's going to do is it's going to transform those p values, such that I can

now apply a threshold to the transformed p values, which look like this.

So you can see that they're mostly 1, because the Bonferroni adjustment multiplies

each p value by the number of tests and caps it at 1, and I can look at the quantiles of that

distribution to be really clear about how large they are for the most part.

Then if I want to do Bonferroni control, calling

things significant at a family-wise error rate of 5% means I need to look for

all Bonferroni-adjusted p values less than 0.05.

In this case, there are none, so there are no statistically significant results

at a Bonferroni corrected level, okay?
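
The Bonferroni step described here is simple arithmetic: each p value is multiplied by the number of tests and capped at 1, and the result is compared to 0.05. The following is a minimal Python sketch of that arithmetic, not the R call from the lecture. The helper name `bonferroni_adjust` and the p values are hypothetical and chosen only for illustration.

```python
def bonferroni_adjust(pvals):
    """Multiply each p value by the number of tests, capping the result at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


# Hypothetical p values for illustration only (not the lecture's data).
pvals = [0.001, 0.02, 0.3, 0.6, 0.9]
adjusted = bonferroni_adjust(pvals)

# Count the tests called significant at a family-wise error rate of 5%.
n_significant = sum(p < 0.05 for p in adjusted)
print(adjusted)
print(n_significant)
```

With thousands of tests, multiplying by the number of tests pushes almost every value up to the cap, which is why the adjusted values in the lecture are mostly 1.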

So then the other thing that I could do is I could do adjustment for

a false discovery rate.

So again we're controlling a different error rate here, but

I can use the same function, p.adjust, and I can pass it the p values from the F-stats

object, and I can tell it method BH.

That's Benjamini-Hochberg,

which is one of the ways to control the false discovery rate.

So if I look at the adjusted p values, again, it's set up so that if I

call everything with an adjusted p value less than 0.05 significant,

it will control the false discovery rate at 5%, so

then I can look at the number of those that are less than 0.05.

Here nothing is significant under either correction, but

that's how you check to see what's statistically significant.
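
For the Benjamini-Hochberg adjustment, each p value is scaled by the number of tests divided by its rank, and the scaled values are then forced to be monotone, starting from the largest p value and working down. The following is a minimal Python sketch of that computation, assuming no missing values. The helper name `bh_adjust` and the p values are hypothetical.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(pvals)
    # Indices of the p values, from the largest p value to the smallest.
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1
    for position, i in enumerate(order):
        rank = m - position  # 1-based rank of pvals[i] in ascending order
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p values for illustration only (not the lecture's data).
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
bh = bh_adjust(pvals)

# Count the tests called significant at a false discovery rate of 5%.
n_significant = sum(p < 0.05 for p in bh)
print(bh)
print(n_significant)
```

The running minimum matters here: 0.039 on its own scales to 0.065, but it inherits the smaller adjusted value 0.05125 from the next-largest p value, so the adjusted values never decrease as the raw p values increase.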

You could also do this quite easily

with the qvalue package, to adjust for multiple testing, or with the limma package.

So, if I want to do limma, I can just calculate the limma adjusted p values.

I'm using the topTable function, so I'm going to pass it the

eBayes limma model fit calculated in the previous lecture.

And I'm going to tell it I want every single one of the adjusted p values out,

and then if I look for the adj.P.Val column, got to get the caps right,

then you end up with adjusted p values for the limma model.

Okay? So the limma model here actually finds

some differential expression.

So there are two that are significant at a Bonferroni adjustment.

Not Bonferroni, Benjamini-Hochberg adjusted, so

this p value is the adjusted one from Benjamini-Hochberg.

You can also apply qvalue directly to control the false discovery rate.

So here if I pass the limma p values to the qvalue function,

then, I can look at that and

it'll tell me how many it identifies at different thresholds.

So here, at a p value threshold it finds 2, and at a q value threshold of

0.05 it finds 2 as well. So it basically tells you, for different

calculations and different threshold levels, how many it finds significant.

The other thing it can tell you,

the nice thing about q value compared to some of the other approaches,

is that it will tell you the estimated fraction of null hypotheses.

In this case, it estimates that the prior probability, or

the fraction of null hypotheses, is one.

So that means that there is very little differential expression signature there.
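
One common way to estimate the fraction of null hypotheses is to note that null p values are roughly uniform, so the share of p values above some cutoff lambda, divided by (1 - lambda), approximates that fraction. The following is a simplified Python sketch of that idea with a single fixed cutoff. The qvalue package combines estimates over a range of cutoffs, so this is not its exact procedure. The helper name `estimate_pi0`, the cutoff of 0.5, and both sets of p values are assumptions made for illustration.

```python
def estimate_pi0(pvals, lam=0.5):
    """Estimate the fraction of null hypotheses from p values above lam."""
    m = len(pvals)
    n_above = sum(p > lam for p in pvals)
    # Null p values are roughly uniform, so about (1 - lam) of them exceed lam.
    return min(1.0, n_above / (m * (1.0 - lam)))


# Hypothetical p values spread evenly over (0, 1): looks like no signal.
flat_pvals = [(i + 0.5) / 100 for i in range(100)]

# Hypothetical mix: 20 very small p values plus 80 evenly spread ones.
mixed_pvals = [0.001] * 20 + [(i + 0.5) / 80 for i in range(80)]

pi0_flat = estimate_pi0(flat_pvals)
pi0_mixed = estimate_pi0(mixed_pvals)
print(pi0_flat)
print(pi0_mixed)
```

An estimate of 1, as in the lecture's output, corresponds to the first case: the p values look uniform, so there is little evidence of any non-null tests.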

You can also apply that to the edge object that we calculated earlier.

And you similarly get q values for the edge object, for

the likelihood ratio test.

So again here for the likelihood ratio test it's the unadjusted value.

You can use the ODP statistic for edge and get slightly more power.

But this is sort of the direct F-statistic comparison for multiple testing.

In any of these cases, you're basically calculating either a q value or

an adjusted p value, comparing it to the same usual threshold, and

identifying how many things are statistically significant, and

that will control the error rate for you.