Learn to use tools from the Bioconductor project to perform analysis of genomic data. This is the fifth course in the Genomic Big Data Specialization from Johns Hopkins University.

Loading...

來自 Johns Hopkins University 的課程

Bioconductor for Genomic Data Science

168 個評分

Learn to use tools from the Bioconductor project to perform analysis of genomic data. This is the fifth course in the Genomic Big Data Specialization from Johns Hopkins University.

從本節課中

Week One

The class will cover how to install and use Bioconductor software. We will discuss common data structures, including ExpressionSets, SummarizedExperiment and GRanges used across several types of analyses.

- Kasper Daniel Hansen, PhDAssistant Professor, Biostatistics and Genetic Medicine

Bloomberg School of Public Health

Now we're back with the second part on the use case and how to combine GRanges and

AnnotationHub.

As a quick reminder, we have gotten hold of our promoters.

Our promoters we'd say is a GRanges thing.

We have 50,000 promoters.

We have gotten hold of our histone modification peaks.

We have roughly 75,000 of those.

And we're now going to ask is this particular histone modification,

H3K4 trimethylation, enriched in promoters?

So, to do that, we want to use,

we basically want to see how often do they overlap?

And the workhorse for that is findoverlaps,

which is a function we'll use again and again.

So let's ask how many promoters all have a peak?

We get this overlap object out, but

I'm really mostly interested in how big a percentage of the promoters overlap.

And here,

as we have explained with findoverlaps, we have to think about double counting.

We can take the query hits, we can do a unique.

So let's see a length, let's first save the file overlaps.

So this is the number of peaks that overlap, length(unique) because

we don't want to double count queryHits(prompt).

This is the number of

promoters that have a peak in them, and

the number of peaks that have a promoter in them, we get with the subject hits.

I often do a little trick here.

I utilize that we have the convenient function called subsetByOverlaps,

where, say, out of my peaks, how many overlap a promoter?

And perhaps we want to ignore strand.

And then I say what's the length of that?

That's the same number as before.

And then I divide that by length of the peaks.

This is the percentage of peaks that overlap a promoter.

Now 30%?

Is that big or little?

It's not 100%.

It turns out if you know a little bit more about H3K4 trimethylation,

is that this histone mark doesn't just mark actual promoters.

It's also sometimes associated with enhancers and other regulatory elements.

So 30% is perhaps pretty good.

Is it?

Well, we have to do a little bit more analysis in order to answer that.

Let's first answer like how many promoters actually have a peak in them.

So that requires us to switch the promoter and the peaks.

And we have to divide by the length of the promoters.

And we get a number which is basically 50%.

That's pretty nice.

My rule of thumb, is that in any given cell types, 50% of genes are expressed.

And that fits pretty well with this 50% we see here.

It's probably a little bit of coincidence, but you

have to accept that in many cell types, a lot of genes are not being expressed.

So this system modification should only map actual promoters,

which is going to be a subset of all the promoters.

Now, is this big or not?

Let's look a little bit about the actual, let's look at how big are the peaks and

how big are the promoters we have and how big is the intersection.

So, let's do a, let's say how many paces does the peaks really call?

We ignore strand in this reduce.

And I want to have it in megabases.

So 11 megabases.

How many megabases does the promoters call?

62 megabases.

And, how big is the overlap?

So, here we take the intersection of the peaks,

and the promoters, and we ignore the strand again.

And we say sum, width, and divide it by a million.

That seems to be wrong.

I'm not dividing by a million, I am dividing by 106.

3 megabases.

So all of this is like small parts of the genome.

11 megabases, 62 megabases, 3 megabases.

But again, is it big?

So, by the way, when I'm doing this intersect here between peaks and

promoters I don't have to call reduce on it,

because I know that the output of intersect is a normal R ranges.

So I don't get anything out of calling reduce on it.

Now, in order to do this, let's quantify the relationship between these things.

And we've got to do a little back of the envelope calculation that's useful.

But I wouldn't say it's a perfect

statistical answer to the question of whether or not the overlap is significant.

We've got to basically say that we've got 3 billion bases in the human genome.

And a base can either be both in a peak and in a promoter, or

it can be in a peak and not in a promoter and vice versa, and

we're going to look for is there some kind of significant enrichment here.

We're not really going to do significance.

We end up with a two by two table.

But we are going to look at the art range who has the strength of the association.

Now, let's construct our matrix.

It's going to be a 2 by 2 matrix that's filled up with zeroes.

And we're going to have the colnames and

the rownames they're going to be in and out,

those are colnames, and the same thing for the rownames.

So now we have the matrix and ready to fill out.

So in in, is going to be the number of bases that

That are part of both of the things.

So that's really the intersect,

the sum of the basis here we have in the intersect command.

We've gotta put that up in element 1,1.

So we're going to have peaks on the rows and promoters on the columns.

So now I'm going to take in,out, 1,2, and

this is going to be the bases that are in peaks but not in promoters.

So we can do in a setdiff.

Also it should be in the peaks, not the promoters.

Need to ignore the strand.

And this was as a Grange so we want to have the sum of the width, and again I'm

using that my setdiff is a set operation, so I get a normal R ranges out of it.

And we're going to do the same thing, but

we are going to reverse peaks and promoters.

So we can do the colSums of in, out, and

now we get the 62 million we dad before,

and if we do the row sums, We get the 11 millions we had before.

Now, we need to fill out the last thing, and that's really the number of bases

in the human genome that is not part of any of these things.

So that's really equal to, let's say the human genome is 3 billion bases,

And we are going to subtract sum of inOut.

Yeah, we will just subtract sum of inOut because we have a zero down in

the lower-right corner.

So we now have a 2 by 2 table and we can compute an out range for this table.

Normally you would be able to compute an out range using the Fisher

test function in R, which starts a Fisher exact test.

So the way you do this would be fisher.test of inOut.

And you want to get the component called the statistic.

I happen to know this.

But now because we have entries that sum up to the length of the human genome and

that's bigger than the biggest integer, we get an overflow problem.

So now we have to compute the odds ratio by hand.

Now, I always get confused about that so I have to look it up in Wikipedia.

But really it's equal to the product

of the Of the two columns divided by the,

so let's see if it came out.

So this is the diagonal, one diagonal

So the odds ratio is a number between zero and

infinity, and values greater than 1 means enrichment.

So this here means that The overlap between the peaks and

the promoters is like 18 fold more enriched than we would expect.

Now this is actually very driven by the size of the human genome.

So we just put in like a number which we said was 3 billion, and

you can argue, is that really the right number?

So this has to do with giving a good statistical answer to this entire thing.

It turns out that while the human genome has 3 billion bases,

some of these bases can't be mapped using short read technologies.

Some of the bases haven't even been sequenced.

So it's probably wrong to say that the mappable part of

the human genome is 3 billion.

[COUGH] We can do a sensitivity analysis and

see how much does this odds ratio change if we dramatically change our assumption.

So let's say that only half the human genome is mappable.

That's probably an underestimate.

So, let's set inOut, the 2,2 to be zero first,

and then we say inOut[2,2] = 1.5

times 10 to the 9- sum(inOut).

Okay, now we are back, but we have a much smaller genome.

[COUGH] Let's compute the same odds ratio.

And we can see that this really dramatically reduced the enrichment but

it's still much, much bigger than 1.

So while the size of the human genome on the mappable part of the human genome

definitely has some influence on what is the quote,

true odds ratio for this analysis, [COUGH],

it doesn't seem to change our conclusion that you have some amount of enrichment.

You would like to put a p-value on this doing a Fisher exact test like here,

turns out that because the numbers are big, everything is significant.

That's kind of a problem with Fisher exact test, when you use it on tables with very

big numbers, so this is for sure very significant.

But you can argue is this even the right analysis?

We are letting each base independently

either belong to the peak region or the promoter region, and

is that really the right model for this type of phenomena?

That's probably unlikely, but I would say that this back of the envelope calculation

we're doing here suggests that there is some enrichment.

It may not be that the odds is precise 8.9 or 18.3.

It may be a different number, but

there's probably some amount of enrichment happening here.