Learn to use tools from the Bioconductor project to perform analysis of genomic data. This is the fifth course in the Genomic Big Data Specialization from Johns Hopkins University.


A course from Johns Hopkins University

Bioconductor for Genomic Data Science



From this lesson

Week One

The class will cover how to install and use Bioconductor software. We will discuss common data structures, including ExpressionSets, SummarizedExperiment and GRanges used across several types of analyses.

- Kasper Daniel Hansen, PhD, Assistant Professor, Biostatistics and Genetic Medicine

Bloomberg School of Public Health

In this video, we will discuss basic usage of the IRanges package.

Let's load it.

An IRanges is a vector that contains integer intervals.

Sounds a little funny, but let's take an example.

We construct an IRanges by using the IRanges constructor function,

and we give two out of three arguments: start, end, and width.

We only need two, because if we know two of them, the last one can be inferred.

So here we have a start, an end, a width, and

we can see the width column has been filled by knowing the start and the end.

Here we construct another IRanges by specifying the start and the width, and

we get exactly the same object out.
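The rule that any two of start, end, and width determine the third can be sketched outside of R. The following is a plain-Python illustration of that arithmetic, not the IRanges API: `make_range` is a hypothetical helper, and the example values are assumed for illustration. Intervals are closed, so a range from 1 to 3 has width 3.

```python
def make_range(start=None, end=None, width=None):
    """Return (start, end, width) given exactly two of the three values."""
    given = sum(v is not None for v in (start, end, width))
    if given != 2:
        raise ValueError("supply exactly two of start, end, width")
    if width is None:
        width = end - start + 1    # closed interval: both endpoints count
    elif end is None:
        end = start + width - 1
    else:
        start = end - width + 1
    return (start, end, width)


# Assumed example values: three ranges built two different ways.
by_end = [make_range(start=s, end=e) for s, e in [(1, 3), (3, 5), (5, 7)]]
by_width = [make_range(start=s, width=3) for s in (1, 3, 5)]
print(by_end == by_width)  # True
```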

We access the different components of the IRanges vector

using accessor functions that are named start, end, and width.

So for instance, calling start on an IRanges,

we get back a vector of the start positions of these different intervals.

We can also set the elements of the IRanges using these accessor functions.

In this case here we have resized the different ranges to have width 1,

and we can see we used the start positions as the anchor point of the resizing.

There's a function that allows more flexible resizing,

not surprisingly it's called resize and we'll discuss it briefly below.

IRanges can have names like any other vector and

because they are vectors, even though they look a little bit like matrices,

they don't have a dimension, they have a length.

And we subset them using a single bracket with a single index,

either an integer or the name, exactly as we know it from any other vector.

We can also concatenate two IRanges, just as we can concatenate any other vector.

A specific type of IRanges that's very important because we encounter it

again and again in usage, is something called a normal IRanges, and

that's a little hard to explain at first, so let's plot them and see some examples.

So first, we evaluate a function here that allows us to plot these things.

We make sure we can plot two things at the same time.

We create an IRanges and we plot it.

So here we have an IRanges, and as you should've seen before, there's no

requirement that the different intervals inside the IRanges are non-overlapping.

We have two intervals on the left that are clearly overlapping.

So you can think of this, for example, as exons in the genome.

A normal IRanges is created by the reduce function and it's a minimal representation

of the original IRanges as a set, so what do I mean by that?

Well, I mean that each integer that belongs to one or

more of the original ranges belongs to a single range.

Furthermore, the ranges are as big as they can be.

If you look to the right of the picture, the two ranges have been merged

into one, and they're also sorted, so that the first element

in the output is the leftmost element on the diagram.

So this is kind of a minimal representation

of the integers that belong to the original IRanges.

And we'll see many functions that output normal IRanges.
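To make the idea of a minimal, sorted representation concrete, here is a plain-Python sketch of what a reduce-style merge does. It is not the IRanges implementation: `reduce_ranges` is a hypothetical name, ranges are modelled as `(start, end)` tuples with closed ends, and the input values are assumed. The sketch also merges ranges that are directly adjacent, since together they cover an unbroken run of integers.

```python
def reduce_ranges(ranges):
    """Merge overlapping or adjacent closed intervals into a sorted, minimal list."""
    merged = []
    for start, end in sorted(ranges):
        # start <= last_end + 1 means the new range touches or overlaps the last one
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Assumed example: two overlapping ranges on the left, two adjacent ones on the right.
ir = [(1, 4), (3, 4), (7, 8), (9, 10)]
print(reduce_ranges(ir))  # [(1, 4), (7, 10)]
```

Every integer covered by the input ends up in exactly one output range, and no two output ranges could be merged further.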

In a way, the inverse to reduce is a function called

disjoin that can be incredibly handy when you need it, but

I've found that I mostly use it in rare, esoteric circumstances.

So disjoin here also creates a set of disjoint intervals.

So disjoint, non-overlapping intervals.
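A disjoin-style split can be sketched in plain Python as follows. This is an illustration of the idea rather than the package's code: `disjoin` here is a hypothetical stand-in, ranges are closed `(start, end)` tuples, and the input values are assumed. The input is cut at every start and just after every end, and only the pieces that lie inside some original range are kept.

```python
def disjoin(ranges):
    """Split closed intervals at every boundary, keeping only covered pieces."""
    # Cut points: every start, and the position just past every end.
    cuts = sorted({s for s, _ in ranges} | {e + 1 for _, e in ranges})
    pieces = []
    for left, right in zip(cuts, cuts[1:]):
        piece = (left, right - 1)
        # No boundary falls strictly inside a piece, so it is either fully
        # inside some input range or not covered at all.
        if any(s <= piece[0] and piece[1] <= e for s, e in ranges):
            pieces.append(piece)
    return pieces


# Assumed example: three ranges that overlap at single positions.
print(disjoin([(1, 3), (3, 5), (5, 7)]))
# [(1, 2), (3, 3), (4, 4), (5, 5), (6, 7)]
```

Unlike the reduce-style merge, the pieces are as small as they need to be so that each one is covered by the same subset of the original ranges.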

When you manipulate IRanges, there is a set of functions that do

fairly straightforward manipulation.

One kind of manipulation takes all of

the original ranges and produces a single new range for each of the original ranges.

One example of that is the resize function.

Let's close this off here and look at resize.

So here we have an IRanges of length 4.

And we resize the ranges around the start position. You can see the width

argument here that tells us that we want to resize them to width 1,

and the fix argument that anchors them at the start position.

More useful in my experience is to resize them from the center of the intervals.

In this case here, the original ranges have an even number of elements, and

the start position becomes the element to the left of the midpoint.
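The anchoring behaviour described here can be sketched in plain Python. This is a minimal illustration, not the package's code: the `resize` function below is a hypothetical stand-in, ranges are closed `(start, end)` tuples, the input values are assumed, and the center case assumes the new start is rounded down, which matches the "element to the left of the midpoint" behaviour described above.

```python
def resize(rng, width, fix="start"):
    """Resize a closed interval to a new width, anchored at start, end, or center."""
    start, end = rng
    old_width = end - start + 1
    if fix == "start":
        new_start = start
    elif fix == "end":
        new_start = end - width + 1
    elif fix == "center":
        # Integer division rounds down, so an even-width range
        # lands on the element just left of its midpoint.
        new_start = start + (old_width - width) // 2
    else:
        raise ValueError("fix must be 'start', 'end', or 'center'")
    return (new_start, new_start + width - 1)


# Assumed example ranges, all of even width.
ranges = [(1, 4), (3, 4), (7, 8), (9, 10)]
print([resize(r, 1, fix="start") for r in ranges])   # [(1, 1), (3, 3), (7, 7), (9, 9)]
print([resize(r, 1, fix="center") for r in ranges])  # [(2, 2), (3, 3), (7, 7), (9, 9)]
```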

There are other types of manipulation, such as shift, flank, and so on.

Another way of manipulating IRanges is thinking of IRanges as sets of integers.

In other words, converting them to normal IRanges first.

Then we can think about doing stuff like union and intersection.

So let's take two IRanges here and look at them.

So we can take the union of them, and

we can see what comes out of it is a normal IRanges.

We have merged things together.

And you can see here that, in the union,

we merge the neighboring ranges together.

Another way of saying that is that, the union is

equal to first concatenating the two IRanges together and then calling reduce.

We can also do intersection and set difference.
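Thinking of ranges as sets of integers can be made literal in a short plain-Python sketch. It is an illustration only, not how the package computes these results: `to_set` and `to_normal` are hypothetical helpers, ranges are closed `(start, end)` tuples, and the two inputs are assumed example values.

```python
def to_set(ranges):
    """Expand closed intervals into the set of integers they cover."""
    return {i for s, e in ranges for i in range(s, e + 1)}


def to_normal(ints):
    """Collapse a set of integers back into sorted, maximal closed intervals."""
    out = []
    for i in sorted(ints):
        if out and i == out[-1][1] + 1:
            out[-1] = (out[-1][0], i)   # extend the current run
        else:
            out.append((i, i))          # start a new run
    return out


# Assumed example: two sets of width-1 ranges.
ir1 = [(1, 1), (3, 3), (5, 5)]
ir2 = [(4, 4), (5, 5), (6, 6)]

print(to_normal(to_set(ir1) | to_set(ir2)))  # union:      [(1, 1), (3, 6)]
print(to_normal(to_set(ir1) & to_set(ir2)))  # intersect:  [(5, 5)]
print(to_normal(to_set(ir1) - to_set(ir2)))  # difference: [(1, 1), (3, 3)]
```

The union shows the merging of neighbors: positions 3, 4, 5, and 6 come from different input ranges but end up in a single output range.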

Now the real power of the IRanges library is the findOverlaps function.

findOverlaps allows us to relate two sets of IRanges to each other.

Let's take

an example here.

Let's look at them. And now we're going to do the overlap between them.

So the output of the findOverlaps function is this two dimensional matrix or

it looks like a two dimensional matrix.

When I call findOverlaps, I give it a query and a subject.

And that's what the two columns of the Hits object here refer to.

So the Hits object is really an adjacency matrix, or

it is a matrix of indices of the different overlaps.

So the first row, or the first element of the Hits object,

means that range number one,

in the query, overlaps range number one in the subject.

So let's verify that by hand.

So range number one of the query

is ir1[1], and range number one of the subject is ir2[1].

And we can see that these two ranges do overlap.

And there are three overlaps total, and this gives us the indices.
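The pairs of indices in the Hits object can be reproduced with a brute-force plain-Python sketch. This is not how the package finds overlaps, and `find_overlaps` is a hypothetical stand-in; ranges are closed `(start, end)` tuples, and the values of `ir1` and `ir2` are assumed examples chosen to give three overlaps. Indices are 1-based to match the description above.

```python
def find_overlaps(query, subject):
    """Return 1-based (query index, subject index) pairs for overlapping ranges."""
    hits = []
    for qi, (q_start, q_end) in enumerate(query, start=1):
        for si, (s_start, s_end) in enumerate(subject, start=1):
            # Two closed intervals overlap when each starts before the other ends.
            if q_start <= s_end and s_start <= q_end:
                hits.append((qi, si))
    return hits


# Assumed example ranges.
ir1 = [(1, 3), (4, 7), (8, 10)]   # query
ir2 = [(3, 5), (4, 6)]            # subject

hits = find_overlaps(ir1, ir2)
print(hits)  # [(1, 1), (2, 1), (2, 2)]

# Query indices that overlap anything in the subject, without repeats.
print(sorted({q for q, _ in hits}))  # [1, 2]
```

Comparing every query range against every subject range is quadratic, so this only serves to show what the indices mean.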

We can access the query hits and the subject hits through the queryHits and

the subjectHits accessor functions.

So let's call queryHits on the overlaps here, and

we get basically the first column of what looks like a matrix here.

Note that there are repeated elements in the query hits,

because range number two in the query, overlaps multiple ranges in the subject.

It's very common to call unique on both queryHits and subjectHits.

In this case here, unique on queryHits would give us exactly

the elements of the query that overlap anything in the subject.

findOverlaps is a complicated function.

It has a lot of arguments that allow us to deal with

whether or not there should be a minimal overlap,

whether an overlap should be any overlap, like we've done it here,

or whether an overlap means the ranges should be exactly equal to each other, for example.

And there's also a way of specifying what should be returned.

Should it be all the possible overlaps?

Or just the first overlap you encounter?

And so on and so forth.
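The kinds of options just listed can be illustrated with a plain-Python sketch. The parameter names `min_overlap`, `kind`, and `select` are hypothetical stand-ins chosen for this illustration and should not be read as the package's argument names or return types. Ranges are closed `(start, end)` tuples and the example values are assumed.

```python
def find_overlaps(query, subject, min_overlap=1, kind="any", select="all"):
    """Brute-force overlap search with a few illustrative options.

    min_overlap: minimum number of shared integer positions (kind="any" only)
    kind:        "any" for any overlap, "equal" for identical ranges only
    select:      "all" hits per query range, or only the "first" one found
    """
    hits = []
    for qi, (q_start, q_end) in enumerate(query, start=1):
        for si, (s_start, s_end) in enumerate(subject, start=1):
            shared = min(q_end, s_end) - max(q_start, s_start) + 1
            if kind == "equal":
                match = (q_start, q_end) == (s_start, s_end)
            else:
                match = shared >= min_overlap
            if match:
                hits.append((qi, si))
                if select == "first":
                    break   # move on to the next query range
    return hits


# Assumed example ranges.
ir1 = [(1, 3), (4, 7), (8, 10)]
ir2 = [(3, 5), (4, 6)]

print(find_overlaps(ir1, ir2))                  # [(1, 1), (2, 1), (2, 2)]
print(find_overlaps(ir1, ir2, min_overlap=2))   # [(2, 1), (2, 2)]
print(find_overlaps(ir1, ir2, select="first"))  # [(1, 1), (2, 1)]
print(find_overlaps(ir1, ir2, kind="equal"))    # []
```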

This takes a while to become totally comfortable with, and

we will see more uses of it throughout the class.

So in many cases when you're running findOverlaps,

you're not really interested in the exact overlaps, you're just interested in,

how often do I see overlaps between a query set and a subject set?

And for this we have the convenient function,

countOverlaps that returns a vector.

In this case, it means that range number one in ir1 overlaps one range in ir2, namely ir2[1].

Element two overlaps two elements of ir2.

That's represented in the Hits object above.

And element number three doesn't overlap anything at all.
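The per-range counts can be sketched in plain Python without ever building the list of pairs. `count_overlaps` is a hypothetical stand-in for illustration, ranges are closed `(start, end)` tuples, and the example values are assumed.

```python
def count_overlaps(query, subject):
    """For each query range, count how many subject ranges it overlaps."""
    return [
        sum(1 for s_start, s_end in subject if q_start <= s_end and s_start <= q_end)
        for q_start, q_end in query
    ]


# Assumed example ranges.
ir1 = [(1, 3), (4, 7), (8, 10)]
ir2 = [(3, 5), (4, 6)]

print(count_overlaps(ir1, ir2))  # [1, 2, 0]
```

The result has one entry per query range, so it lines up with the query and can be used directly to filter it.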

countOverlaps is faster and more memory efficient than findOverlaps,

which matters a lot if you use this for extremely big IRanges,

and we will be using it for extremely big IRanges.

Finally we can also relate IRanges in a different way than through the overlaps.

We can look at which ones are close to each other.

So again, we take our two IRanges and we can ask,

which of the ranges in ir2 are closest to the ones in ir1?

So in this case, you can match them up.

This is actually a function that is used a lot in genomics.

It could be something like, we have a peak or we have a region of interest and

we want to find the nearest gene.
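Matching each range to its closest neighbor can be sketched in plain Python as below. This is an illustration under stated assumptions, not the package's algorithm: `separation` and `nearest` are hypothetical helpers, ranges are closed `(start, end)` tuples, the example values are assumed, and ties are broken by taking the lowest subject index, which may differ from how the real function resolves ties.

```python
def separation(a, b):
    """0 if two closed intervals overlap, else the number of steps between them."""
    (a_start, a_end), (b_start, b_end) = a, b
    if a_end < b_start:
        return b_start - a_end   # b lies to the right of a
    if b_end < a_start:
        return a_start - b_end   # b lies to the left of a
    return 0


def nearest(query, subject):
    """For each query range, the 1-based index of the closest subject range."""
    return [
        min(range(len(subject)), key=lambda i: separation(q, subject[i])) + 1
        for q in query
    ]


# Assumed example ranges: think of ir1 as regions of interest and ir2 as genes.
ir1 = [(1, 3), (4, 7), (8, 10)]
ir2 = [(3, 5), (4, 6)]

print(nearest(ir1, ir2))  # [1, 1, 2]
```

The third query range overlaps nothing, and it is matched to the second subject range because that one ends closer to it.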

So this was an introduction to the basic uses of the IRanges package.

We're going to see more advanced uses of the package in later videos, and

this package is basically providing the foundation of the GenomicRanges package,

that we will discuss again and again in this class.