我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

Loading...

來自 Johns Hopkins University 的課程

DNA 测序算法

392 個評分

我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

從本節課中

Preprocessing, indexing and approximate matching

In this module, we learn useful and flexible new algorithms for solving the exact and approximate matching problems. We'll start by learning Boyer-Moore, a fast and very widely used algorithm for exact matching

- Ben Langmead, PhDAssistant Professor

Computer Science - Jacob Pritt

Department of Computer Science

The next data structure we can use to implement a multi-map is a hash table.

Hash tables are extremely versatile, efficient, and

widely used data structures.

They are more or less the standard data structure used to represent sets and

maps in practice.

So let's walk through an example where we're going to use a hash

table to implement the same kind of multi-map that we implemented before

using an ordered list of pairs.

So, here's an empty hash table.

An empty hash table consists of an array of buckets, and

the buckets start out as empty boxes.

You can think of these as null references or null pointers.

As we add items to the hash table, these empty boxes will become lists.

Also associated with this table is a hash function,

which we'll refer to with this red h.

And the hash function maps each distinct key,

each distinct 3-mer onto one of these buckets in this array.

In our grocery store analogy,

you can think of the buckets as aisles of a grocery store, and

the hash function as the way that we assign groceries to aisles.

So now, let's add all of the 3-mers of T to this table.

So we'll start with the first 3-mer.

And we use our hash function to assign it to a bucket.

And let's say the hash function assigns it to the bucket indicated with

the red arrow here.

And so now we append the corresponding key value pair to this bucket, like this.

So a given bucket is simply a link list.

So to add our key value pair, we're simply appending an entry onto that link list.

And the entry consists of a key, which is the 3-mer,

which is highlighted in blue here.

And then a value, which is the offset from which we got that 3-mer,

which is shown in green here.

And then a null reference, which is shown in pink.

And so if we add some more entries later,

we add them on to the end of the list, by expanding on this null reference here.

So now let's add the second 3-mer.

So say the hash functions assigns it to the top bucket, and so

we append the corresponding entry to that bucket like this, and

then let's do the third 3-mer, and

let's say the hash function assigns it to that bucket indicated with the red arrow.

So we append the element to that bucket.

For the fourth 3-mer we do the same thing, and then now for

the fifth 3-mer this is one that we've seen before, we've seen GTG before.

So the hash function assigns it to the same bucket as before.

And since there's all ready an entry in that bucket we have to add it

on to the end of the list that's already present in that bucket.

So, we could've pre-pended it to the beginning of the list too.

It's not all that important which we do.

So moving on, the next 3-mer is TGT.

So let's say the hash function assigns TGT to this bucket.

So again, we'll append the corresponding entry onto the end of the list.

Now if we stop and consider the entries that are in this bucket now,

the one toward the bottom of this slide,

we notice that two different 3-mers are now both present in this bucket.

This isn't too surprising since we have many more

possible 3-mers than we have buckets here.

So the pigeonhole principle tells us,

we expect some distinct 3-mers to end up in the same bucket together like this, but

when this happens, we call this a collision.

And when there are many collisions,

then querying the data structure can become somewhat slower.

But it does happen now and then and it's not a terrible thing, but

it's something to take note of.

So onto the next 3-mer.

We add this onto the end of this list.

And then, the next one, which is TGG, which we add like this.

And then, we go into the first of three 3-mers that are all triple G.

Three GGG's.

So let's say the hash function assigns GGG to this bucket here, so

we append it onto the end of that list.

And we notice that once again we have two different 3-mers that

coexist in the same bucket, so that's again a collision.

Then we move onto the second GGG and the third GGG, and

of course those get assigned by the hash function to the same bucket, so

they get appended on to the end of that list.

Now say that we want to query this hash table.

We want to query this index.

So, let's say we have our pattern P here, and

we're going to extract some 3- mer from this pattern.

It doesn't have to be the first one, it could be.

It doesn't matter which of the 3-mers we pick.

Because when we built our index we used all of the 3-mers from the text phi.

So let's go ahead and pick this one, this GGG, this triple G.

And we want the index to tell us all the offsets where triple G occurs within

the text T.

So the first thing we do is we use our hash function

to map triple G onto a bucket, we get this bucket here.

We know that this is the one and only bucket that we have to look in

because this is the bucket that the hash function assigns triple G to.

In other words, we know we're in the right aisle of the grocery store now,

we just have to go looking down the aisle.

So we look at the first entry in the list.

And we look at the key, at the 3-mer.

And we noticed that this key is not equal to the 3-mer that we're querying with,

it's not equal to triple G.

So we ignore it and move on.

And then for the next three entries in the list we noticed that the key does match

our query 3-mer.

In all three cases it's triple G.

So in each case, we know that the corresponding offset in that key-value

pair is one of the ones that the index is going to report

back as being an offset within T where triple G occurs.

So at the end of the day, we got offsets 8, 9, and 10.

So in python, it's very easy to build and use a hash table

because the python dictionary type is an implementation of a hash table.

So here, I've initialized a python dictionary

in a way that corresponds to the situation from the previous slide.

This dictionary here.

And here are the three-mers, the keys in this dictionary, and

then associated with each 3-mer is a list of offsets where that 3-mer occurs.

So for example, triple G, the 3-mer that we were quarrying with before,

occurs at offsets 8, 9, and 10.

And to query this dictionary, we simply use square bracket notation here, so for

example, if we want to find all the offsets where the triple G query occurs,

then we do this, and we get back the list 8, 9, 10.

Just like form our previous example.

If we want to know where the triple CGT occurs, we again can use square bracket

notation and we get back a list that has one entry which is the offset 3.

So the details here are hidden from us.

We don't know what hash function is being used or

exactly how many elements are in the hash table.

All of that is taken care of by Python.

But under the hood, when you use a dictionary in Python,

you're using something like the hash table that we just saw.