我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

Loading...

來自 Johns Hopkins University 的課程

DNA 测序算法

358 個評分

我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

從本節課中

Edit distance, assembly, overlaps

This week we finish our discussion of read alignment by learning about algorithms that solve both the edit distance problem and related biosequence analysis problems, like global and local alignment.

- Ben Langmead, PhDAssistant Professor

Computer Science - Jacob Pritt

Department of Computer Science

We talked before about one way of solving the approximate matching problem using

the pigeon hole principle.

And in this lecture we're going to discuss a new family of methods

that use dynamic programming and the idea of edit distance.

But they also solve the approximate matching problem.

This new family of methods is very, very flexible.

It's used to solve many different biosequence analysis problems, including

global alignment and local alignment, which we'll also talk about soon.

Dynamic programming and edit distance don't depend on any exact matching

algorithms in the same way that the pigeonhole principle method did.

So we'll think of it as being a separate class of algorithm from the kinds that

we've examined so far,

which is why I'm showing it here sort of off on the bottom by itself.

So we've seen two different ways of measuring the distance between a pair of

strings, hamming distance and edit distance.

And hamming distance is the distance between two strings that are of

equal length.

And it just equals the number of substitutions that are required to turn

one of the strings into the other.

Edit distance is defined as the minimum number of substitutions or

insertions or deletions required to turn one string into another.

So here's a question.

If I give you two strings that are of the same length, and I ask you to find

the hamming distance between those two strings, is that easy or is that hard?

Well, it's actually quite easy.

So basically the way you would do it is you would line up your strings, x and y,

and go through position by position, comparing each position.

And in every case where you find the corresponding characters mismatch,

you would add 1 to a counter.

And then at the end of the day, you would just report back that counter.

So that's pretty easy.

So, how about an algorithm to calculate the editDistance between two strings?

Is that easy?

Well that's actually harder.

It's really not at all obvious off the bat what this algorithm could be.

But before we discuss what that algorithm is here are a couple of warm up questions

to get us thinking about hammingDistance and editDistance.

So, let's say that we have two same length strings X and Y.

And we'd like to say something about the relationship between the hammingDistance

of those two strings and the editDistance of those two strings.

Are they equal, or is one of them greater than the other, or greater than or

equal to the other?

What can we say about these?

Well, one thing we can say is that the editDistance between X and

Y will always be less than or equal to the hammingDistance between X and Y.

In other words, if you allow me to use only substitutions to turn X into Y,

I can do this with a certain minimal number of changes.

If you additionally allow me to use insertions and

deletions, I can potentially do it with fewer changes.

So here's an example.

So, here are two same length strings, X and Y.

And what's the hammingDistance between X and Y?

Again, that's not too hard.

We can just sort of go across and count the mismatches.

So one, two, three, four, five, etc., actually adds up to ten.

So the hammingDistance between X and Y is ten.

And now, what's the editDistance between X and Y?

It turns out that it's two.

So because these two sort of strategically placed insertions and

deletions are all that's needed to turn one of those strings into the other.

So the ability to use insertions and deletions in addition to substitutions is

something that allows us to potentially get from X to Y with fewer changes.

Okay, here's another question.

So say that x and y are different lengths.

What can we say about the editDistance between x and y in terms of their lengths?

So can we, for example, put a lower bound on the editDistance between X and Y?

So yes, the lower bound is that the editDistance has to be at least as

great as the absolute difference between the lengths of X and Y.

Why is that?

Well because in order to edit X into Y, let's say for

example X has length two and Y has length four.

In order to edit X into Y, we have at least have to introduce

as many edits as is required to make them the same length.

We might then require some more edits to make them exactly the same sequence,

but at the very least we have to make them the same length.

So the number of edits, the editDistance between two different length strings is at

least equal to the absolute difference in their lengths.

Okay, so onto an algorithm for finding editDistance between two strings.

So the idea behind this algorithm, the basic idea is that if we have two

strings and we would like to know the editDistance between them.

It helps us a lot if we know the edit distances between prefixes of

those two strings.

So for example, here I have two long strings, X and Y.

Imagine they're very long,

which is why they're shown disappearing off to the left there.

And then we have to imagine that they're very, very long and

that we're only really looking at the ends of the strings.

That's why they're kind of fading off over here.

We're only looking at the very end.

And say we know the editDistance between the prefixes of

the two strings that are highlighted here in red.

And let's say the editDistance between those two prefixes is 147, okay.

So this means that the minimal number of edits required to get from one of those

red strings to the other red string is 147.

And so to get from X to Y, to get from the complete string X to the complete string

Y, one way we can do that is to first edit the red prefix into the other red prefix.

First make those 147 changes that are required to turn one red

prefix into the other.

Then make one more change.

Then make one more substitution which is to turn this C into this A.

So, one way we can turn X into Y is to first turn this prefix into this prefix,

and then do one more substitution.

So that would require a total of 148 edits.

That's the kind of principle that we're going to apply.

If we know the edit distance between prefixes of the two strings,

we can calculate the edit distance between the entire strings.

So let's write these two strings, X and Y, in a slightly different way.

We'll just replace the big long prefix, the red prefix that you see here

with the Greek symbol alpha in the case of the string X and

with Greek symbol beta in the case of the string Y.

And now we can write the strings as alpha followed by C,

the base C, and beta, followed by the base A.

So the edit distance between these two strings can be calculated as

the minimum of three things, three terms.

Let's look at each of these terms.

So the first term here equals the edist between alpha and beta + 1.

And so what this term is saying, which is basically what we said before,

is that one way that we can edit the first string into the second

string is that we can first edit alpha into beta.

Then do one more substitution in order to turn C into A.

This second term here says that another way we can edit first string into

the second is by editing alpha C, alpha followed by C.

By editing that into beta, and

then inserting an A on the end of the second string.

So to say that again, we can edit alpha C into beta, then insert this A.

That's another way we can turn one string into the other.

And then the third term says that another way we can do this is we can edit alpha

into beta A and then insert a C at the end in order to make the strings the same.

So one or more of these three different ways of

editing the first string into the second will have the minimal number of edits.

So if we take a minimum of these three terms,

that gives us the answer that we're looking for.

Okay, and we can generalize this expression so that it's not in terms of

these specific strings and these specific bases, C and A, that you see here.

So let's just modify our expression here.

So now, we have alpha followed by x, which is a variable that could be any base.

And we have beta followed by y,

which again is a variable that could be any base.

And we can re-write our expression this way.

And it's very close to what we saw before, but

one difference is this delta function here.

There's this little bit, this little expression here, delta(x, y).

What does that mean?

Well, delta is just a function that returns 0 if x and

y are the same, or it returns 1 if x and y are different.

So, this makes sense because, again,

the interpretation of this first term is that one way we can change the first

string into the second is to change alpha into beta.

And then, possibly, if needed, to add one more substitution to turn the final

character of one string into the final character of the other.

But sometimes those characters will match, in which case we don't have to do,

we don't have to add one more substitution.

So if the delta function is 0, then they match, and

we don't have to do an additional substitution.

If the delta function is 1, then they don't match, so

we do need to do an additional substitution.

So that's what that delta function means.

And so at the end of the day, we have this expression.

And this expression could be considered a recursive function.

It's a function that calls itself.

That is, if you want to find the edit distance between two strings,

you first could find the edit distance between pairs of strings,

one or both of which might be a prefix of the two strings.

And to do that, this function can call itself recursively.

So to make that clear, let's look at some actual Python code.

So here is a snippet of code that you could use to solve the edit distance

problem.

This very first line of the code is evaluating the delta function.

So it simply evaluates to 1 if we have to add a substitution to turn the final

character of one string into the final character of the other.

This expression here is a minimum over three terms,

which correspond exactly to these three terms here that we just discussed.

So in each term we're recursively calling a function called edDistRecursive,

this function here.

So now let's write all this code in the form of a function called edDistRecursive.

Here it is, here's the code we just saw, except it's part of this function,

edDistRecursive, that takes two arguments, a and b.

And this function isn't quite what we want.

The problem with this function is that it doesn't know when it's done.

And so in other words, this function will keep calling itself with shorter and

shorter prefixes as arguments until eventually we'll get to the point

where one or both arguments is actually the empty string.

At which case, we know what the answer is, the edit distance between an empty string

and some other string is just the length of that other string.

So at that point we don't have to make any more recursive calls,

but as we've written it, the function isn't smart enough to know that.

So in addition to what we've already written, we want to add a little bit more.

We just need the following checks.

So that now we check if either or both of the strings has length 0,

then we'll simply return the length of the other string.

Now we know when we're done.

So, now we have a function that works.

This is a function that will return the edit distance between strings A and B.

Though, it has a problem, which we'll see in the following lecture.