我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

Loading...

來自 Johns Hopkins University 的課程

DNA 测序算法

401 個評分

我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

從本節課中

Edit distance, assembly, overlaps

This week we finish our discussion of read alignment by learning about algorithms that solve both the edit distance problem and related biosequence analysis problems, like global and local alignment.

- Ben Langmead, PhDAssistant Professor

Computer Science - Jacob Pritt

Department of Computer Science

Now we're going to modify our edit distance function from

the last practical to instead do global alignment.

So, if you recall, in the last practical for

edit distance we have a penalty of one for skip characters and for mismatches.

But in reality we might want to penalize certain kinds of errors more.

For example, a skip character,

we might want to penalize that more than just a simple mismatch.

So we're going to do that.

So I've already pasted in here my edit distance function,

which I'm going to modify to global alignment.

So to start with here, I'm just going to define the alphabet we're working with.

And then I wanted to find a two dimensional array which will have

all the penalties for the different mismatches we have.

So I'm going to call this score.

And this is, we have four characters in our alphabet.

So, I'm going to make this a five by five array.

So we'll have the mismatch of each character to each other character.

As well as the skips, the pounding for skipping each character.

So, the top left.

Four by four of this array will have the mismatches in each character against

each other character.

So A against A is a match, so we're not going to penalize that at all.

A mismatch of A to C would be a penalty of four.

A mismatch of A to G would be a penalty of two.

And a mismatch of A to T will be a penalty of four.

So, since A and G are both purines right?

>> Mm-hm.

>> Since those are both purines, they're more likely to have a mismatch

there than A against C or T, which are a different permadines is

a different type of nucleotide, a different, slightly different structure.

So we're penalizing the C and T more than G, right?

And then the last character in this row will just be if there, if an A is skipped,

so we're going to gives a penalty of E.

So going to do this for each row now.

So this is the row for C.

So C against A is a penalty of four, C against itself is zero.

Four, two, and then, for the skip is eight.

This is for G, so it'll be two, four, zero, four eight.

For T it should be 4, 2, 4, 0, 8 and

then the last row would just be all skips.

Just like that, did I get that right?

>> Mm-hm.

>> Okay, so this is our penalty matrix that we're going to use.

>> Okay.

>> So now, let's modify this edit distance functioning to use that.

So I'm going to rename this, global alignment.

The first three lines here are defining D.

Just filling it out with zeros.

Those don't have to change, at all.

Now these next steps are where we initialize the first row and

the first column.

So, for example, the first row of this matrix corresponds

to if we were to skip the first I characters of Y.

So we have to calculate the penalties for the skips for that.

So the top left character in our array is 0 and then we're just going to,

for each character in the first row, we're going to take the character immediately

to the left of it and add the penalty of skipping the current character in y.

So I just want to take the range from one.

Up to the N, since I don't want to touch the top left character.

That will always be zero.

And I take the character, i- 1, and I add the correct

penalty from that score function, score

alphabet.index of X to i minus one.

So this alpha dot index will give us the index of

the current character x in that so a number from one to four,

that will tell us what road to look at in our score array.

And then we're going to take the last value of that row,

because that will give us the penalty for skipping that character.

Does that make sense?

>> Mm-hm.

>> Okay, cool. And then we'll do the same thing for

the column.

So I want to go from one up to the end of y.

I'll take character directly above it in the column.

And now add on it, score -1.

Let's take the last row of that array and

then alphabet.index of y-1.

>> You need to do i minus 1 instead of minus 1.

>> Oh right, yes.

Now, we just have to fill in the penalties here, for

our horizontal distance, vertical distance and diagonal distance.

So, the horizontal distance,

the penalty here would be if we were to skip a character in y.

So, we want to do this similar to what we just did above.

Score the last row of that.

I'll get the character at J-1 and Y.

And so this will just add the penalty for

skipping whatever is the current character in Y.

And similarly for the vertical distance, we have to calculate the penalty for

skipping the character in x.

Now for the diagonal distance,

this penalty will be the score for a mismatch between two characters.

So, we'll just calculate the mismatch between the current characters in and

x and y.

Just like that, and then the rest of our function won't change.

So we're still going to calculate the minimum of these edit distances for

the current value in d.

And then at the end of our function we just return the value in the bottom right

corner of this already.

>> So we only had to change one, two, three, four,

five lines in the whole function.

>> Yeah. >> It's pretty similar.

>> Okay, so now let's test this out with a few strings.

So I'm going to just make up a random string X.

And I'll set Y initially to just be the same thing

And the other distance between two identical strings as we would

expect is zero.

If we have a skip in here, we skip a character,

now we have penalty of eight which is what we defined in our matrix

If instead of a skip we have a replacement, say we replace G with a C.

Now our penalty is four.

>> That's the transversion penalty.

>> Right.

If we replace G with A, which is another purine now we only have a penalty of two.

And if we have maybe a mismatch there and then a skip down here,

now we have a penalty of 10 which is 2 plus 8 for the mismatch and the skip.

So that's our global alignment function.