The primary topics in this part of the specialization are: greedy algorithms (scheduling, minimum spanning trees, clustering, Huffman codes) and dynamic programming (knapsack, sequence alignment, optimal search trees).

Loading...

來自 Stanford University 的課程

Greedy Algorithms, Minimum Spanning Trees, and Dynamic Programming

487 個評分

The primary topics in this part of the specialization are: greedy algorithms (scheduling, minimum spanning trees, clustering, Huffman codes) and dynamic programming (knapsack, sequence alignment, optimal search trees).

從本節課中

Week 4

Advanced dynamic programming: the knapsack problem, sequence alignment, and optimal binary search trees.

- Tim RoughgardenProfessor

Computer Science

At the beginning of the course, we talked about the sequence alignment problem,

a problem which is fundamental to modern computational genomics.

And we talked about the need for an efficient algorithm for solving that

problem, for finding the best alignment of two strings.

I'm pleased to report that at this point we're well prepared to give such an

algorithm. Indeed, such an efficient solution will

readily fall out of the dynamic programming recipe that we now have quite

a bit of practice with. So let me briefly jog your memory about

the sequence alignment problem. So the goal here is to compute a

similarity measure between strings, a similarity measure defined as the total

penalty of the best alignment, also known as the Needleman-Wunsch score.

So for example. if you've given as input the strings A G,

G, G C T at A G, G C A, a natural candidate alignment would be to stack

them on top of the other, inserting a gap in the shortest string after the two Gs,

that some sense represent the missing G. This is a pretty good alignment that

suffers from merely two flaws. So first of all, we did resort to adding

a gap in the second string. Second of all, there is a mismatch in the

final column. The A and the T get mismatched.

In general we evaluate the alignment by summing up the penalties of all the flaws

and the sum penalty per gap and the sum penalty per mismatch.

So, a bit more precisely, as input in this computational problem, we're given

two strings. I'm going to call them capital X and

capital Y. I'm going to use little x and little y to

denote the individual characters of these strings.

Let's say the first string capital X has linked M and the second string Y has

linked N. In addition to the two input strings we

assume we're given as input the values for the various types of penalties.

So that we know exactly how much it costs each time we insert a gap.

And for each possible mismatch, we need to know exactly what's the cost of that

mismatch. In principle you could be given a penalty

for matching a letter with itself, but typically that's going to be a penalty of

zero. The space of feasible solutions are just

the ways of inserting gaps into the two strings so that the results have equal

length. I should emphasize that you're allowed to

insert gaps into both of the strings. In the example, we only inserted into one

of the two strings, but in general, you might have an input

where one string is seven characters longer than the other,

and it might turn out that in the optimal alignment, the best thing to do is insert

three gaps at various places in the longer string,

and ten gaps at various places into the shorter string.

And the goal, of course is just to compute amongst all of the exponentially

many alignments, the one that minimizes the total penalty, where total penalty is

just the sum of individual penalties for the inserted gaps and the various

mismatches. So let's not be unduly intimidated by how

fundamental this problem is, and let's just apply the dynamic recipe,

the programing recipe that we have been using all along.

Now remember, the really key insight in any dynamic programing solution is

figuring out what's the right collection of sub problems.

And if you're feeling like your up to the black belt level in dynamic programing,

you might just want to try to guess what are the right collection of sub problems

for sequence alignment? But I don't expect you to be able to do

that at this point. And so, as usual we're going to derive

the correct collection of sub-problems. And we're going to do it by reasoning

about the structure of an optimal solution,

narrowing it down to a small number of candidates composed in various ways from

solutions to smaller sub-problems. Once we've figured out the small number

of possibilities for what the optimal solution could look like in terms of

solutions to smaller sub problems, we'll be able to drive a recurrence which in

effect just does brute force search through the small number of candidates.

And from the recurrence we'll be able to back out.

We'll be able to reverse engineer what are the various sub problems that we

actually care about and that we have to solve.

So let's do a thought experiment. What does the optimal solution have to

look like? And again, remember, this is exactly what

it is that we're trying to compute but that's not going to stop us from

reasoning about it. If someone handed to us on a silver

platter the optimal solution, what would it have to look like?

So consider any old pair of strings, capital X and capital Y, and an optimal

alignment of them. Let's visualize this optimal alignment as

follows. Let's write down the string X plus

whatever gaps get inserted into it on top, and right beneath it we'll write

down the string Y, with whatever gaps are inserted into it.

These two things have exactly the same length.

So to figure out the various cases of the structure for this optimal solution,

let's reason by analogy with the problems we've already solved.

So back when we were looking at independent sets of line graphs, our case

analysis was well, either the final vertex, the rightmost vertex of the path,

is in the optimal solution or it's not. In the knapsack problem, we said well

either the last item is in the optimal solution or it's not.

So we always looked at sort of the last part of the optimal solution, in some

sense the rightmost position. And happily, staring at this alignment,

we see we can once again focus just on the action in the right most, in the

final position. So now I have a question for you.

So in the independent set problem, there were two cases.

The last vertex was either in the optimal solution or it's not.

In the knapsack problem, there were also two cases, the final item was either in

the optimal solution or it's not. So my question for you is, in the

sequence alignment problem: when we focus on what's going on in the final position

of the optimal alignment, how many relevant cases do we have to study?

So the answer I'm looking for is B, three relevant possiblities for the contents of

the final position. Let me explain my reasoning, let's start

with the upper parts of the final position.

observe that if that's a character of the string capital X, it can only be the very

last character. It can only be little X sub N.

That's because that's where this string ends.

Now we don't know that little X sub N is in the final position, there might be a

gap. Similarly, in the bottom part of this

final position, there's two possibilities.

There's a gap, or, if it's a character of y, it has to

be the final character, little y, sub n. So that one seems to suggest four

possibilities, two options for the top,

two options for the bottom. But the hint of talking about relevant

possibilities is that it's totally pointless to have two, a gap in both the

top and the bottom. Why?

Well the penalty for gaps is non-negative, so if we just deleted both

of those gaps we'd get an even better alignment of X and Y.

And in studying an optimal solution, we can therefore assume we never have two

gaps in a common position. So that leaves exactly three cases.

It could be there's no gaps at all, that in fact this alignment matches the

character, little x of m, with little y sub n.

Or it could match the final character of capital X, with a gap.

Or it could match the final character of capital Y with a gap.

So the hope behind this case analysis is that we're going to be able to boil down

the possibilities for the optimal solution to merely three candidates,

one candidate for each of the three possibilities for the contents of the

final position. That would be analogous to what we did in

both the independent set and knapsack problems, where we boiled the optimal

solution down to just one of two candidates, corresponding to whether

either the final vertex or the final item, as the case may be, was in the

optimal solution. Another way of thinking about this is

we'd like to make precise the idea that if we just knew what was going on in the

final position, if only a little birdy would tell us which of the three cases

that we're in, then we'd be done by just solving the some smaller sub-problem

recursively. So let's now state for each of the three

possible scenarios for the final position,

what is the corresponding candidate for the optimal solution, the way in which,

it must necessary be composed with an optimal solution to a smaller sub

problem. So who are going to be the protagonists

in our smaller sub-problem? Well, the smaller sub-problem's going to

involve everything except the stuff in the final position.

So it's going to involve the string's X and Y, possibly with one character

remaining. So let's let x prime be x, with its final

character peeled off. Y prime's going to be y, with its final

character peeled off. So let me just remind you of how I

numbered the three cases. So case one is when the final position

contains the final characters of both of the two strings, that is, when there's no

gaps. Case two is when x, little x of n gets

matched with the gap and case three is when little y of n gets matched with the

gap. Alright, so let's suppose that case one

holds. This means that the contents of the final

position, includes both of the characters little x sub m and little y sub n.

So now what we're going to do is we want to look at a smaller sub problem.

And we want to look at the sub problem induced by the contents of all of the

rest of the positions. We're going to call that the induced

alignment. Since we started with an alignment, two

things that had to equal length, and we peeled off the final position of both, we

have another thing that has equal link so we're justified in calling it an

alignment. Now what is it an alignment of?

Well if we're in case one, that means what's missing from the induced alignment

are the final characters. little X of M and little Y's of N,

which means the induced alignment is a bona fide alignment of X prime and Y

prime. And certainly, what we're hoping is true,

is that the induced alignment is in fact, an optimal alignment of these smaller

strings x prime and y prime. This would say that when we're in case

one, the optimal solution to the original problem is built up in a simple way from

an optimal solution to a smaller sub problem.

We're of course hoping that something analogous happens in cases two and three.

The only change is going to be that the protagonists of the sub-problem will be a

little bit different. In case two, the thing which is missing

from the induced alignment is the final character of x.

So, it's going to be the induced alignment of x prime and y.

Similarly, in case three, the induced alignment is going to be an alignment of

x and y prime. So, this is an insertion, this is a

claim, it's not completely obvious, though the proof isn't hard, as I will

show you on the next slide. But assuming for the moment that this

assertion is true, it fulfills the hope we had earlier.

It says that indeed, the optimal solution can only be one of three candidates, that

one for each of the possibilities for the contents of the final position.

Alternatively it says, that if we only knew which of the three cases we were in,

we'd be done, we can recurse, we could look up a solution to a smaller sub

problem and we could extend it in an easy way to a optimal solution for the

original problem. So lets now move onto the proof of this

assertion. Why is it true that an optimal solution

must be built up from an optimal solution to the relevant smaller sub-problem?

Well all of the cases are pretty much the same argument so I'm just going to do

case one, the other cases are basically the same.

I invite you to fill in the details. So it's going to be the same type of

simple proof by contradiction that we used earlier, when reasoning about the

structure of optimal solutions for the independent set in knapsack problems.

We're going to assume the contrary, we're going to assume that the induced solution

to the smaller subproblem is not optimal, and from the fact that there is a better

solution for the subproblem, we will extract a better solution for the

original problem, contradicting the purported optimality of the solution that

we started with. So when we're dealing with case one, the

induced alignment is of the strings X prime and Y prime, X and Y with the final

character peeled off. And so for the contradiction, let's

assume that this induced alignment, it has some penalty, say capital P. Let's

assume it's not actually an optimal alignment of X prime and Y prime.

That is, suppose if we started from scratch we'd come up with some superior

alignment of X prime and Y prime, with total penalty P star, strictly smaller

than P. But if that were the case, it would be a

simple matter to lift this purportedly better alignment of x prime and y prime

to an alignment of the original strings x and y.

Namely we just reuse the exact same alignment of x prime and y prime, and

then in the final position, we just match x m with y n.

So what is the total penalty of this extended alignment of all of x and y?

Well, it's just the penalty incurred in everything but the final position, and

that's just the old penalty p star, plus the new penalty incurred in the final

position. And that's just the penalty corresponding

to the match of the characters x m and y n.

P star being less than p, of course p star plus alpha x m y n is

less than p plus alpha x m y n. But this second term is simply the total

penalty incurred by our original alignment of X and Y, right?

That alignment incurred penalty capital P, just in the induced alignment of X

prime Y prime, and it's total penalty was just that plus the penalty in the final

position, which is this alpha xn, yn. But that furnishes the contradiction that

we suppose that we started with an optimal alignment of X and Y,

yet here is a better one. So with that contradiction, it completes

the proof of the optimal substructure claim.