The primary topics in this part of the specialization are: greedy algorithms (scheduling, minimum spanning trees, clustering, Huffman codes) and dynamic programming (knapsack, sequence alignment, optimal search trees).

Loading...

來自 Stanford University 的課程

Greedy Algorithms, Minimum Spanning Trees, and Dynamic Programming

738 個評分

The primary topics in this part of the specialization are: greedy algorithms (scheduling, minimum spanning trees, clustering, Huffman codes) and dynamic programming (knapsack, sequence alignment, optimal search trees).

從本節課中

Week 4

Advanced dynamic programming: the knapsack problem, sequence alignment, and optimal binary search trees.

- Tim RoughgardenProfessor

Computer Science

In this next sequence of videos we're going to study a slightly more intricate

application of the dynamic programming paradigm,

namely to the problem of computing optimal search trees.

Search trees that minimize the average search time, with respect to a given set

of probabilities over the keys. So I'm going to assume in these videos

that you remember the basics of the search tree data structure.

So if you need to jog your memory, you might want to review the relevant video

from part one. So a search tree is they contain objects,

each object has a key drawn from some totally ordered sets.

And the search tree property states that at each node of a search tree, say it

contains some object with a key of value x, it better be the case that everything

in the left subtree of that node has keys less than x and everything in the right

subtree under the node with key x has to have keys bigger than X. So that has to

be true simultaneously at every single node of the search tree.

The motivation for the search tree property is so that searching in a search

tree just involves following your nose, just like binary search in assorted

array. So if you're looking for, say, an object

with key seventeen, you know whether to go left or right from the root, just

based on the root's key value. If the root has key value twelve, you

know where seventeen, if it exists, has to be in the right sub-tree.

So you recursively search the right sub-tree.

If the root has value 23, you know where seventeen has to be in the left sub-tree.

So that's where you recursively search. Something we originally discussed in the

context of balanced binary search trees. Like red black trees.

And I'm going to reiterate now. Is that, for a given set of keys, there

are many, many valid search trees containing those keys.

So just to remind you how this works, let's even just suppose there were only

three keys in the world, x, y, and z, with x less than, y less than z.

One obvious search tree would be the balanced one.

So that would put the middle element y at the roots.

You would have left child x and right child y.

But there's also the two chain search trees containing these three keys, one

with the smallest element x at the root, the other with the largest element z at

the root. So, given the multiplicity of solutions,

all of the different search trees one could use to store objects with a bunch

of keys, an obvious question is which one is the best.

What's the best search tree to use out of all of the possibilities?

So I don't blame you if you've got a sense of deja vu.

We did already ask and answer this question in one sense, when we discussed

red black trees. There we argued that the best thing to

have is a balanced binary search tree that keeps the height as small as

possible and therefore the worst case search time, which is proportional to the

height, as small as possible, namely logarithmic in the number of objects in

the search tree. But now let's make the same kind of

assumption that we made when we discussed Huffman codes.

That is, let's assume that we actually have accurate statistics about how

frequently, each item in the tree is going to be searched for.

So maybe we know that item X is going to be searched for 80% of the time,

whereas Y and Z will only be searched for 10% each.

Could we then improve, upon the perfectly balanced search tree solution?

So let me make this question more concrete, just by asking you to compare

two candidate solutions. On the one hand the balance tree, which

has Y at the root and X and Z as children.

On the other hand, the chain which has X as a root, Y as its right child and then

Z as the right child of Z, excuse me, Z as the right child of Y.

So what is the average search time in each of these two search trees, with

respect to the search frequencies I told you, 80% for X, 10% for Y, and 10% for Z?

And when I say the search time for a node, I just mean how many nodes do you

look at on route to discovering what you're looking for, including that last

node itself. So the search time for something that's

at the root for example, that would be a search time of just one,

because you only look at the root to find it.

All right so the correct answer to the quiz is the fourth option 1.9 and 1.3.

So to see why, let's just compute the average search time in each of the two

proposed search trees. In the first one, with Y at the root,

well, 80% of the time we're going to suffer a search time of two, whenever we

look for X we have to look at the root Y and then we look at the X.

So we pay two 80% of the time and that contributes 1.6.

10% of the time we get lucky, we see a Y, that contributes a.1.

10% of the time we see Z, that contributes another.2 for a total of 1.9.

By contrast think about the chain that has X at the root.

Here 80% of the time we get lucky, and we only have to pay one.

Two for every search for X so that contributes only 8. to the total.

It is true our worst case search time has actually gone up.

When we see its Z we suffer a search time of three which never ever happened in the

balance case. But we pay that three only ten% of the

time, that contributes a.3. The remaining ten% of the time we suffer

a search time of two to look for Y. That gives us a total of 1.3.

And the moral of the story, of course, is that this example exposes an interesting

algorithmic opportunity. So the obvious, quote unquote, solution

of a perfectly balanced search tree, need not be the best search tree when

frequency of access is non uniform. You might want to have unbalanced trees,

like this chain if it gets extremely frequent searches, closer to the roots to

have smaller search time. So the obvious question is then, given a

bunch of items and known frequencies of access, what is the optimal.

Search tree. Which search tree minimizes the average

search time? So that brings us to the formal problem

definition. We're told n objects that we gotta store

in a search tree and we're told the frequency of access of each.

So let's just keep things simple and the notation straightforward, let's just say

the items are named from one, two, all the way up to n, and that is also the

order of their respective keys. So key one is the frequency of searching

for the item with the smallest key, and so on.

You might wonder where these frequencies come from.

How would you know exactly how frequently every possible key will be searched for.

It's going to depend on the application. And you know there will be applications

where you're not going to have these kinds of statistics.

And that's where you'd probably want to turn to a general purpose balanced binary

search tree solution, something like a red black tree, which guarantees you that

every search is going to be reasonably fast.

But it's not hard to think about applications where you are going to be

able to learn pretty accurate statistics about how frequently different things are

searched for. One example might be something like a

spell checker. So if you would implement that by storing

all of the legally spelled words in a search tree, and as you're scanning a

document, and every time you hit a word, you look it up in the search tree to see

if it's correctly spelled or incorrectly spelled.

You can imagine that after, you know, scanning through a number of documents,

you would have pretty good estimates, about how frequently things get searched

for. And then you could use those estimates to

build a highly optimized binary search tree for all future documents.

If you're in some other kind of application where you're concerned about

these frequencies changing over time, so, for example, if they're privy to trends

in the industry, you could imagine rebuilding the search tree every day or

every week or whatever, based on the latest statistics that you've got.

In any case, if you're lucky enough to have such statistics, what you're

going to want to do is to build a search tree, which, on one hand is valid.

It should satisfy the search tree property and on the other hand, should

make the average search time as small as possible.

So let me go ahead and write down a formula for the average search time.

It's the one that you would expect it to be.

And also introduce some notation, namely capital C of T.

We'll denote the average search time of a proposed search tree T.

So for these lectures, we're going to focus on the case where all searches are

successful. The only thing that ever gets searched

for is stuff that's actually in the tree. But everything we'll talk about in these

lectures and the algorithm is easily extended to accommodate the case where

you also have unsuccessful searches and statistics about how frequent the various

unsuccessful searches are. But, if there is only successful

searches, then we average only over the n elements that are stored in the tree.

So we sum over each of the items I, we weight it by the probability or the

frequency of it's access P sub I, and then that gets multiplied by the search

time required in the tree T to find the item I.

And as we discussed in the quiz, the search time for a given key I and a given

tree T is just the number of nodes you have to visit until you get to the one

containing I. So if you think about it, that's exactly

the depth of the node in this tree plus one.

So for example, if you're lucky enough that the key is at the root, the depth of

the root is zero, and we're counting that as a search time as one.

So it's depth plus one. So, one minor point.

It's going to be convenient for me to not insist that the PI's sum to one.

Of course, if the PIs were probabilities, they would sum to one.

But I'm going to go ahead and allow them to be arbitrary positive numbers.

And that, for the same reason, I'm going to sometimes call capital C of T,

the weighted search time rather that the average search time,

because I won't necessarily be assuming that the PIs, sum to one.

With that said, go ahead and, you know? Think of that as the canonical special

case in your mind as we go through these lectures.

So for example, in the case where these are probabilities, where the PIs sum to

one, we could always as a reference point use a red black tree as our search tree.

But, as we've seen, when these PIs are not uniform, you can generally do better.

And so that's the point of this computational problem.

Exploit the non-uniformities in the given probabilities to come up with the best

possibly unbalanced search tree as possible.

So I'm sure many of you will have noticed some of the similarities between this

computational problem of optimal binary search trees.

And one that we already solved back in the greedy algorithm section,

namely Huffkin, Huffman codes, which, amongst all prefix free binary codes,

minimize the average encoding length. So, let's just be precise about the

similarities and differences between the two problems, and in particular, why we

can just reuse the algorithm we already saw off the shelf, to solve the optimal

BST problem. So it's, of course, super similar in the

two cases, is the format of the output. In both problems, the responsibility of

the algorithm is to output a binary tree, and the goal is to minimize the average

depth, more or less, where the average is with respect to

provided frequencies, over a bunch of objects that we care about.

Characters from an alphabet in the case of Huffman codes,

and a bunch of objects with keys from some totally ordered set in the binary

search tree case. And it is true that in the optimal BST

case, we're not really averaging depths, we're averaging depths plus one but if

you think about it that's exactly the same thing.

More important is to understand the differences between the problem solved by

Huffman codes and the computational problem that we have to solve here.

In the Huffman code case, we had to output of binary code and the key

constraint was that it be prefix-free. And in the language of trees what that

meant is that the symbols that we're encoding had to correspond to the leaves

of the tree. Symbols could not correspond to internal

nodes of the tree that we output. Now in the optimal binary search tree

problem we do not have this prefix free constraint, so we're going to have a

label that is an object at every single node of our tree, whether it's a leaf or

not. But, we have a different, seemingly quite

a bit harder constraint to deal with, namely the search tree property.

So remember, back in the Huffman code case we didn't even have an ordering on

the symbols of the, of the alphabet. There wasn't a sense that one of them was

less than another, it wouldn't even make sense to talk about the search tree.

Brought in that context. Here by contrast, we're given these keys

and there's this total lowering on them. And we'd better satisfy the search tree

property in the tree that we output, that is in every single node in the tree that

we output, it better be the case that all keys in the left sub-tree are less than

the key at that node, and all keys in the right sub-tree are bigger than the key at

that node. That's a constraint that we have no

choice but to satisfy. This constraint is harder in the sense

that no greedy algorithm, Huffman's Algorithm or otherwise, solves the

optimal binary search tree problem. Rather we're going to have to turn to the

more sophisticated tool of dynamic programming to design an efficient

algorithm for computing optimal binary search trees.

That's the solution we'll start developing in the next video.