0:03

Next we're going to talk about the analysis of path length in binary trees

and this is an important problem to study because,

that's the quantity that we use to predict performance in the analysis

of the binary search algorithm that I talked about before.

So, here's the definition that I gave before for

a binary tree, and for the level and the depth of a node.

When we talk about path length, what we want to compute is

the average length of a path to a node that's in the tree, say an internal node.

So to do that, we compute the total internal path length, and

then dividing it by N gives that average cost.

0:52

And so to compute the total, you just count the number of nodes at each level.

So at level 1 there are 2 nodes, at level 2 there are 4,

at level 3 there are 3, at level 4 there's 1, at level 5 there's 1, and you add that all up.

That's called the internal path length of a binary tree:

it's the sum, over all nodes, of the

length of the path from the root to that node, or the sum over k of k times the number of nodes at level k.
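To make that definition concrete, here's a small sketch (my own, not from the lecture) that computes internal path length both ways, representing a tree as nested (left, right) tuples with None marking an empty link:

```python
def ipl(node, depth=0):
    """Internal path length: sum of depths of all internal nodes."""
    if node is None:
        return 0
    left, right = node
    return depth + ipl(left, depth + 1) + ipl(right, depth + 1)

def level_counts(node, depth=0, counts=None):
    """Number of internal nodes at each level k."""
    if counts is None:
        counts = {}
    if node is not None:
        counts[depth] = counts.get(depth, 0) + 1
        left, right = node
        level_counts(left, depth + 1, counts)
        level_counts(right, depth + 1, counts)
    return counts

# A 3-node "path" tree: root, one child, one grandchild (depths 0, 1, 2).
t = ((None, (None, None)), None)
assert ipl(t) == 0 + 1 + 2 == 3
assert sum(k * c for k, c in level_counts(t).items()) == 3
```

The two assertions check the two equivalent definitions: summing depths node by node, and summing k times the number of nodes at level k.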

1:31

And also, that's a quantity of interest in studying the binary search algorithm.

So there are relationships among these quantities that are easy to prove.

It's all recursive, and there's a lot of notation to be able to talk about

these things, so let's look at some simple relationships among them.

So we're talking about binary trees, and we have a size function,

and a different size function for external nodes.

And then left and right subtrees we'll indicate with tL and tR.

IPL stands for the internal path length.

XPL stands for the external path length.

So that's all the notation.

And again, these are simple quantities to define, as on the previous slide.

So here's some recursive relationships.

So one thing is,

the number of internal nodes in a tree is the number of internal nodes in the left,

plus the number of internal nodes on the right, plus one for the root.

2:49

Internal path length, so this is an interesting one.

If you want the internal path length of the whole tree, you take the internal path

length on the left and the internal path length on the right and add those together,

and then you're off by one on every single node except the root.

So if you add the size of t minus 1, then you're adding that extra link for all the nodes.

So that's the recursive relationship for internal path length,

and for external path length it's similar:

external path length on the left, external path length on the right, but

now you're off by one for every external node, because you didn't take into account the links

that connect the subtrees to the root.

So those are simple, but very useful, recursive relationships.

And then from those relationships, you can prove easy things,

like the number of external nodes is equal to the number of internal nodes plus 1,

and that's just by induction.

3:57

So just plug in from this formula here.

The number of external nodes is the sum of the external nodes in the two subtrees.

By the inductive hypothesis, that's equal to |tL| + 1 plus |tR| + 1.

And then using the recursive relationship for

internal nodes, we're left with the number of internal nodes plus 1.

4:28

And a similar induction shows that the external path length is equal to the internal path length

plus 2 times the number of internal nodes in the tree.

And again, that's inductive, right from the simple recursive relationships

that we talked about.

And again, there are lots of ways to prove those things, too.

Use these as an exercise for getting used to the idea of path length,

and the relationship between the path lengths of

the subtrees and the path length of the whole tree.

That is critical to the analysis that we are going to do.
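As a sanity check on those identities, this sketch (again my own code, with the same hypothetical tuple representation) enumerates every tree shape with up to 6 internal nodes and verifies both facts: external nodes = internal nodes + 1, and external path length = internal path length + 2 times the number of internal nodes:

```python
def trees(n):
    """Yield all binary tree shapes with n internal nodes (None = external)."""
    if n == 0:
        yield None
        return
    for k in range(n):
        for left in trees(k):
            for right in trees(n - 1 - k):
                yield (left, right)

def stats(node, depth=0):
    """Return (internal count, external count, ipl, xpl) for a tree."""
    if node is None:
        return (0, 1, 0, depth)       # one external node at this depth
    l, r = node
    li, le, lipl, lxpl = stats(l, depth + 1)
    ri, re, ripl, rxpl = stats(r, depth + 1)
    return (li + ri + 1, le + re, lipl + ripl + depth, lxpl + rxpl)

for n in range(7):
    for t in trees(n):
        internal, external, i, x = stats(t)
        assert external == internal + 1
        assert x == i + 2 * internal
print("identities verified for all trees with up to 6 internal nodes")
```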

4:57

Okay, so here's our first problem.

We have a random binary tree, so that's a binary tree where all tree shapes

are equally likely, each with probability 1 over the Catalan number.

What's the average path length of a random binary tree?

That's some kind of description of the tree shapes that we saw.

How far are all the nodes from the root and

how far is an average node from the root?

That's sort of what we're saying with that.

So, these are the quantities that we're going to work with.

So Q_Nk is the number of trees with N nodes and internal path length k.

T_N is the number of trees, that's the Catalan numbers, and

the accumulated cost is the total internal path length over all trees.

So we get the average by dividing the accumulated cost by the number of trees.

That's the way that we compute average values of parameters.

So this is for N = 2; in this case, both of the trees

with two nodes have internal path length 1.

So the total accumulated internal path length is 2, and

the average internal path length is 1; that's still not so interesting.

Now this is a little more interesting.

So there are five trees with three nodes.

Four of them have internal path length 3:

0 for the root, 1 for the node below,

and 2 more for the next one.

So that's 3, and four of them have that structure.

One of them has internal path length 2:

the two nodes are both just one step from the root.

So Q_32 equals 1:

there's one tree of size 3 that has internal path length 2, and

there are four trees of size 3 that have internal path length 3.

And T_3 = 5, so

the average internal path length is 14 over 5, that's 2.8.

So those are the figures for three, and here are the corresponding figures for four.

The total path length over all of these trees is 74, and there are 14 of them.

So the expected internal path length

is 74 over 14, about 5.3, for trees of size 4.

And then you can divide that by 4 again to get the average distance

to an internal node, about how far a node is from the root.

So those are the quantities that we're after.

It's always good to do small cases like this to

make sure that you have a good fix on the exact quantities that you're analyzing.
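Doing exactly that small-case check by machine: the following sketch (mine, not the lecture's) tallies internal path length over all tree shapes of sizes 2, 3, and 4, reproducing the distribution Q_32 = 1, Q_33 = 4 and the totals 2, 14, and 74:

```python
from collections import Counter

def trees(n):
    """Yield all binary tree shapes with n internal nodes (None = external)."""
    if n == 0:
        yield None
        return
    for k in range(n):
        for left in trees(k):
            for right in trees(n - 1 - k):
                yield (left, right)

def ipl(node, depth=0):
    """Internal path length: sum of depths of the internal nodes."""
    if node is None:
        return 0
    left, right = node
    return depth + ipl(left, depth + 1) + ipl(right, depth + 1)

for n in (2, 3, 4):
    dist = Counter(ipl(t) for t in trees(n))       # Q_{n,k} for each k
    total = sum(k * c for k, c in dist.items())    # accumulated cost
    print(n, dict(dist), total, total / sum(dist.values()))
```

For n = 3 this prints the distribution {3: 4, 2: 1} with total 14 and average 2.8, matching the slide.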

7:50

Expected path length of a random binary tree.

So, here's the analytic derivation of this,

and these are the things we've defined so far.

So the counting GF is just the Catalan generating function, and

that's all the information we need about the Catalan numbers.

if that's all the information we know about the Catalan numbers.

The cumulative cost generating function is the sum, over

all trees, of the internal path length times z to the tree size.

8:20

And that's also equal to the double sum, on N and on k, of k times the number of trees with path length k, and so forth.

But from now on we're just going to go with the cumulative counting.

And we know that the average is going to be the coefficient of z to the N

in Q divided by the coefficient of z to the N in T.

Now the coefficient of z to the N

in Q is the total internal path length of all trees of size N.

If you divide that by the number of trees of size N, then you get the average

internal path length.

9:31

And so from that definition we can

immediately write down this decomposition.

This first case is for the empty tree, and this other one is for the root.

But if we have that function, we look at this decomposition: ipl(t)

is exactly equal to that, and the size of t is exactly equal to that.

And so we have that double sum for the cumulative cost.

And with that decomposition, or,

looking at it the other way, that construction, now we can rearrange terms.

And you can see, for example, the ipl(tL) z to the |tL| times z to the |tR| terms, that's Q(z)T(z).

So that's one of the terms that we find in there.

And for the other ones, where the |tL| terms come into effect, we get T'(z)T(z).

This formula has those four terms, and

using these two equations it immediately reduces to that simple equation

relating the counting generating function and the cumulative generating function.

The recursive equation that relates those two generating functions.

An extremely simple argument.

It's actually possible to develop this symbolically and

we'll talk about derivations like that in part two.

But it's worthwhile having the explicit decomposition in front of us to see

how directly we get to the equation that matters, which is the equation relating

the generating functions that we can then solve and analyze.

11:33

And then we can just plug in.

T(z), remember, that's the Catalan generating function, and we know what T(z) is.

And T'(z) is just the derivative of that, which has two terms in it.

But there is one thing that simplifies calculations a lot.

1 − 2zT(z): since 2zT(z) is 1 minus the square root of 1 − 4z, this is just the square root of 1 − 4z, so

that takes care of the denominator.

So there's still a little bit of algebra multiplying these two things together, but

there are lots of cancellations, because the square root of 1 − 4z times the square root

of 1 − 4z is just 1 − 4z, so

I'm going to omit some of the algebra.

But the final result is pretty simple.

So that's an explicit equation for the cumulative cost generating function.

And what do we need?

The coefficient of z to the N in that is the internal

path length of all the trees of size N.

12:43

So take all those binary trees and add up all

their internal path lengths, and you get 4 to the N, surprisingly simple.

It's not exactly 4 to the N, but it's asymptotic to 4 to the N, and

that's why we wanted to be precise before, because if you take that and

divide it by the Catalan number, all you're left with is N square root of pi N.

13:06

We divided two huge quantities, but

since we were working with precise results, we get a precise answer.

The average internal path length of a random binary tree

is asymptotic to N square root of pi N.

And that's a very accurate estimate for

the average internal path length.

And we could get it exactly, but the asymptotic answer is

very compelling, because it's a simple and surprising result.
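That asymptotic estimate can be checked numerically without any generating functions, by running the recursive relationship ipl(t) = ipl(tL) + ipl(tR) + N − 1 directly over the counts (a sketch of my own; the cutoff N = 200 is arbitrary):

```python
from math import pi, sqrt

N = 200
T = [1] + [0] * N        # T[n]: number of binary trees with n internal nodes (Catalan)
S = [0] * (N + 1)        # S[n]: total ipl summed over all trees of size n
for n in range(1, N + 1):
    T[n] = sum(T[k] * T[n - 1 - k] for k in range(n))
    # ipl(t) = ipl(tL) + ipl(tR) + (n - 1), summed over all (tL, tR) pairs
    S[n] = sum(S[k] * T[n - 1 - k] + T[k] * S[n - 1 - k]
               + (n - 1) * T[k] * T[n - 1 - k] for k in range(n))

for n in (10, 50, 200):
    print(n, S[n] / T[n], n * sqrt(pi * n))
```

The ratio of the exact average S[n]/T[n] to N square root of pi N tends to 1, though slowly, since the next term in the asymptotic expansion is of order N.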

13:42

So, now let's look at the same thing for random binary search trees.

So, that result explains, or starts to explain, the shape

of a Catalan tree: the average node goes down about the square root of N.

In a 10,000-node tree, that's going down about 100 levels,

a pretty good size depth, and that's the average, so some nodes go down a few hundred.

What about binary search trees?

14:13

Well here's the same calculation for binary search trees.

But now we're counting permutations that result in a binary

search tree with N nodes and internal path length k.

And now it's a little easy because the counting factor is N factorial,

we know that.

But still the cumulated cost is the same way.

It's just that we have to count the number of permutations.

So, in this case, the trees have internal path lengths 4,

5, and 6, as before, but the weighting is different.

So all of these twelve permutations give 4.

There are only four of them that give 5, that's these four, and

the remaining eight give 6.

So the total internal path length of the trees that you

get from all of the 24 permutations is 116.

If you divide that by 24, you get 4.833, which is less than what you get for

random binary trees, because the balanced trees are weighted much more heavily.
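Those counts are easy to verify by brute force; this sketch (mine, not the lecture's code) builds the BST from each of the 24 permutations and tallies the internal path lengths:

```python
from itertools import permutations
from collections import Counter

def bst_ipl(perm):
    """ipl of the BST built by inserting perm into an initially empty tree."""
    tree, total = None, 0
    for key in perm:
        if tree is None:
            tree = [key, None, None]      # [key, left, right]
            continue
        node, depth = tree, 0
        while True:
            depth += 1
            i = 1 if key < node[0] else 2
            if node[i] is None:
                node[i] = [key, None, None]
                break
            node = node[i]
        total += depth
    return total

dist = Counter(bst_ipl(p) for p in permutations(range(1, 5)))
print(dict(dist))                           # 12 permutations give ipl 4, 4 give 5, 8 give 6
print(sum(k * c for k, c in dist.items()))  # total over all 24 permutations: 116
```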

15:38

So we start as usual, but now we have permutations.

And so our size function is the length of the permutation.

But now, when we say ipl, we mean the ipl of a permutation.

Which doesn't make sense, except that for a permutation,

what we want is the internal path length of the BST

you get from that permutation by inserting it into an initially empty tree.

And again, the number of permutations of size N is N factorial, and

that saves us a step in the calculation, as you'll see.

Accumulated cost is the total ipl of binary search trees built from all

permutations.

16:37

Our cumulative cost EGF, it's the same setup as before,

except now we have permutations, not trees, and ipl has that different meaning.

And so to get the expected internal path length of a binary search tree built

from a random permutation, we take N factorial times the coefficient of z to

the N in that, divided by N factorial, and those N factorials cancel out.

So treating it as an ordinary generating

function gives us the divide by N factorial for free.

It's just a little calculation trick, because we're using exponential generating

functions, which means that we divide by N factorial.

17:22

But our probability space is permutations, so we want to divide by N factorial,

so it serves both purposes and saves us the step of dividing.

We just look for the coefficient of z to the N in C(z).

So, next, we have to look at developing a generating function equation for

that C(z), using the recursive definition of the binary tree structure,

according to the way that we construct trees.

18:11

And then, what we're going to do is use the relationship that we talked about

before, that tells us how many permutations lead to the same tree.

And so, that's the decomposition from a permutation

p into the subpermutations pL and pR;

those determine all the permutations that give rise to the same tree.

And ipl(p), it's got the same sort of recursive relationship as before.

We can express everything there in terms of pL and

pR because of this decomposition that we already talked about.

And this one's even simpler than the other one.

There's just one trick that often works with exponential generating functions.

This (|pL| + |pR| + 1) factorial in the denominator

causes a little inconvenience in simplifying this formula.

But we have z to that same power.

So what we do is differentiate to cancel out that one factor.

Then you'll see there's a (|pL| + |pR|) factorial on the bottom.

And then we can mix that with the binomial coefficient.

19:32

So this is a trick that often works with exponential generating functions.

Differentiate, then cancel the (|pL| + |pR|) factorial against

the binomial coefficient, and you're left with a |pL| factorial and a |pR| factorial.

And now you've got a very simple double sum

that completely decomposes and separates.

19:55

The sum of z to the |p| over |p| factorial is just P(z), and

P'(z) is the one that takes care of the |pL| over |pL| factorial terms,

since that's 1 over (|pL| − 1) factorial, so use this equation.

And then again, we get a very simple differential equation now for

the cumulative cost exponential generating function for path length in BSTs.

20:23

And the decomposition's important.

It's got to be precise, it's got to be correct.

But once you have that decomposition, you automatically get a relationship

that the exponential generating function for

the cumulative cost and the counting EGF have to satisfy.

And again, solving for C(z),

just simplifying that and substituting in for

P(z), gives a very simple equation.

So recognize that equation; we saw that in the first lecture, actually.

That's the equation that came up in solving the quicksort recurrence.

I guess it came up in the generating function lecture,

pretty much the same equation.

21:15

So, just with that observation,

we can solve it, because it's a first-order differential equation

that has a simple integrating factor and is not difficult to solve.

So now we can summarize the analysis of

the expected path length of a BST built from a random permutation on this slide.

21:52

Differentiate to simplify.

Substitute to get further simplification.

And that is a generating function that has the solution 2 over (1 − z)

squared times log of 1 over (1 − z), minus 2z over (1 − z) squared.

And those are elementary series that we can now expand to find the coefficients,

as we did before, to show that the average internal path length in

a binary search tree is asymptotic to 2N natural log N.

So there's a log N factor: if you divide by N,

it's about 2 log N to the average node in a binary search tree,

versus about the square root of N in a random binary Catalan tree.
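As a check on the whole derivation: the coefficients of that solution work out to the well-known closed form 2(N + 1)H_N − 4N for the expected internal path length, which this sketch of mine verifies against brute-force enumeration for small N:

```python
from itertools import permutations
from math import factorial
from fractions import Fraction

def bst_ipl(perm):
    """ipl of the BST built by inserting perm into an initially empty tree."""
    tree, total = None, 0
    for key in perm:
        if tree is None:
            tree = [key, None, None]      # [key, left, right]
            continue
        node, depth = tree, 0
        while True:
            depth += 1
            i = 1 if key < node[0] else 2
            if node[i] is None:
                node[i] = [key, None, None]
                break
            node = node[i]
        total += depth
    return total

for n in range(1, 8):
    # exact average over all n! permutations, as a rational number
    brute = Fraction(sum(bst_ipl(p) for p in permutations(range(n))),
                     factorial(n))
    harmonic = sum(Fraction(1, k) for k in range(1, n + 1))   # H_n
    assert brute == 2 * (n + 1) * harmonic - 4 * n
print("closed form matches brute force for N = 1..7")
```

Since H_N is about ln N, this closed form is exactly what gives the 2N ln N asymptotics.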

And the same equation comes up, so

people who have had algorithms courses know that there's

a bijection between quicksort and binary search trees that explains this.

We could have analyzed binary search trees just by taking

advantage of this bijection.

That is, in quicksort, the first entry in the permutation is

the partitioning element, and the smaller ones and larger ones are mixed.

And then after the partitioning, you do the two sides independently.

And that's pretty much the same as in a binary search tree:

the first one is the root, and then you do the left and the right independently.

So you can show that the average number of compares for quicksort is exactly

the average external path length of a BST built from a random permutation.

And that's an interesting bijection to know, to take advantage of.
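Here's one way to see that correspondence in code (a sketch under the convention that partitioning a size-N subarray costs N + 1 compares, as in the classic analysis): quicksort's compare count on a permutation equals the external path length of the BST built from it.

```python
from itertools import permutations

def quicksort_compares(a):
    """Compares used by quicksort with first-element pivot, N + 1 per stage."""
    if not a:
        return 0
    pivot = a[0]
    smaller = [x for x in a[1:] if x < pivot]
    larger = [x for x in a[1:] if x > pivot]
    return len(a) + 1 + quicksort_compares(smaller) + quicksort_compares(larger)

def bst_xpl(perm):
    """External path length of the BST built by inserting perm in order."""
    tree = None
    for key in perm:
        if tree is None:
            tree = [key, None, None]      # [key, left, right]
            continue
        node = tree
        while True:
            i = 1 if key < node[0] else 2
            if node[i] is None:
                node[i] = [key, None, None]
                break
            node = node[i]

    def xpl(node, depth=0):
        if node is None:
            return depth                  # an external node at this depth
        return xpl(node[1], depth + 1) + xpl(node[2], depth + 1)

    return xpl(tree)

# The two quantities agree on every permutation of every small size.
for n in range(1, 7):
    for p in permutations(range(n)):
        assert quicksort_compares(list(p)) == bst_xpl(p)
print("compare count = external path length, verified for N up to 6")
```

With N − 1 compares per stage instead, the count would match the internal path length; the extra 2N compares are exactly the xpl = ipl + 2N identity from earlier.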

Now this same approach works for a lot of parameters of trees, and there's exercises

that compute the number of leaves of trees and other things like that in the text.

23:55

Finding the height of trees is much more intricate.

It's a different approach, because the problem doesn't break down as well recursively,

but it's an interesting problem, because it's a natural thing to

want to know: what's the furthest node from the root in a tree?

That tells us more information about the shape of the tree.

So the height derivation is described in the text.

And actually, for both binary search trees and Catalan trees,

the height was an open problem for quite a while.

So just to summarize what we know about the shapes of

these two different tree structures, we looked at random binary trees and

binary search trees built from random permutation.

Those are typical shapes of those trees.

The average path length, that's the average

distance to a node in a random tree, for binary trees, it's square root of pi N.

For BSTs from random permutation, it's 2 ln N.

And again, in the book there's some description of these results,

although both of the derivations are quite intricate;

the height is now known, for random binary trees, to be about twice that average.

25:11

And for random BSTs, it's a little more than twice.

It's a constant times ln N, where the constant is about 4.31.

And, again, there's lots of things that you might want to study about trees, and

I've only really talked in detail about path length.