So let me explain what I hope is clear and what is maybe unclear at this
juncture. I hope what's intuitively clear is that
the bottom of approaches, a systematic way to build trees that have a prescribed
set of leaves. So what do we want?
We want trees whose leaves are labeled with the symbols of the alphabet sigma.
So if we have an alphabet with n symbols, we're going to start with just the N
leaves. What does a merger do?
Well, there's two things. First of all, it introduces a new
internal node, an unlabelled node. And secondly, it takes two of our old
subtrees and fuses them into one, merges them into one.
We take two subtrees, we make one the left child of this new internal node, the
other, the right child of this new internal node.
So that drops the number of subtrees we're working with by one.
So if we start with the N leaves and we do N minus one successive mergers, what
happens? Well on the one hand, we introduce N
minus one new unlabeled internal nodes. And on the other hand, we construct a
single tree. And the leaves of this tree that we get
are in one to one correspondence with the alphabet letters as desired.
Now what I don't expect you to have an intuition for is what should we be
merging with what and why? Forget about, you know, how do we get an
optimal tree at the end of the day. I mean, even just if we wanted to design
a greedy algorithm. If we just wanted to make a myopic
decision, that looks good right now, how would we even do that?
What's our greedy criteria that's going to guide us to merge a particular pair of
trees together? So we can re-frame this quandary in the
same kind of question we asked for minimum cost spanning trees and really
more generally with greedy algorithms. When you're making irrevocable decisions
which strikes fear in your heart, is that this decision will come back and haunt
you later on. You'll only realize at the end of the
algorithm that you made some horrible mistake early on in the algorithm.
So just as for MST's, we ask, you know, when can we be sure that including an
edge irrevocably is not a mistake, it's safe in the tree that we're building?
Here we want to ask, you know, we have to do a merger, we want to do successive
mergers and how do we know that a merger is safe?
That it doesn't prevent us from eventually computing an optimum solution.
Well here's the way to look at things that will at least give us an intuitive
conjecture for this question. We'll save the proof for the next video.
So what are the ramifications when we merge two subtrees, each containing a
collection of symbols? Well, when we merge two subtrees, we
introduce a new internal node which unites these two subtrees under them, and
if you think about it, at the end of the day on the final tree, this is yet
another internal node that's going to be on the root to leaf path,
for all of the leaves in these two sub trees.
That is, if you're a symbol and you're watching your subtree get merged with
somebody else. You're bummed out, your like, man that's
another bit in my encoding. That's yet one more node I have to pass
through to get back to the root. I think this will become even more clear
if we look at an example. So naturally, we'll use our four symbol
alphabet A, B, C, D. And initially, before we've merged
anything, each of these is just its own leaf A, B.
C, D. So there's no internal nodes above 'em.
In the sense, everybody's encoding length at the beginning is zero bits.
So now, imagine we've merged C and D, we introduce a new internal node, which is
the common an-ancestor of these two leaves.
And as a result, C and D are bummed out. They said, well, there's one bit that
we're going to, to have to incur our encoding length, there's one new internal
node we're always going to have to pass through enroute back to the root of the
eventual tree. Now suppose next we merge B with the
subtree containing both C and D. Well everybody but A is bummed out about,
about this merger because we introduce another internal node, and it contributes
one bit to the encoding of each of B, C, and D.
It contributes an extra one to the encoding of C and D, and it contributes a
zero to the encoding of B. So all of their encodings in some sense
inc, get incremented, go up by one, compared to how things were before.
Now in the final iteration, we have to merge everything together.
We have no choice, there's only two sub-trees left.
So here, everybody's encoding length gets bumped up by one.
So, A finally picks up a bit at zero to encode it and B, C, and D each pick up an
additional bit of one, which is prepenned into their encodings thus far.
And, in general what you'll notice is that the final encoding length of a
symbol, is precisely the number of mergers that its subtree has to endure.
Every time your subtree gets merged with somebody else you pick up another bit in
your encoding, and that's because there's one more internal node that you're going
to have to traverse enroute to the root of the final tree.
In this example, the symbols C and D, well they got merged with somebody else
in each of the three iterations. So, they're the two symbols that wind up
with the encoding length of three. The symbol B, it didn't get merged with
anybody in the first iteration, only the second two.
That's why it has an encoding length of two.
So this is really helpful. So this, lets us relate, the operations
that our algorithm actually does, namely mergers back to the objective function
that we care about, namely the average encoding length.
Mergers increase the encoding lengths of the participating symbols by one.
So this allows us to segway into a design of a greedy heuristic for how to do
mergers. Let's just think about the very first
iteration. So we have our N original symbols, and we
have to pick two to merge. And remember the consequences of a merge
is going to be an increase in the encoding length by one bit, whichever two
symbols we're going to pick. Now we want to do is minimize the average
encoding length with respect to the frequencies that were given.
So which pair of symbols are we the least unhappy to suffer an increment to their
encoding length, was going to be the symbols that are the least frequent.
That's going to increase your averaging poding length by the least.
So that's the green merging hiuristic. Somethings gotta increase.
Pick the ones that are the least frequent to be the ones that get incremented.
So that seems like a really good idea of how to proceed in the first iteration.
The next question is, how are we going to recurse?
So, I'll let you think about that in the following quiz.