We now discuss a very important class of prefix codes, called Huffman codes, which gives a simple construction of optimal prefix codes. For the binary case, the procedure is very simple: keep merging the two smallest probability masses until one probability mass, equal to 1, is left. For the D-ary case, first insert zero probability masses, if necessary, until there are D + k(D − 1) masses for some integer k ≥ 0. If the number of probability masses has this form to start with, then we don't have to add any zero probability masses. Then keep merging the D smallest probability masses until one probability mass, equal to 1, is left. In general there can be more than one Huffman code. Our remaining discussion will focus on the binary Huffman procedure. After that, we will discuss the D-ary Huffman procedure. Here is a very simple example of the binary Huffman procedure. Suppose we have a probability distribution with probability masses 0.35, 0.1, 0.15, 0.2, and 0.2. First, we identify the two smallest probability masses, 0.1 and 0.15, and merge them into one probability mass, 0.25. In the remaining set, we identify 0.2 and 0.2 as the two smallest probability masses, and merge them into one probability mass, 0.4. Now in the remaining set, which consists of three probability masses, we identify 0.25 and 0.35 as the two smallest, and merge them into one probability mass, 0.6. Finally, only two probability masses remain, namely 0.4 and 0.6; obviously they are the two smallest, and merging them gives one probability mass equal to 1. Now we have formed the code tree. By adopting the convention that going up is 0 and going down is 1, we can assign the codewords accordingly. For example, the codeword assigned to the probability mass 0.35 is 00. We are now going to establish the optimality of Huffman codes.
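The merging procedure in the example above can be sketched in a few lines. This is a minimal illustration, not a full encoder: it tracks only the codeword lengths, since each merge pushes every leaf under the merged node one level deeper in the code tree. The function name `huffman_lengths` is ours, not from the lecture.

```python
import heapq

def huffman_lengths(probs):
    """Binary Huffman procedure: repeatedly merge the two smallest
    probability masses; return the codeword length for each mass."""
    # Heap entries: (mass, tiebreak counter, indices of leaves underneath).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tiebreak = len(probs)
    while len(heap) > 1:
        # Merge the two smallest probability masses.
        p1, _, leaves1 = heapq.heappop(heap)
        p2, _, leaves2 = heapq.heappop(heap)
        # Every leaf under the merged node moves one level deeper.
        for i in leaves1 + leaves2:
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, tiebreak, leaves1 + leaves2))
        tiebreak += 1
    return lengths

print(huffman_lengths([0.35, 0.1, 0.15, 0.2, 0.2]))  # [2, 3, 3, 2, 2]
```

On the example distribution this reproduces the lengths read off the code tree: 0.35 gets a codeword of length 2 (such as 00), 0.1 and 0.15 get length 3, and the two masses 0.2 get length 2.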
That is, the expected length of a Huffman code is the smallest possible. Without loss of generality, we assume that the probability masses are indexed in descending order, so that p_1 ≥ p_2 ≥ ... ≥ p_m. Denote the codeword assigned to p_i by c_i and its length by l_i. Theorem 4.17 says that the Huffman procedure produces an optimal prefix code. Before we prove Theorem 4.17, we need to establish two lemmas. First, Lemma 4.15 says that in an optimal code, shorter codewords are assigned to larger probability masses, that is, l_1 ≤ l_2 ≤ ... ≤ l_m. Lemma 4.16 says that there exists an optimal code in which the codewords assigned to the two smallest probabilities are siblings, that is, the two codewords have the same length and differ only in the last symbol. We first prove Lemma 4.15: in an optimal code, shorter codewords are assigned to larger probabilities. Consider a probability distribution p_1, ..., p_i, ..., p_j, ..., p_m such that p_i is strictly greater than p_j. Assume that in a particular code, the codewords c_i and c_j are such that l_i is greater than l_j, that is, a shorter codeword is assigned to a smaller probability. Intuitively, by exchanging c_i and c_j, the expected length of the code should be improved. Specifically, let L be the expected length of the code, given by L = Σ_k p_k l_k. Separating out the terms for i and j, we write L = Σ_{k ≠ i,j} p_k l_k + p_i l_i + p_j l_j. Let L' be the expected length of the code obtained by exchanging c_i and c_j, so L' = Σ_{k ≠ i,j} p_k l_k + p_i l_j + p_j l_i; note that l_i and l_j have traded places. Comparing L' and L, the summation in L cancels with the summation in L', so that L' − L = (p_i l_j + p_j l_i) − (p_i l_i + p_j l_j).
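Collected in one display, the exchange computation reads:

\begin{align*}
L' - L &= (p_i l_j + p_j l_i) - (p_i l_i + p_j l_j)\\
       &= p_i (l_j - l_i) - p_j (l_j - l_i)\\
       &= (p_i - p_j)(l_j - l_i) < 0,
\end{align*}

since $p_i > p_j$ and $l_j < l_i$.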
Rearranging the terms, we have (p_i l_j − p_i l_i) − (p_j l_j − p_j l_i). Factoring out p_i from the first bracket gives p_i (l_j − l_i); similarly, factoring out p_j from the second bracket gives −p_j (l_j − l_i). Then we factor out l_j − l_i to obtain (p_i − p_j)(l_j − l_i). This product is negative, because p_i is strictly bigger than p_j and l_i is strictly bigger than l_j. Therefore, we conclude that L' is strictly less than L. Since the original code can be improved, it is not an optimal code. Therefore, in an optimal code, shorter codewords are assigned to larger probabilities. The lemma is proved. We now prove Lemma 4.16: there exists an optimal code in which the codewords assigned to the two smallest probabilities are siblings. Consider any optimal code. By Lemma 4.15, the codeword c_m has the longest length. Then the sibling of c_m cannot be the prefix of another codeword. This is illustrated in the figure. We claim that the sibling of c_m must be a codeword. To see this, assume that it is not a codeword; since it is also not the prefix of another codeword, we can replace c_m by its parent to improve the code, because the length of the codeword assigned to p_m is reduced by 1 while all the other codewords remain unchanged. This contradicts the assumption that the code is optimal. Therefore, the sibling of c_m must be a codeword. If the sibling of c_m is c_{m-1}, then the code already has the desired property, that is, the codewords assigned to the two smallest probabilities are siblings. If not, assume that the sibling of c_m is c_i, where i < m − 1. Since the code is optimal, by Lemma 4.15 we have l_i ≤ l_{m-1} ≤ l_m. On the other hand, since c_i and c_m are siblings of each other, they have the same length, so l_i = l_m. Therefore, l_i = l_{m-1} = l_m.
That is, c_i, c_{m-1}, and c_m all have the same length. Then we can exchange the codewords c_i and c_{m-1} without changing the expected length of the code, so the code remains optimal, and we obtain the desired code, in which c_m and c_{m-1} are siblings of each other. The lemma is proved. We now introduce the concept of a reduced code tree. Suppose c_i and c_j are siblings in a code tree; then l_i = l_j. If we replace c_i and c_j by a single codeword c_{ij} at their parent, we obtain a reduced code tree, and the probability of c_{ij} is p_i + p_j. Accordingly, the probability set becomes a reduced probability set, with p_i and p_j replaced by the single probability p_i + p_j. Let L and L' be the expected lengths of the original code and the reduced code, respectively. Then L − L' = p_i l_i + p_j l_j − (p_i + p_j)(l_i − 1), since the codeword c_{ij} has length l_i − 1. Now replace l_j by l_i (because l_i = l_j) and factor out l_i to get (p_i + p_j) l_i − (p_i + p_j)(l_i − 1), which equals p_i + p_j. This implies that L = L' + (p_i + p_j). This relation says that the difference between the expected length of the original code and the expected length of the reduced code depends only on the values of the two probabilities merged, not on the structure of the reduced code tree. We now prove Theorem 4.17, which says that the Huffman procedure produces an optimal prefix code. Consider an optimal code in which c_m and c_{m-1} are siblings. Such an optimal code exists by Lemma 4.16, which we have already proved. Let {p_i'} be the reduced probability set obtained from {p_i} by merging p_m and p_{m-1}. As before, let L and L' be the expected lengths of the original code and the reduced code, respectively.
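Applied recursively, the relation L = L' + (p_i + p_j) means the expected length of a Huffman code is simply the sum of the merged masses over all the merges. A minimal sketch of this observation (the function name is ours, not from the lecture):

```python
import heapq

def huffman_expected_length(probs):
    """Expected codeword length of a binary Huffman code, computed by
    summing the merged mass p_i + p_j over every merge, using the
    relation L = L' + (p_i + p_j); no code tree is needed."""
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        p1 = heapq.heappop(heap)  # the two smallest masses
        p2 = heapq.heappop(heap)
        total += p1 + p2          # each merge adds p_i + p_j to L
        heapq.heappush(heap, p1 + p2)
    return total

# For the example distribution, the merged masses are
# 0.25, 0.4, 0.6, and 1.0, which sum to 2.25.
print(huffman_expected_length([0.35, 0.1, 0.15, 0.2, 0.2]))
```

The base case is consistent: with two masses left, the final merge contributes p_1 + p_2, which is exactly the expected length of a code with two codewords of length 1.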
Since L = L' + p_{m-1} + p_m, we see that L is the expected length of an optimal code for {p_i} if and only if L' is the expected length of an optimal code for {p_i'}. Therefore, if we can find an optimal code for {p_i'}, we can use it to construct an optimal code for {p_i}. Note that by merging p_m and p_{m-1}, the size of the problem, namely the total number of probability masses, is reduced by one. To find an optimal code for {p_i'}, we again merge the two smallest probabilities in {p_i'}. This is repeated until the size of the problem is eventually reduced to 2, that is, until the probability set has only two probability masses. In this case, we know how to solve the problem, because an optimal code has two codewords of length 1, regardless of the values of the probabilities. In the last step of the Huffman procedure, two probability masses are merged, which corresponds to the formation of a code with two codewords of length 1. Thus the Huffman procedure indeed produces an optimal code. This is illustrated in the example that we have seen before, where the two probability masses 0.6 and 0.4 are merged into a probability mass 1.
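To see Theorem 4.17 at work on the earlier example, we can check the Huffman expected length, 2.25, against a brute-force search over all integer codeword-length vectors satisfying the Kraft inequality, which characterizes exactly the length vectors realizable by some prefix code. This sketch and the function name are ours:

```python
from itertools import product

def min_prefix_expected_length(probs):
    """Brute-force minimum of sum(p_i * l_i) over all integer length
    vectors satisfying the Kraft inequality sum(2^{-l_i}) <= 1."""
    m = len(probs)
    best = float("inf")
    # Lengths longer than m are never needed in an optimal binary code.
    for lens in product(range(1, m + 1), repeat=m):
        if sum(2.0 ** -l for l in lens) <= 1.0:
            best = min(best, sum(p * l for p, l in zip(probs, lens)))
    return best

# Matches the Huffman expected length 2.25 for the example distribution.
print(min_prefix_expected_length([0.35, 0.1, 0.15, 0.2, 0.2]))
```

The minimum is attained by the length vector (2, 3, 3, 2, 2), exactly the lengths the Huffman procedure produced, in agreement with the theorem.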