In Section 7.4, we will prove the achievability part of the channel coding theorem. Here are the steps of the proof. First, we consider a DMC with transition matrix p(y|x). For every input distribution p(x), we prove that the rate I(X;Y), the mutual information between the input and the output, is achievable, by showing that for sufficiently large n there exists a channel code satisfying the following two conditions. First, the rate of the code is arbitrarily close to I(X;Y). Second, the maximal probability of error, lambda_max, is arbitrarily small. Then we choose the input distribution p(x) to be one that achieves the channel capacity, that is, one for which the mutual information between the input and the output of the channel is equal to the channel capacity C.

Lemma 7.17 is the key lemma for proving achievability. Let the pair of sequences (X', Y') be n i.i.d. copies of a pair of generic random variables (X', Y'), where X' has the same distribution as X, Y' has the same distribution as Y, and X' and Y' are independent. Then the probability that the pair of sequences (X', Y') is jointly delta-typical with respect to X and Y is less than or equal to 2 to the power minus n times (the mutual information between X and Y, minus tau), where tau tends to 0 as delta tends to 0.

The idea of this lemma is the following. Consider generating a sequence X' in an i.i.d. fashion according to the distribution p(x), and generating another sequence Y' in an i.i.d. fashion according to the distribution p(y), where the generations of X' and Y' are independent. The lemma says that the probability that the pair of sequences X' and Y' generated in this way is jointly typical with respect to the generic pair of random variables (X, Y) is upper bounded by approximately 2 to the power minus n times the mutual information between X and Y. We now prove the lemma.
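In symbols, writing T^n_{[XY]delta} for the set of jointly delta-typical pairs (the notation from the chapters on strong typicality), the lemma to be proved reads:

```latex
\Pr\bigl\{(\mathbf{X}',\mathbf{Y}') \in T^n_{[XY]\delta}\bigr\}
  \le 2^{-n(I(X;Y)-\tau)},
\qquad \text{where } \tau \to 0 \text{ as } \delta \to 0.
```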
Consider the probability in question, that is, the probability that the pair of sequences (X', Y') is jointly delta-typical with respect to X and Y. This probability is equal to the summation of p(x) times p(y) over all pairs (x, y) that are jointly typical with respect to the generic pair of random variables (X, Y). Here the probability of generating a particular pair of sequences x and y is equal to p(x) times p(y), because the random sequences X' and Y' are independent of each other.

By the consistency of strong typicality, for a pair of sequences (x, y) that is jointly typical with respect to X and Y, the sequence x is typical with respect to the generic random variable X, and the sequence y is typical with respect to the generic random variable Y. Therefore, by the strong AEP, every p(x) and p(y) in the above summation satisfies p(x) less than or equal to 2 to the power minus n times (the entropy of X, minus eta), and p(y) less than or equal to 2 to the power minus n times (the entropy of Y, minus zeta), where eta and zeta tend to 0 as delta tends to 0. By the strong joint AEP, the size of the jointly typical set with respect to X and Y is less than or equal to 2 to the power n times (the joint entropy of X and Y, plus psi), where psi tends to 0 as delta tends to 0.

To summarize: the probability that the sequences X' and Y' are jointly typical with respect to the pair of generic random variables (X, Y) is equal to the summation of p(x) times p(y) over all pairs (x, y) that are jointly typical with respect to X and Y. In this summation, each p(x) is at most 2 to the power minus n times (the entropy of X, minus eta), each p(y) is at most 2 to the power minus n times (the entropy of Y, minus zeta), and the number of terms is at most 2 to the power n times (the joint entropy of X and Y, plus psi).
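Collecting these bounds, the calculation can be written as a single chain:

```latex
\begin{aligned}
\Pr\bigl\{(\mathbf{X}',\mathbf{Y}') \in T^n_{[XY]\delta}\bigr\}
  &= \sum_{(\mathbf{x},\mathbf{y}) \in T^n_{[XY]\delta}} p(\mathbf{x})\,p(\mathbf{y}) \\
  &\le 2^{n(H(X,Y)+\psi)} \cdot 2^{-n(H(X)-\eta)} \cdot 2^{-n(H(Y)-\zeta)} \\
  &= 2^{-n(H(X)+H(Y)-H(X,Y)-\psi-\eta-\zeta)}.
\end{aligned}
```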
And so the upper bound becomes 2 to the power minus n times (the entropy of X, plus the entropy of Y, minus the joint entropy of X and Y, minus psi, minus eta, minus zeta). Now the entropy of X plus the entropy of Y minus the joint entropy of X and Y is simply the mutual information between X and Y. Upon defining tau equals psi plus eta plus zeta, we see that this upper bound is given by 2 to the power minus n times (the mutual information between X and Y, minus tau), where tau tends to 0 as delta tends to 0. This proves the lemma.

Here is an interpretation of Lemma 7.17. Consider the strong joint typicality array for the two generic random variables X and Y. Recall that the rows are the typical x sequences such that there exists at least one y which is jointly typical with that x, and the columns are the typical y sequences such that there exists at least one x which is jointly typical with that y. Now the generation of the random sequence X' corresponds approximately to randomly choosing a row of the array, and the generation of the random sequence Y' corresponds approximately to randomly choosing a column of the array. The fact that the sequences X' and Y' are independent of each other corresponds to choosing the row and the column independently. Then the probability that the pair of sequences (X', Y') is jointly typical with respect to X and Y is approximately the probability that the chosen row and column correspond to a pair of sequences (x, y) that is jointly typical with respect to the generic pair of random variables (X, Y).
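As a quick numerical illustration of the lemma (a sketch only, with hypothetical parameters: X is Bernoulli(1/2), Y is the output of a binary symmetric channel with crossover probability 0.1, and strong delta-typicality is checked in its total-variation form), pairs generated from the joint distribution are almost always jointly typical, while independently generated pairs essentially never are:

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical joint distribution: X ~ Bernoulli(0.5), Y = X xor Z, Z ~ Bernoulli(0.1).
P = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}

def jointly_typical(xs, ys, delta):
    """Strong joint delta-typicality, total-variation form: the empirical
    joint distribution of (xs, ys) is within delta of p(x, y)."""
    n = len(xs)
    counts = Counter(zip(xs, ys))
    return sum(abs(counts.get(ab, 0) / n - p) for ab, p in P.items()) < delta

def sample_pair(n):
    xs = [random.randint(0, 1) for _ in range(n)]
    ys = [x ^ (1 if random.random() < 0.1 else 0) for x in xs]
    return xs, ys

n, delta, trials = 1000, 0.08, 200

# Pairs drawn from the joint distribution: jointly typical with high probability.
joint_frac = sum(jointly_typical(*sample_pair(n), delta) for _ in range(trials)) / trials

# X' and Y' drawn independently with the correct marginals: the probability of
# joint typicality is about 2^(-n(I(X;Y) - tau)), astronomically small here.
indep_hits = 0
for _ in range(trials):
    xs, _ = sample_pair(n)
    _, ys = sample_pair(n)  # generated independently of xs
    indep_hits += jointly_typical(xs, ys, delta)
indep_frac = indep_hits / trials

print(joint_frac, indep_frac)  # joint_frac close to 1, indep_frac essentially 0
```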
This probability is approximately equal to the number of dots in the array, which is about 2 to the power n times the joint entropy of X and Y, divided by the product of the number of rows in the array, approximately 2 to the power n times the entropy of X, and the number of columns in the array, approximately 2 to the power n times the entropy of Y. So this probability is approximately equal to 2 to the power minus n times the mutual information between X and Y.

We now describe a coding scheme constructed by a random procedure; accordingly, it is called a random coding scheme. First we fix the parameters of the scheme. Fix epsilon greater than 0 and an input distribution p(x), and let delta be a small quantity to be specified later. Let M, the size of the message set, be an even integer such that 1 over n times log M is greater than I(X;Y) minus epsilon over 2, and less than I(X;Y) minus epsilon over 4, where n is sufficiently large. Note that once the input distribution p(x) is fixed, the mutual information I(X;Y) is determined. With this choice of M, its value is approximately equal to 2 to the power n times I(X;Y). Recall that 1 over n times log M is the rate of the code. The choice of M has the following meaning: the lower bound guarantees that the rate is very close to I(X;Y), while the upper bound guarantees that the rate is not too close to I(X;Y); otherwise the coding scheme would not work.

We now describe the random coding scheme. First, construct a codebook C of an (n, M) code by generating M codewords of length n, with entries in the input alphabet, independently and identically according to p(x). Denote these codewords by tilde{X}(1), tilde{X}(2), all the way to tilde{X}(M). Here is an illustration of the M codewords: each component is generated i.i.d. according to p(x).
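To make the choice of M concrete, here is a small sketch with purely illustrative numbers (n, epsilon, and the value of I(X;Y) are assumptions, not taken from the text): M is chosen as an even integer whose rate falls strictly between the two bounds.

```python
import math

# Illustrative parameters (assumptions, not from the text).
n = 1000          # block length
I_XY = 0.531      # I(X;Y) in bits for the fixed input distribution
eps = 0.1

# Aim at the midpoint of the allowed interval (I - eps/2, I - eps/4) for
# (1/n) log M, and force M to be even, as the scheme requires.
target = n * (I_XY - 3 * eps / 8)
M = 2 * round(2 ** (target - 1))

rate = math.log2(M) / n
assert I_XY - eps / 2 < rate < I_XY - eps / 4
print(rate)  # just below I(X;Y), as required
```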
There are a total of the size of the input alphabet to the power M times n possible codebooks that can be constructed, where M times n is the total number of components of all the codewords; here we regard two codebooks whose sets of codewords are permutations of each other as different codebooks. We then reveal the codebook C to both the encoder and the decoder.

A message W is chosen from the message set according to the uniform distribution, and the sequence X, equal to the codeword for the message W, is transmitted through the channel. The channel then outputs a sequence Y according to the probability that the received sequence is y given that the transmitted sequence is x, which is equal to the product from i equals 1 up to n of p(y_i|x_i).

The received sequence Y is decoded to the message w if tilde{X}(w), the codeword for message w, and the received sequence Y are jointly delta-typical with respect to X and Y, and there does not exist another message w' such that tilde{X}(w'), the codeword for message w', and the received sequence Y are jointly delta-typical. If these two conditions are not satisfied simultaneously, then the received sequence Y is decoded to a constant message in the message set. Denote by W hat the message to which the received sequence Y is decoded. This completes our description of the random coding scheme.
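The whole scheme can be simulated end to end. The sketch below uses illustrative stand-ins (a binary symmetric channel with crossover 0.1 for the DMC, a uniform input distribution, and a rate kept far below I(X;Y) so that a modest block length suffices); it generates a random codebook, transmits a uniformly chosen message, and applies the joint-typicality decoding rule just described.

```python
import random
from collections import Counter

random.seed(1)

# Illustrative stand-ins (assumptions): the DMC is a BSC with crossover 0.1
# and p(x) is Bernoulli(0.5), so p(x, y) has the four cells below.
n, M, delta, crossover = 500, 4, 0.1, 0.1
P = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}

def jointly_typical(xs, ys):
    # Strong joint delta-typicality in total-variation form.
    counts = Counter(zip(xs, ys))
    return sum(abs(counts.get(ab, 0) / n - p) for ab, p in P.items()) < delta

def run_once():
    # Step 1: random codebook, every component i.i.d. according to p(x).
    codebook = [[random.randint(0, 1) for _ in range(n)] for _ in range(M)]
    # Step 2: uniform message W; transmit its codeword through the channel.
    w = random.randrange(M)
    y = [x ^ (1 if random.random() < crossover else 0) for x in codebook[w]]
    # Step 3: joint-typicality decoding -- decode to w_hat iff exactly one
    # codeword is jointly delta-typical with y; otherwise a constant message.
    typical = [v for v in range(M) if jointly_typical(codebook[v], y)]
    w_hat = typical[0] if len(typical) == 1 else 0
    return w_hat == w

trials = 40
success = sum(run_once() for _ in range(trials)) / trials
print(success)  # close to 1: decoding almost always recovers W
```

Because the rate here is far below I(X;Y), a wrong codeword is essentially never jointly typical with the received sequence, which is exactly the role Lemma 7.17 plays in the error analysis.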