We are now ready to describe the soft version of the Viterbi algorithm, which is called Baum-Welch learning. After we solved the Soft Decoding Problem, we constructed the node responsibility matrix, which is a 2 × n matrix that we will denote Π*. Every element of Π*, written Π*(k, i), is equal to the probability Pr(πi = k | x). As we showed in the previous section, we can also construct the edge responsibility matrix, which we denote Π**, whose element Π**((l, k), i) is equal to the probability Pr(πi = l, πi+1 = k | x). In the case of the crooked casino, this Π** matrix is a 4 × n matrix. Now, for simplicity, let's denote both matrices Π* and Π** by a single symbol, uppercase Π.

Baum-Welch learning alternates between two steps. In one step, given the emitted string and the current HMM parameters, it re-estimates the responsibility profile Π; this is called the E-step of the Baum-Welch algorithm. In the other step, it re-estimates the HMM parameters given the current responsibility profile: given the emitted string and Π, it derives the parameters. This is called the M-step. For the E-step, we already know how to derive Π, because it simply amounts to solving the Soft Decoding Problem. But how do we implement the M-step? Given the emitted string and Π, how can we derive the parameters? When the hidden path is known, we know how to define a transformation from the emitted string x and the hidden path π to the parameters; this transformation uses the estimators T(l, k) and Ek(b) based on the path π. The problem now is that we don't know the hidden path π; the only thing we know is the responsibility matrices Π. So our goal is to define a transformation that, given the emitted string and the responsibility matrices, derives the parameters in the case when the hidden path is unknown. Our idea is to use the expected values of the estimators T(l, k) and Ek(b) over all possible paths to derive the parameters.
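The E-step, solving the Soft Decoding Problem, can be sketched with the standard forward-backward algorithm. Below is a minimal illustration; the function and variable names (`responsibilities`, `trans`, `emit`, `start`) and the dictionary-based HMM representation are my own choices, not from the lecture, and a practical implementation would work in log space to avoid underflow on long strings.

```python
def responsibilities(x, states, trans, emit, start):
    """Compute node (Pi*) and edge (Pi**) responsibilities for string x.

    trans[l][k]: transition probability l -> k
    emit[k][b]:  probability that state k emits symbol b
    start[k]:    initial probability of state k (an assumption; the
                 lecture does not specify how the path starts)
    """
    n = len(x)
    # forward[i][k] = Pr(x_1..x_i, pi_i = k)
    fwd = [{k: start[k] * emit[k][x[0]] for k in states}]
    for i in range(1, n):
        fwd.append({k: emit[k][x[i]] * sum(fwd[i - 1][l] * trans[l][k]
                                           for l in states) for k in states})
    # backward[i][k] = Pr(x_{i+1}..x_n | pi_i = k)
    bwd = [dict() for _ in range(n)]
    bwd[n - 1] = {k: 1.0 for k in states}
    for i in range(n - 2, -1, -1):
        bwd[i] = {k: sum(trans[k][l] * emit[l][x[i + 1]] * bwd[i + 1][l]
                         for l in states) for k in states}
    px = sum(fwd[n - 1][k] for k in states)  # Pr(x)
    # Node responsibilities: Pr(pi_i = k | x)
    node = [{k: fwd[i][k] * bwd[i][k] / px for k in states} for i in range(n)]
    # Edge responsibilities: Pr(pi_i = l, pi_{i+1} = k | x)
    edge = [{(l, k): fwd[i][l] * trans[l][k] * emit[k][x[i + 1]]
             * bwd[i + 1][k] / px
             for l in states for k in states} for i in range(n - 1)]
    return node, edge, px
```

For the crooked casino, `states` would be the fair and biased coins; each column of `node` and of `edge` sums to 1, since the responsibilities in a column are probabilities conditioned on the same emitted string.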
Let me remind you how we defined the estimators for the parameters in the case when the hidden path is known, and then redefine these estimators to prepare you for the case when the hidden path is unknown. Remember that we defined T(l, k) as the number of transitions from state l to state k in the path π. In the example shown, there are six transitions from state B to state F, and we can represent them by a vector of indicators Ti(l, k), where Ti(l, k) is equal to 1 if πi = l and πi+1 = k, and 0 otherwise. We obtain the number of transitions T(l, k) by simply summing up all the values Ti(l, k). So I will rewrite the previous estimator for T(l, k) as simply the sum of the estimators Ti(l, k). That is the redefinition of the estimator for T(l, k), but what about the estimator for Ek(b)? I remind you that Ek(b) is the number of times the symbol b is emitted when the path π is in state k. In this case, T is emitted six times from the state F, and I will redefine this estimator as the sum of the six 1s in the shown vector, where Eik(b) is equal to 1 if πi = k and xi = b, and 0 otherwise. Therefore, I will write the estimator for Ek(b) as simply the sum of all values Eik(b).

Now that we have rewritten the estimators for the parameters in the case when the hidden path is known, how would we redefine these estimators for the case when π is unknown? Our new estimators above are binary variables equal to 0 or 1; for example, Ti(l, k) = 1 if πi = l and πi+1 = k, and 0 otherwise. To adapt them to the case when the path is unknown, we simply define Ti(l, k) as the probability Pr(πi = l, πi+1 = k | x). Likewise, we define Eik(b) as the probability Pr(πi = k | x) if xi = b, and 0 otherwise. And you of course notice that these values have already been computed: Ti(l, k) is simply an element of the matrix Π**, and Eik(b) is simply an element of the matrix Π*.
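The expected-count estimators above translate directly into an M-step: sum the edge responsibilities over positions to get the expected transition counts T(l, k), sum the node responsibilities over the positions where xi = b to get the expected emission counts Ek(b), and then normalize each row. A minimal sketch, with illustrative names (`m_step`, `node`, `edge` are assumptions, not lecture notation):

```python
def m_step(x, states, alphabet, node, edge):
    """Re-estimate HMM parameters from responsibility matrices.

    node[i][k]     ~ Pr(pi_i = k | x)                 (elements of Pi*)
    edge[i][(l,k)] ~ Pr(pi_i = l, pi_{i+1} = k | x)   (elements of Pi**)
    """
    n = len(x)
    # Expected transition counts: T(l, k) = sum over i of edge responsibilities.
    T = {l: {k: sum(edge[i][(l, k)] for i in range(n - 1))
             for k in states} for l in states}
    # Expected emission counts: E_k(b) = sum of node responsibilities
    # over the positions i where x_i = b.
    E = {k: {b: sum(node[i][k] for i in range(n) if x[i] == b)
             for b in alphabet} for k in states}
    # Normalize each row to obtain transition and emission probabilities.
    trans = {l: {k: T[l][k] / sum(T[l].values()) for k in states}
             for l in states}
    emit = {k: {b: E[k][b] / sum(E[k].values()) for b in alphabet}
            for k in states}
    return trans, emit
```

When the hidden path is known, the responsibilities are exactly the 0/1 indicators described above, and this function reduces to the counting estimators; when the path is unknown, the same code consumes the probabilistic entries of Π* and Π**.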
And therefore, with these estimators, we can now run the Baum-Welch algorithm. The only thing left to define is the stopping rule for this algorithm. To define the stopping rule, we can compute the probability that the HMM emits the string x under the current parameters. The parameters will be changing during the course of the algorithm, so we can compare this probability with its value under the previous parameters and stop if the difference is small; alternatively, we can stop Baum-Welch learning after a fixed number of iterations.

Now that we have learned how to train hidden Markov models, it may appear that we are ready to classify proteins. But we are not quite ready yet, because proteins are usually built from multiple independent domains, each domain representing a conserved part of the protein that can often function independently. As a result, each domain of a protein has its own function, and nature uses domains as building blocks, shuffling them to create multi-domain proteins. So instead of classifying entire proteins, it makes more sense to classify domains within proteins into families. Even so, sequence similarity between domains may be so subtle that it evades traditional sequence alignment algorithms. Therefore, to analyze protein domains, we first need to use alignments to break proteins into domains. Then, after we have broken proteins into domains, we can construct an alignment of the individual domains from a given family, starting from highly similar domains for which the alignment is noncontroversial. Afterwards, for each family, we can construct a profile HMM and estimate its parameters. Then, when a new protein comes into consideration, we can align it against each profile HMM to find the best HMM for this protein, and that will result in the classification of its domains. To accomplish the task of protein domain classification, biologists often use the Pfam database of profile HMMs.
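The Baum-Welch stopping rule described at the start of this section can be sketched as a generic driver loop that alternates E- and M-steps and halts once the likelihood Pr(x) stops improving, or after a fixed number of iterations. The callback interface below (`e_step` returning the responsibilities together with Pr(x), `m_step` returning new parameters) is an assumption for illustration, not lecture notation:

```python
def baum_welch(x, params, e_step, m_step, tol=1e-6, max_iter=100):
    """Alternate E- and M-steps until Pr(x) barely changes.

    e_step(x, params) -> (responsibilities, Pr(x) under current params)
    m_step(x, responsibilities) -> new params
    """
    prev_px = None
    for _ in range(max_iter):
        resp, px = e_step(x, params)
        # Stopping rule: halt once the likelihood change is below tol
        # (or after max_iter iterations, whichever comes first).
        if prev_px is not None and abs(px - prev_px) < tol:
            break
        prev_px = px
        params = m_step(x, resp)
    return params
```

In practice one compares log-likelihoods rather than raw probabilities, since Pr(x) underflows for long strings; the structure of the loop is the same.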
Each domain family in Pfam has a seed alignment, which is an initial multiple alignment of the domains in this family. It has a profile HMM, built from the seed alignment, for searches against newly sequenced proteins. And it also has an optional full alignment, which is an enlarged multiple alignment generated by aligning new domains against the seed HMM. And now you are ready for the final challenge. Take the Pfam HMM for HIV glycoprotein gp120, constructed from a seed alignment of just 24 HIV gp120 proteins. Using this Pfam HMM, you need to construct an alignment of all known gp120 proteins and identify the most diverged gp120 protein sequence.