Therefore, we usually introduce logarithms to replace multiplication with addition.

Specifically, we take the base-10 logarithms of the transition

and emission probabilities in advance.
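As a quick sanity check, a log10 transform turns a product of probabilities into a sum of log values; the probabilities below are illustrative, not the model's:

```python
import math

# Multiplying many small probabilities underflows quickly;
# summing their log10 values is numerically safer and equivalent.
probs = [0.8, 0.3, 0.2]                      # illustrative probabilities
product = math.prod(probs)                   # 0.048
log_sum = sum(math.log10(p) for p in probs)

# log10 of the product equals the sum of the individual log10 values
print(round(log_sum, 4), round(math.log10(product), 4))
```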

Okay, let’s start with a sequence!

First, let’s draw the recursive matrix of dynamic programming.

There are two states: coding state c and non-coding state n.

Next, we need to set the boundary condition

which is the default probabilistic distribution of the two states.

After the log10 transformation,

the probabilities become -0.097 and -0.699, respectively.
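These two boundary values can be reproduced directly; the initial probabilities of 0.8 (non-coding) and 0.2 (coding) are the values implied by the stated logs:

```python
import math

# initial distribution implied by the log values -0.097 and -0.699
init = {"n": 0.8, "c": 0.2}
log_init = {s: round(math.log10(p), 3) for s, p in init.items()}
print(log_init)   # {'n': -0.097, 'c': -0.699}
```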

Next, we will fill in the cells one by one.

First, we meet the first token: C.

From the emission probability matrix, we can see that

its emission probability under the non-coding state is -0.523.

Adding it to -0.097 gives -0.62.

Similarly, this number will become

-0.699 + (-0.699) = -1.40

under the coding state.
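The first column can be checked the same way; the emission log-probabilities of C (-0.523 non-coding, -0.699 coding) correspond to probabilities of 0.3 and 0.2:

```python
import math

# initial log10 probability + emission log10 probability of token C
v_n = math.log10(0.8) + math.log10(0.3)   # non-coding: -0.097 + (-0.523)
v_c = math.log10(0.2) + math.log10(0.2)   # coding:     -0.699 + (-0.699)
print(round(v_n, 2), round(v_c, 2))       # -0.62 -1.4
```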

Next, we need to move forward by one residue, which requires one step of state transition.

From the transition matrix we can see that the transition probability here is -0.097.

Adding it to the emission probability of residue G under the non-coding state,

which is -0.523, we get -1.24.

Similarly, we can compute the probability of transiting from the coding state

to the non-coding state, which is -1.40 + (-0.398) + (-0.523) = -2.32.

Please note that this value is smaller than the -1.24

coming from the non-coding state, and will thus be discarded.
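This max step is the core of the Viterbi recursion. The two candidate paths into the non-coding cell for token G can be compared directly, using the transition probabilities implied by the stated logs (n→n = 0.8, c→n = 0.4):

```python
import math

log = math.log10
# scores of the previous column (token C)
prev_n = log(0.8) + log(0.3)              # -0.62
prev_c = log(0.2) + log(0.2)              # -1.40
# two ways to reach the non-coding state at token G
stay   = prev_n + log(0.8) + log(0.3)     # n -> n: -1.24
switch = prev_c + log(0.4) + log(0.3)     # c -> n: -2.32
best = max(stay, switch)                  # Viterbi keeps the larger score
print(round(stay, 2), round(switch, 2), round(best, 2))
```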

Similarly, we can finish the succeeding recursions and fill in all the rest cells one by one.

OK. Next we will do the backtracking.

First, pick up the cell at the end with the largest probability.

Start from this cell and backtrack step by step...

OK, we can get the final result once we have the backtracking path.

In other words, we divided the input sequence CGAAAAAATCG

into non-coding regions N and coding regions C.
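The whole walkthrough can be sketched as a tiny log-space Viterbi decoder. The initial and transition probabilities below match the log values used in the transcript; the emission probabilities marked as assumed (A and T in both states, G in the coding state) are illustrative guesses, since the transcript only states a few of them:

```python
import math

def viterbi(seq, states, init, trans, emit):
    """Log-space Viterbi: returns (best log10 score, best state path)."""
    log = math.log10
    # initialization: initial log-probability + emission of the first token
    V = [{s: log(init[s]) + log(emit[s][seq[0]]) for s in states}]
    back = []
    for token in seq[1:]:
        col, ptr = {}, {}
        for s in states:
            # pick the best predecessor state for s (the max step)
            prev, score = max(
                ((p, V[-1][p] + log(trans[p][s])) for p in states),
                key=lambda x: x[1],
            )
            col[s] = score + log(emit[s][token])
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    # backtracking: start from the best final cell and follow the pointers
    state = max(states, key=lambda s: V[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return max(V[-1].values()), "".join(reversed(path))

states = ["n", "c"]
init = {"n": 0.8, "c": 0.2}
trans = {"n": {"n": 0.8, "c": 0.2}, "c": {"n": 0.4, "c": 0.6}}
emit = {
    "n": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},  # A/T assumed
    "c": {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},  # A/G/T assumed
}
score, path = viterbi("CGAAAAAATCG", states, init, trans, emit)
print(path)
```

With these assumed parameters the decoder labels the A-rich middle as coding and the flanks as non-coding, in line with the division described above.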

Due to time limitations, we have only created a very simple MSGP.

However, it can easily be extended simply by adding more states.

The only limitation here is that the emission probabilities of different states,

which here correspond to the residue content, should differ substantially.

Only in this case can we infer the state path from the observed token sequence.

For example, Chris Burge developed a gene prediction algorithm, GenScan, in 1996.

This algorithm sets different states for exons, introns, and UTRs,

which improves its accuracy of prediction considerably,

making it one of the most successful gene prediction tools.

However, it does not differ much from the MSGP we have just discussed with respect to basic ideas;

it just introduces more states.

You can also use similar methods to predict 5' splice sites and so on...

By separating hidden states from observable tokens,

the HMM provides an effective probabilistic framework for bioinformatics data analysis.

This model is one of the most widely used algorithm models in current bioinformatics research.

We will provide you with supplementary materials

concerning more applications of the Hidden Markov Model in bioinformatics.

Thank you, and that's all for this week. See you next week!