[MUSIC] Okay, so today I'm going to talk about modeling, the importance of modeling in computer science. And I'm going to illustrate it with a very important example from my own research area, which is computational biology. The example concerns the problem, a central problem in biology known as the sequence alignment problem, or more specifically, pair wise sequence alignment. So, what is this problem? Biologists often have this issue that they get a sequence, a DNA sequence for a certain gene from human, and also from a different organism. For example, they look at the human. And take a sequence that looks like this, ACCT. And they look at another organism, let's say, mouse. They take, what we call a homologous sequence or the counterpart of the sequence in mouse, and the sequence looks like ACTC. Okay, let's see. C another C here. No, if you look at these two sequences, we, the biologist says that I know these two sequences evolve from a common ancestor, somehow a long time ago when, when human and mouse had a common ancestor. These two sequences came from one sequence. They used to be one sequence millions of years ago. Okay, so now again from human we have ACCT and here we have ACTCC and they came from a common ancestor here. Of course we have no record for this sequence, this one, again this happened millions of years ago and we cannot just find this sequence easily, we cannot find the sample in the, in nature, that we can sequence and say, okay, this is how it used to be, right. All right, so now the biologist is interested in the following problem. Given these two sequences which biologists can easily get, I mean we can easily today get DNA from human, and DNA from mouse. They say I want you to align them. Okay? I want you to align them. And through this alignment process, basically what the biologist gets is reversing this process. When I align these two sequences, we are basically telling the biologist what type of mutations happened. Now, when a biologist tells me this, that I want you to align these sequences. I need to understand from the biologist first, is that, why are the two sequences different, of different length. The content is different also. And what does the biologist mean by aligning them. Okay? So, when I talk to the biologist, the biologist says that by aligning, I mean make them of equal length by inserting dashes in them. Okay? So, you can insert dash in any place in the sequence and align a letter with a letter or a letter with a dash and you can make them of equal length. So, for example, when I talk to the biologist, the biologist might tell me, here's an example of an alignment. ACCT- ACTCC. So, this way, basically, we are aligning this A with A, C with C, C with T, T with C. Because for this C, there's no letter, we ent, we insert a dash. So, this is a possible alignment. [BLANK_AUDIO] Okay? Now, if I look at this, what the biologist has told me, then I can go to a biologist saying, is this an alignment? [BLANK_AUDIO] I inserted a dash, but here, between the A and C. And it does, it does, it looks like an alignment because it's of the same, both of them are the same length. And I'm aligning a letter with a letter or a letter with a dash, right? And biologist say, yes this is an alignment. And then I say, is this an alignment? And the biologist say, no I don't like this because this is meaningless. Right? I'm saying there was a, there was no letter here, no letter here. And this is unnecessary to add. So, biologist tells me that no, this is not an alignment. Okay? So now I start getting some information about how to model the structure of an alignment. So, to model the structure of an alignment, given two sequences, basically I need to enter dashes in one or two of the sequences so that I can make them aligned in terms of, that is, they are both of the same length. And when I enter these dashes, then I have to make the sequences of the same length such that either a letter goes with a letter, or a dash with a letter. But no dash with dash, okay? So, this is a possible alignment, this is a possible alignment. And this is the first step in, towards answering the problem and solving the questions that the biologist is interested in, which is how do we model an alignment, what is the actual mathematical object that captures an alignment. And in this case, it is we can formalize it very cleanly, mathematically by saying, given two sequences, you can enter dashes in any of them or both. Such that you make them of the same length. And such that no dash goes with a dash, either a letter with a letter or a dash with a letter. And, of course, you cannot change the order of the letters in any of the sequences. So I cannot take ACCT and make it ACTC, for example. All right? So now, the first phase of, of modeling here was to get a notion of feasibility. What is a feasible alignment versus what's not feasible. This is not allowed. The biologist doesn't like something like this. So, now that I understand from a biologist what he or she wants from me in terms of what is an alignment, I can look at this. Now we can use mathematical skills to say, okay, what is the number of such possible feasible alignments? And I'm not going to show the number here. But believe me, the number is going to be very large. Right? In the sense that if you give me a sequence of length four and a sequence of length five, there are many, many ways of aligning them. For example, one that you are not seeing here, is, for is ACCT. They all go with dash here, and ACTCC, they all go with dashes here. This is another possible alignment. Okay? It is feasible the, both of the same length. A letter goes with a dash or a letter with a letter. Nothing is wrong with it. And I can keep adding more and more. So the first step of modelling the feasibility of an alignment. What is an alignment? That is fine. All right? And using mathematical tools now I can reason about what is the number. Once I reason about the number and find it's very large. Now I understand I need also a way to choose among these. So I can tell the biologist, here's an alignment. Here's an alignment. Here's an alignment. And I can keep listing them. But the biologist says, for example, I don't like this. Or the biologist might say, I don't like this. Now we have again to go back to the modeling phase, which is to say, why don't you like this? Or, why don't you like this? This is a modeling question. Because if a biologist tells me that they like this but not like this, I have now to start doing a mathematical reasoning about his so that I turn that like and dislike, I turn it into some mathematical measure. Say, I will define a measure that will give a better value to this, than this, okay? And once we do this, now we can turn to the algorithmic question, and say, now that I have a measure, that gives a score to each possible alignment. Now I'm interested in an algorithm that will find, of all possible alignments, the one that maximizes the score. Okay, and here, just to give an example of one what such a score might be. And this is a real example, I mean, this is what a biologist might use. For example here, for the, for the case of DNA, you can basically describe such a, scoring mechanism by a 5 by 5 matrix. And, every entry here is going to tell us what is the score or penalty of putting a letter with a letter or a dash with a letter. So, for example, the biologist might say, I, I, I like alignments that always match the same letter with the same letters. Like A with A, C with C, T with T, G with G. In terms of modeling, it means that I should give very high score to these cases, like ten, ten, ten, ten. The biologists will tell me that because of evolution and mutation can change the letter, that the biologist is willing to accept T going with C, for example, all right. But they will again still pre, pre, prefer that the sequence are written like this. So, in that case, we can give this is A, A, A going with C, for example, a score that's lower than ten, we can give it four, four, four and so on, okay. And we can use it as symmetric metrics and we can give the same thing. Then the biologists might say, you know, I don't like too many dashes, I'm going to penalize it. I'm going to see if I put dashes, the more dashes you put, the lower the score should get, so we have to penalize these things. For example, saying that if you put any letter with a dash you penalize it by three points, for example. [NOISE] And in some sense, this is minus infinity. I'm not going to allow you to put a dash with a dash. If you put it. The score of that is very bad. That we are not going to accept that alignment. Okay? So now, this is the second phase of the modelling. Which is, how do I get a score for an alignment? Or how do I evaluate an alignment whether it is good or not? By going through the modelling. By saying, the biologist has given, been giving me information about, you know, what, what do they like about A with A, or don't like about dash with C? And I turn it into a mathematical object. because once I have this, now I can say, what is the score of this alignment? Well, the answer is, A with A, that scores ten. C with C scores ten. T with C scores four. T over C scores four. Dash over C scores minus three. And if I use an additive measure, I say it's 20, 28, minus 3, 25. Okay, so now I'm going through this modeling, I have been going through back and forth with a biologist. Taken the notion of an alignment that the biologist was mentioning. It's going through modeling. I understand what is a feasible alignment, what's not feasible, then given all the possible feasible alignment, through modeling, I have some, some mechanism to score this alignment. And then the last step I would go through with algorithmic development phase. Where I can develop an algorithm that will find, of all the possible, feasible alignments, and given the scoring scheme, the one that maximizes the score. Okay. So this is an example that illustrates the importance of modeling. I couldn't have gone to solve the problem for the biologist if I didn't go through this entire process of modelling what is in alignment, what is fine, what is not fine. How do I distinguish among the fine alignments or the feasible ones? And this process of modeling is very mathematical in nature. And this is why we have been teaching you math, to illustrate, to, to be able to use this mathematical tools to go through this process