0:21

One of the interesting things about this RNA

is that it forms what we call secondary structure.

So if we think about RNA it has a sequence of letters over some alphabet.

And one of the interesting things it does inside the cell is that it

forms a secondary structure where it moves onto itself or it falls onto itself.

And one of the interesting questions to ask, or

that biologists are very interested in,

is trying to predict if someone gives me the sequence of the RNA, to try tp predict

how that sequence is going to fold into itself and form that secondary structure.

So this is a very interesting computational problem in biology.

That is, that is,

is of interest to the computational community because we can come up with

algorithms for predicting the structure from primary sequence or from a sequence.

So now we are going to go through this algorithmic thinking exercise to get

from this biological problem all the way to our,

to our computationally cleanly defined problem, and then.

Think about an algorithm to solve it for the biologist.

So the first question when, when the biologist comes to us and

says I want you to predict the constructor for me.

The first question is okay, what is this problem?

What is the input?

What is the output?

He says or she says that the, the input, what I have, the data I have.

Are RNA sequences, okay?

What is an RNA sequence?

1:58

This is an RNA sequence, this is another RNA sequence and so on.

So, okay, it seems like words, like strings.

What we say in computer science.

And it seems that the biologist is using special char letters here.

A, U, C, G.

I am not seeing, I am not seeing any other letters.

The biologist tells me yes.

An RNAs, an RNA sequence has only one,

each one of the letters in that sequence is just one of four letters A, U, C, G.

2:41

What is a string?

What is a string?

We can define that recursively.

Okay?

So I can, if I,

if this is a string, I can, add another letter to it, it will be a string.

I can add another letter, another letter and so on to it and it will be a string.

So if I, if, if this is my segment.

Okay? Big Sigma which is the alphabet.

I can define the set of strings recursively if I denote by Sigma* the set

of all strings over an alphabet, I can think about the base case.

3:47

W sigma is a string where sigma is in sigma.

Alright?

So if I want to apply this definition to this, I say okay, the,

if I want to start listing the elements in this, I say okay, the first one.

Base case tells me that the empty string is there.

Then I can apply the recursive step that states if you look,

if you take a string from here and start appending letters to it from sigma,

you will get, you'll get a string there.

So this notation here is what we call concatenation, and

I don't need to go into formal definition here.

Think about it as if I give you two strings, you just.

Glue them together.

Or here I'm talking about one symbol.

I append it at the end to the string.

So if I take a string and ap, if empty string.

A special case is that if you take the empty string an append any letter to it,

you just get that letter, okay?

So I can take the empty string and append A to it, I get A.

I append U, I get U.

I append C or G.

Then I can take these and I start appending letters to them.

AA, AU, AC, AG, and so on.

This is an infinite set.

I can create an infinite set.

But this is the finite representation of that set.

It is a recursive definition of this set, or

how the elements in this set are formed.

Okay?

So now I know that what the biologist is talking about is the set

of strings over this alphabet that I can define mathematically like this.

Okay?

So now I am starting to zoom in on what the biologist has in terms of data.

The biologist is telling me that the data they have are these kind of sequences.

5:30

Then I say okay, I understand now what, what the input is.

You are going to give me a sequence like this, okay?

You are going to give me a sequence and I understand what it is,

I understand what the letters are, I understand how it, how it is formed, okay?

Then say what is the output, what is it that you want from me?

He says I want you to predict how this sequence is going to fold.

On itself.

So what does it mean to fold?

It said you take take letters from here and basically match them, or

you can take this sequence and draw it such that you can,

for example, match this with this, this with this, this with this, this with this.

So in this case here, what happened is that we have A,A.

6:12

A ,A, U.

U, A, C, C, G.

Okay, so the sequence followed like this.

So this letter went with this letter, this went with this,

this with this and this with this.

So I go to the biologist.

I say.

Okay, so this is what you mean, the biologist says something similar to

this but not exactly, what don't I then go back to the biologist, what is it th at

you don't like about this, he says, the biologist says, for one thing, for one

thing, when I look at the letters that you matched together, A has to go with a U.

And C has to go with the G.

Nothing else.

I cannot put A with C, A with G, U with C, U with G, and so on.

Either A goes wi, A with U, or C with G, nothing else.

So if I look at this, the biologist doesn't like it.

A went with G, A went with C, U with C, this is only one actually,

that the biologist might like, U with A.

7:19

One other thing is that the biologist said that, I have a sharp turn here.

That, you know it's like I'm breaking the sequence and folding it under itself.

The biologist says that this doesn't happen,

they haven't observed that in biology.

What they've observed is when a sequence folds in on itself.

It actually forms a loop here.

So some letters do not match with each other.

Okay?

So, and I say okay what is the size of this loop or

what is the minimum size of this loop?

The biologist says that if you match if you

match a symbol with another symbol here.

They have to be separated by at least four positions, so

if I match this with this, I need to count 1, 2, 3, 4, so this is fine,

matching this the second with the seventh here is fine, okay?

So this is 1, 2, 3, 4, 5, 6, 7, 8.

It is fine to match 2 with 7,

it is not fine to match 4 with 5 because these two are not separated.

By at least four positions so basically the biologist says if we match

positions i and j,

then i has to be smaller than j minus 4, right?

So if I look at this is 2 and 7, 7 minus 7 minus 4 is, is 3.

So this is 2.

This is 3.

So it is true.

If the, if, if j was 6.

Okay?

So 2 smaller than which is 4.

So I cannot match 2 with 6.

So if I match i with j, i has to be smaller than j minus 4.

Okay?

So we have this which the biologist is calling [SOUND] no sharp turns.

[SOUND] And then here, this is complementarity.

[SOUND] Okay?

[SOUND] I say all of these are only conditions, the biologist says no.

The third one is [SOUND].

That this needs to be some sort of a matching.

The biologist tells me I cannot match the same letter more than once.

I cannot say this A is going to match to this G and

this A is going to match with C.

No. If I match this A to

this G I am done with this A.

i cannot match it with anything else.

What this means that.

If IJ is a matching pair of

positions, then INJ do not participate.

9:56

In any other pair.

Okay?

So this is condition number 3.

So biologists has told me A with U, C with G.

Nothing else.

They say that each two positions I put in the matching,

they have to be separated by at least four positions.

The third one that the biologist tells me is that if I choose to

match this a with this g.

This a and this g are taken, I cannot use them again.

Okay is this, is this all?

10:48

So this, the biologist says this cannot happen.

Or in other words, if I have I, J, if I tell the biologist I'm

going to match position I with position J in the secondary structure.

And I match k with with m, but

I will just say that it cannot be that i is smaller than k is smaller than j, okay?

They have to be either by smaller than j, something like this, okay?

11:20

This is fine, or it can be i smaller than k smaller than l smaller than

j which is i j and we have k l, okay?

So now the biologist has told me four conditions.

That A goes with U.

C with G.

The biologist told me that i and j.

If I tell the biologist here is your secondary structure.

I am giving you the pairs of positions that match.

The i and j should be separated by at least four positions.

The third one that if i match, if i no element,

no position appears in more than one pair.

It can either appear in one pair or none.

And the, the fourth one is that, I cannot have this crossing situation, okay?

It's either ij encompasses kl, or they are disjoint, okay?

So now, the biologist has defined for

me, the feasibility conditions, that when I take as an input, a sequence.

Which I understand what it is,

a sequence that is an RNA, a string over the alphabet A,U,C,G.

The biologists want back from me a set of pairs, you know,

IJ, KL, MN and so on, where each pair here is, is, two positions in the sequence.

To positions in the sequence, such that the letters at these twp positions for

every pf, the letters at the positions have to be either e n u or c n g,

nothing else, such that, such that for every pl, if I look at i,

n, j, k, l m and n, each pair is separated by at least four, so i is

smaller than j minus 4, k is smaller than l minus 4, m is smaller than n minus 4.

So the third condition is that if I look at these pairs, they don't share

any element in common, and the fourth one is that I have this property, that for

every two pairs, they are either disjoint like this or they are one with,

one is contained within the other.

Okay?

So now I understand what is the input.

Which is a, a string over A,U,C,G.

And I understand what is the output, it's a set of pairs with some conditions.

Okay? With some constraints.

So now we have defined the problem in terms of feasibility.

13:49

Then, is this sufficient for

me to answer, to answer the question that the biologist want?

The answer is no, because I can find, it might be the case that there will be

tons of ways, tons of such sets that will satisfy this property.

For example.

The most trivial answer, that I can give to the biologist is, the empty set.

If I give the biologist the empty set, it is satisfying this, it has no pairs, so

I am not going to violate this.

It has no pairs, I'm not going to violate this.

It has no pairs, I'm not going to violate this.

It has no pairs, I'm not going to violate, violate this property.

So I can tell the biologist I have a trivial solution for you.

14:27

If you give me a pri, you give me an RNA sequence.

I am going to give you an empty set.

It satisfies all the conditions.

Then the biologist revised their answer.

They said, oh I would want of all these possible solutions that you

can give to me.

I would want the one that has the maximum numbers of pairs.

Okay?

So if I give the empty set the biologist says I'm not happy with that because I

can find another set that has at least one pair.

And then someone else can come with another solution that satisfies all these

four conditions, but has two pairs.

And someone can come to 3, and 4, and 5, and so on.

So now, I understand what is the problem exactly that the biologist wants to solve.

The biologist wants to give me an RNA string, a string over A,U,C,G.

The biologists want a set of pairs, where each pair is a,

a pair of indices from that string, such that I satisfy these properties.

These are feasibility conditions.

And the biologist has also some optimality criterion in mind.

That optimality criterion is that.

The biologist would want the set with a maximum possible number of such

peers/g; that via, that, that, satisfy this constraint.

And this is our [INAUDIBLE] structure problem in our case okay?

So it is a problem where we're given a, a, s,

string over A,U,C,G we want to get back the output is is a set of these peers.

This set of pairs, we have to satisfy feasibility conditions,

the four conditions we listed here, and it has also to satisfy some of the criterion,

that over all possible sets of pairs that satisfy these conditions,

we would like to give a set that has the maximum number of pairs.

Okay? This is the RNA secondary

structure problem.