0:37

very easy to understand and we'll, look at some examples. in a little summary from

basic theoretical computer science, there's a very important theorem called

Kleene's theorem that, was, developed actually here at Princeton in the 30's and

it says that, for any discrete finite state automaton, there exists a regular

expression that describes the same set of strings. And equivalently, for any regular

expression, there's a DFA that recognizes the same set of strings. and this is just,

an example. and if you have seen this, this sort, this sort of thing will be

familiar and if we have, if you haven't you know, and you will understand better

as we get into it. So, a discreet finite state machine is a machine that has states

that are labeled and it has transitions from state to state that are labeled with

characters. And, the way that it works is, if it's in a state and it reads like for

every state is well defined is the next character is zero or one it's another

state to go to. So for example at state zero if you see a zero you stay in state

zero, if you see a one you go to state one. In state one if you see a zero you

stay in state one, if you see a one you go to state two, and so forth. So at every

state. read a character and go to the well defined next state. That's a discreet a

deterministic final state automaton. and so, that's the machine you start it up on

a string, and it reads all of the characters in the string. and then, if it,

ends up in a state that's so called terminal state, and started on the start

state the terminal state is just in dicated by this X that are on this

example. It ends up in the right state you say that it recognizes a set of strings.

and so that's a way to determine if a given string is in a specified set. The

machine specifies the set. In an RE the, we specify the set by writing characters

and stars and meta-characters in parenthesis and this regular expression

recognizes these same set of strings. Describes these set of same, same set of

strings that's recognized by this DFA. Hank Leany's theorem showed that it's

always possible, to, construct such a machine. So that gives a, a basic plan for

developing a pattern matching implementation. and this is developed by

Ken Thompson at Bell labs, one of the developers of UNIX, and, and this facility

was an important part of early UNIX and developed this idea based on the

well-known theorem from theoretical computer science. is let's try to build a

machine, the same way as we did for Knuth-Morris-Pratt, where we have no

backup in the text input string. so we just got a finite state machine. With

Knuth-Morris-Pratt we built a, a machine for recognizing the one string how about

building one for multiple strings and that would give us a linear time guarantee. so

the unruling extractions is determine if it's FSA or DFA and the basic plan is to

just go ahead and take your pattern. It describes a set of strings and use that to

build a DFA and then simulate the DFA with the Texas input, the same way we did for

Knuth-Morris-Pratt And if it accepts, we say we have a match. If it rejects, we

don't have a match. this is a, a fine plan but it's got a flaw. the bad news is that

The plan is infeasible because the number of states in the DFA might be exponential

in the length of the RE. So, for it's got too many states Kleene's, the proof of

Kleene's theorem, standard proof of Kleene's theorem , involves the

exponential number of states. and it's not that difficult to prove if you're

interested be sure to look it up. But from a practical, standpoint, too many states,

you can't use that as a basis for algorithm. But there's an easy revision.

And again, this gets back. It will, it will give us a quadratic time guarantee,

and actually, in, in real life, it's usually linear, and all we do is change

the abstraction to use a non-deterministic finite state machine, an NFA, rather than

a DFA. so, in the same basic plan, we're going to build an NFA. It's a, it's a

different kind of machine, but actually, it's also the case that, that, oh yeah.

For any regular expression we can build on NFA so and in vice versa cleaning stand to

this and so we're going to simulate with the NFA with the text as input. And so,

what do we, what do we mean by non-determined as a finite state machine?

That's what we have to talk about next. and we'll just do it with this example

that we used throughout this lecture. So, It's very similar to the DFA that we have

before, now we're going to put the characters in the states and actually, the

kind of NFA that we're going to build, we're going to have one state for every

character in a regular expression. So this is an NFA, corres-, corresponding to this

regular expression. just, to, We, we always enclose the regular expressions in

parentheses. just to, make everything work. and then, Got this regular

expression here. A star B or ACD. then, we're going show how to build, this NFA.

and then, simulate it, to recognize the regular expression. And, how is it

different than a DFA So, there's a character in every state, and if the

character's a text character, it's the same as before. We read that text

character and moved to the next state. But it's a more general kind of machine,

because states also have what's called epsilon transition. And with an epsilon

transition the machine is allowed to change the state without scanning any

text. So, at the beginning the machine go from zero to one, to two, back to one, I'm

sorry, two to three, back to two or it go from zero to one over to six there's lots

of places the machine could go, without scanning any text character. but we do

have the black matc h transitions that scan text characters and so those are the

rules the machine operates by. And, the, final rule is when does it accept, when

does it decide a strings in the pattern. It accepts if there's any sequence of

transitions that scans all the text characters and ends in the accept case.

it's a way of specifying an infinite set of strings. but it's got this

non-determinism. It's not always determined what the machine, will do next.

It's a little bit of a mind blowing concept in theoretical computer science.

But this particular example actually shows how such, such a concept can be made

concrete and actually give us a widely useful algorithm. One way to think of a

non-deterministic machine is a machine that has super powers and can guess the

proper sequence of state transitions to get to the end. Get to the accept state.

another way to think about is the sequences if you provide us particular

sequences that's a proof that the machine accepts the text. and so this is a real

machine. We don't have a real machines that can guess the right answer but it's a

completely well specified abstract machine, and we can write a program to

stimulate it's operation, and that's what we are going to do. So let's just make

sure that everyone's got the concept down. So, say we have the question is 4A is

followed by BD, is that accepted by this NFA down here? and the answer is yes

because there is a sequence of legal transitions that ends up in state eleven.

So in this case, we'll take an epsilon transition from zero to one to two and

then we'll... We've got 4A's so we'll chew up four A's, one, tw-, I'm sorry. One, and

then go back. Two, and then go back. Three, and then go back. Four, and then

we'll be in state three. And then at that point, we'll decide to take this epsilon

transition. We'll assume your machine decides to take this one. then it

10:35

sequences that the machine might guess and go to the wrong state or stall, it doesn't

matter as long as there's some sequence and we are going to assume that the

machine always guesses the right one. So for example, if the machine just

recognizes three A's, one, two, three, and then went to state four, it would get

stuck because there's no way for it to get out of state four because state four is

looking for a B, and it's sitting on an A, and so forth. So, there's definitely,

things that the machine could do that would be wrong, but we don't care, as long

as there's some way for it to get through. and then, what about if it's a string

that's not in a language recognized by the state. the machine, well, so we have to

argue about all possible sequences and, prove that, no sequence of legal

transition, is transitions ends in state eleven. and that, seems to be, a fairly

complicated sort of argument. So in this case the machine could recognize a bunch

of A's, and then go to state four. But again, there's no B, so there's no way

it's going to get out of state four. and, so you can make a general argument like

looking at the machine that's any string that it recognizes, it has to end in D and

this one doesn't end in D. that's a, much more complicated thing, than we're talking

about, is to try to come up with, a, simple machine that will decide whether or

not a string, is in the, language that it specifies. so the question, so question

that we have is, we have this non deterministic machine, how are we going to

decide whether a given string is in the language that it recognizes. So for

deterministic machines, like we used for Knuth-Morris-Pratt it's very easy, because

at every time, there's only one possible transition. But for non-deterministic, if

you're, in some states, there's, multiple ways to go, and you have to, figure out

the right transition. But actually, the situation isn't so bad. and what we can do

is, to simulate the NFA, is just to make sure that we consider all possible,

transition sequences. And that's what our algorithm for a regular expression pattern

matching is going to be based on. that's what we will look at next.