0:03

As usual, it's very instructive to take a look at the simplest brute force algorithm

for the problem. We'll look at that in a little detail,

because it illustrates, really what, fundamental issues involved with getting

efficient algorithms for this. And it's also the basis for more efficient

algorithms. So the brute force method, we could give

in a beginning programming class. You have your text, you have an index I

that index as in to text, and you have an index j that indexes into the pattern.

And, start out with both i and j at zero, and you compare text to pattern, and you

keep going until you find a mismatch. If you find a mismatch before the end of

the pattern, then what you do is move the pattern over one position,

That corresponds to simply incremental the text pointer.

And then, here we have a mismatch right away, so we move the pattern over one

position, increment the text pointer. And then starting with I at two, we

compare the second text character with j of zero, the first pattern character.

Increment j to one, increment i to three, we have a mismatch.

Mismatch happens at j and +J, j. the jth character the pattern versus the iJ +

character of the text. Now we have a mismatch, move over.

That's increment i to three, compare the first character of the pattern, j = zero

would be the i plus j pattern of the text. That's a mismatch.

Move over one. So we go first one is a match.

The second one j of i is four compare against five, that's a mismatch.

Move the pattern over one, that's a mismatch.

Finally get to i = position six, and we go for j equals zero, one, two, three, all

matches. When j gets to four, which is the number

of characters in the pattern that's when we know we've found a match.

So, in this table, the entry's engraved, just there for a reference.

The black ones are where when we found matches, and the red ones are where we

found mismatches. So, that's, we do the trace before the

code, because the code is extremely simple.

So this is, to implement brute force substring search, to look for a pattern, a

given pattern, in a given text, or for job as index of, this woud be, on the index of

method in, string, takes the argument pattern. So we get the pattern length, and

we get the text length. And we're going to potentially look at

every, straight, the pattern could start at any position in the text from zero to N

- M. And for every value of i, we've create a

new j and we look for the match between the pattern in position j and the text

character of position + j. If we get all, all the way to j = m, then

we found a match in i. If we get a mismatch, then we get a break

before j = m, and we go and try the next value of i.

And if we get all the way through there without returning, then we just return N,

which is one past the index of the last text character, the length of the text.

And that's an indication that the pattern was not found in the text.

Very straightforward implementation of the pattern match algorithm.

Now, the key point is that, This algorithm is fine in, in many

contexts, and actually it's the one that Java's index of uses.

But the real problem is that it can be slow. Number one,

Just the algorithmic problem, It can be slow if there's a lot of,

repeat, repetitive characters in the text and the pattern.

For example, suppose, that the, text is, all A's and a B, and imagine that there is

a million A's and a B or whatever. And then the pattern also was a smaller

copy of the same thing, all A's and a B. Then what happens in this case is that for

every possible position matching the pattern against the text, we go through

all the pattern characters and only find the mismatch on the last one.

So, And then we find a mismatch, we have to go

over one and try them all and find the mismatch in the last one.

And we eventually do find a match, but for every text character we've looked at

almost M - one pattern characters. So, this is a worst case that shows that

the running time of the brute force algorithm can be proportional to M N,

where M is the pattern length. And the pattern could be long, say that

the pattern could be hundred characters and N can be huge, like a billion and

that's going to be slow, Particularly by comparison to the

alternatives that we're going to look at. Now a more important issue than just the

worst case performance is the idea of backup.

As I mentioned, for lots of applications, if we're going to put our machine in, on

one of the wires of the Internet and watch the input go by or if we just take the

abstract standard input model, you don't get to go, you don't get to back up when

you find a mismatch but the brute force algorithm is always backing up.

If we go through, matching our pattern against our text,

When we find a mismatch, We say we want to move the potential

pattern over one position, But that means backing up in the text.

So, we would that's, ways to deal with that by, saving the most recent M text

characters that we've seen. But it's definitely problematic for larger

patterns and certainly inconvenient. so the brute force algorithm uses backup.

And so you could maintain a buffer as I mentioned,

But what we're going to look at is an ingenious way couple of ingenious ways to

avoid having to do backup. To setup for that,

We're going to look at, A slightly different implementation or

alternate implementation of brute force. It, it,

Compares the same characters as the previous, implementation.

But, it, does things in a slightly different order, without, without a second

for loop. And it does the explicit backup, so let's look at that.

So, we have our i and j pointers, and, and we initialize them both at zero.

And so it's a for loop where, I gets incremented on every iteration through the

loop. And so what we do is, as long as we see a

match we also increment J, so that while the pattern is matching we're implementing

both I and J. When we see a mismatch what we do is just

subtract. The current value of j from i the pointer.

That's the next, character that we have to look at.

So, when we find this mismatch, we wanna, subtract the current value of J.

Then increment I at the end of the four loop.

And that puts us right on the, next, text character that we wanna look at and then

we reset j to zero. So this, does the same, Character

comparisons. But it explicitly shows that we're backing

up in the text. And then, if we ever get to the end of the

pattern then we return I minus M. Which is the position of the first

character in the text that matches the pattern which we found there was M

matches. So it's an alternate implem,

implementation that will come back to. So, the ideas that there are a couple of

out rhythm challenges even though this brute force method, is simple it is not

always good enough. And so the first one is just, from a

purely algorthmic stand point. This is a challenge.

Do we need N M? Or M is the length of the pattern?

Or can we do it in time independent of the length of the pattern?

What we want is a linear time guarantee. And that was a fundamental problem

algorithmic problem that people worried about.

And for a, for a, for a bit and we're going to look at the, how people approach

this fundamental algorithmic problem. And then there's the practical challenge

of we don't, we might not have the room or the time to save the text and actually the

judge might not be happy about us saving text away in some computer.

So we want to avoid backup in the text stream.

All we're supposed to know about is our pattern.

And we're supposed to light the light if we find our pattern and that's it.

So what we're gonna see next is a way to deal with both of those challenges at the

same time.