0:00

. And that's still not the end of the story.

We're gonna look at one more really interesting al-, algorithm that has, lots

of important applications, called a Rabin Karp Algorithm.

Invented by two Turing award winners. Michael Rabin and Dick Karp.

I can remember hearing about this algorithm, from my friend, Dick Lipton,

who explained it to me over the phone, and, I, he explained it to me in about,

fifteen seconds, and I realized I had to have this, in the book.

And, so now, here we are, presenting it. Th, that was, in the 70's.

So the basic idea for the Rabin-Karp algorithm is, has to do with hashing.

In it's a particular kind of hashing, called modular hashing.

It's just a particular way of computing a hash function.

It's easiest to think about in terms of numbers, although it, it works in all

kinds of situations. Because remember everything in a computer

is encoded as a byte, which can be treated as bytes which could be treated as binary

numbers. And so what we are going to do is in this

case We saw a pattern characters are decimal digits.

And so, we'll treat a sequence of pattern characters as the decimal number.

And modular hashing is, just take a big prime.

And compute the remainder when you divide your number by that prime.

So in this case, 613 is the remainder that you get when you divide 26,535 by 997.

So you can check that. So that's what we're going to use as the

hash function. And that.

This type of hashing is widely used. You have a prime number, we talked about

it when we talked about hashing. It satisfies, it seems to satisfy

something like the uniform hash assumption under various circumstances.

So that's our pattern, a five character pattern, and we're going to keep the small

hash values 613 and this is going to generalize to longer patterns, and we'll

talk about that in a minute. So now, suppose we have this text.

And our pattern happens to occur here in the text.

And what the method is built on is the idea of you take the first five characters

in the text and compute its hash value. In this case, 31,415, mod down 97, is 508.

So that's different so that's not the pattern.

Maybe take the next five characters, that's 201. That's diffent It's not the

pattern. Take the next one.

That's 715, different it's not the pattern 15,926 by 97 is 971, it's not the pattern.

Eventually, when you have the text characters that are the same as the

pattern characters you're gonna get the same result, it's a match.

If the pattern hat, hash equals the text sub-string hash you, you have the

potential for a match. And that's what the algorithm is based on.

Now it seems like, we're doing a lot of calculation, with, making numbers out of

these things. And, and keep doing modular arithmetic on

it. But actually there's, a really, simple way

to, severely limit the amount of calculation.

And give a quick linear algorithm for, search.

Sub-string search. So first thing is how to compute the hash

function. So we take the, just convert the Math.

So R's our Radix. So in this example, we're using ten so we

have decimal numbers. And then the digits, say t's of i That's

the text characters. So we have a number x of I, which is the,

4:19

So you know, in this case that's two<i>10000+6<i>1000+5<i>100+3<i>10+5 that's just</i></i></i></i>

Math for that. And our goal is so it's an N digit base

based our integer modular Q And our and our goal is to do the math.

That gives us the remainder that we would get when dividing that by Q well there's

really easy method called Horner's Method that we can use to evaluate a degree in

polynomials just with a multiplied M multiply and add. And we can do the

modular computation all the way through at each step, to keep the numbers less than q

and we still get the same result. And so the idea is, you multiply by R.

You go from left to right through the digits and you just multiply by R and add

the digit, and then do mod q at every time.

So we start with two mod 97 is two. To six mod 987 is two<i>10+6 mod 987 and</i>

that's 26. And then I take that value.

Multiply by ten and add five that's 265 mod 997.

In that case it's, it's 265. So 265<i>10+3 is 2653.</i>

Our remainder is divided by 997, it's 659, so even though our number gets bigger than

997, might take them out every time, we keep our running total less than 997.

And then the last step is to take the 659. Basically we've thrown out a bunch of

multiples of 997 that we don't care about. And 659<i>10+5 mod 997 is exactly equal to</i>

26535 mod 997 and that's 613. That's our value.

6:31

So that's A using Horner's method we got a, well known linear time method to do

compute or hash function with this simple code.

And this notice will work even for a huge key that we wouldn't compute a hundred

dig, convert a hundred digit key in to some number to do the calculation.

We do one digit at a time using Horner's method and then we have no limit because

we're always keeping our numbers less than our prime queue.

So that's a first step, so no matter how big the pattern is,

We can efficiently compute a hash or, since that is the first step.

So now the second step for the Rabin Karp algorithm is to realize that if we know xi

mod q we can efficiently compute xi+1 mod q cause they have a lot of digits in

common. And you can just do a little Math to get

to xi+1 you take. Xi, we don't care about the first digit anymore, so you subtract

it off. Multiply by r and then add the new digit.

That's like one step of Horner's method. Now, then you have to take that

computation and you can do mod q all the way through.

All you have to do is pre-compute r to the n+1 mod q And so here's the computation

for one example. If we're at this position 41592 and we

know 41592 mod q we can compute 15926 mod q by subtracting off 40,000.

The TiR-1 and that gives us just the four digits, multiply by the radix add the new

trailing digit and that's the new value. And if we just keep that all mod q then we

can with just a multiply and an add at each step we can keep a running total of

the Modular hash value of the five digit thing.

So, for example, this is the case that, that we just did 4152 on 997 is, is done

by ex, exactly as we said we subtract and then add and then multiply by the radix

mod 997. So, doing those calculations all the way

through the search, we eventually get to a match.

That's again remarkably small amount of code.

We're going to keep a long random prime. Just keep it a little smaller than the

biggest long value to avoid overflow. So we pre-compute r to the m - one mod q

'cause that's the little calculation that we have to do.

We compute the hash function. And for the pattern.

And then, with those pre-computations, the search is extremely straight forward.

So we take our current hash value. And this is just, add a q make sure it's

positive. And subtract off rm times the first

character in then add in the next character mod q.

And that gives us the text hash for the current position.

And then we compare to see if that's equal to the pattern hash.

Now there's this is an introduction to the idea of randomized algorithms.

There's two ways to proceed from here. One way called Monte Carlo version.

Where we guaratee that the algorithm is gonna be quick but with low probability,

it might get the answer wrong. In that version we don't ever bother to check

whether the go through and check all digits to see if there's actually a match.

11:14

We take queue large enough so that we're confident the probability of the, to, two

digit numbers or two m digit numbers having the same hash value.

And so low that we're not gonna worry about it, that's called the Monte Carlo

version. The Las Vegas version is guarateed to get

the right answer. In that one we would go and check to make

sure that the M characters match if we have a hash match.

And then if it could be that with such at a low probability could be that there's a

hash match but not a substring match. Then we're just.

Move on. And from a theoretical point of view

there's some very extremely low possibility that one could be slow.

But lets look over at what the analysis says.

So the theory says that if you take a sufficiently large random prime say mn^

three So a long value maybe you can get that and remember n is huge.

Then the probability of a false collision is about 1/n. so you know in a billion

things you might get, you might get 1/n, you might get a false collision.

So in practice we choose actually q just to be, there's no reason not to choose as

large as we possibly can, not related to m and n.m And then the probably collision is

going to be about 1/q. So we're going to take it to be like the biggest law, and

that means that probably collision is extremely small.

And then you can take your chances. You can do a Monte Carlo version where you

just say I got a match because I got a hash match.

And be confident in the laws of probability.

And not worry about the client getting the wrong answer.

Or you can have the Las Vegas version where you go, go ahead and return the

correct answer. And. And be confident that your client's

not gonna run into a slow case cuz the probability is so, so tiny, 1/q, that you

don't have to worry about it. That's the Rabin-Karp algorithm.

Now, why look at this algorithm? It's linear time.

We have other algorithms that are linear time.

One of the key reasons to, be interested in Rabin-Karp is that it's easy to extend

it to more complicated situations. So, say you wanna look for one of, several

different patterns. Well, you just compute the hatches for

those patterns, and then look for any one them Use, a symbol table, to look for'em.

So, that's a much more, general capability than, we can provide with the other

methods. It also can be extended to do

2-dimensional search and other things like that.

For straight suffering search, it's gonna be a little slower because, there's,

interloop it's kind of long, the arithmatic operation, are gonna be a

little slow, if you wanna do the Las Vegas version you have to back up the text and

you have this, Monte Carlo Las, Las Vegas thing, and you should think about, writing

code to extend it to look for any one of p possible patterns, thats, an interesting,

Algorithmic puzzle as I mentioned is not so difficult to solve.

So here's our summary. We started with a brute force algorithm,

and although typically you don't have this worst case thing.

It works fairly well for typical cases. And then we've got the Knuth-Morris-Prath

Method that can guarantee linear time and has no backup.

And maybe uses extra space unless you use Pratt's version.

The Aboriamor who can get the running time down to n/m which is quite an amazing jump

and quite useful and in a Rabin Karp that's very flexible and extends through

all these other situations. This is a nice mac, microcosm of

algorithmic technology where, really interesting, and unique and path breaking

algorithmic ideas give us. A good algorithm, for even such a simple

problem. That's an introduction to pattern