In this module, we're gonna talk about a new concept called collision resistance,

which plays an important role in providing message integrity. Our end goal is to

describe a very popular MAC algorithm called HMAC, that's widely used in

internet protocols. HMAC itself, is built from collision resistant hash functions.

Before we do that, let's do a quick recap of where we are. In the last module we

talked about message integrity where we said that the MAC system is secure if it's

existentially unforgeable under a chosen message attack. This means that even an

attacker who is given the tag on arbitrary messages of his choice cannot construct a

tag for some new message. And then we showed that in fact any secure PRF

immediately gives us a secure MAC. And so then we turned around and said well can we

build secure PRFs that take large messages as inputs? And so we looked at four

constructions. The first construction was based on CBC, we call it when we looked at

two variants of it, one called encrypted CBC and one called CMAC. And we said that

these are commonly used with AES. In fact, in the 802.11i standard, a CBC-MAC is used

for message integrity. In particular, with the AES algorithm. We looked at another

mode called NMAC, which also converts a PRF for short inputs into a PRF that's

capable of taking very large messages as inputs. And these two were both sequential

MAC-s. We then looked at the parallelizable MAC called PMAC which again was able to

convert a PRF for small inputs into a PRF for very large inputs. But it did so in a

parallel fashion so a multiprocessor system would be more efficient with PMAC

than, say, with CBC-MAC. All three of these built a MAC for large messages by

constructing a PRF for large messages. And then we looked at the Carter-Wegman MAC

which is actually not a PRF. It's a randomized MAC so a single message could

actually have many, many different valid tags and therefore a Carter-Wegman MAC is

actually not a PRF. And if you remember, the Carter-Wegman MAC was built by

first of all, taking the bulk message and hashing it down to a small tag using a

fast one-time MAC and then encrypting that tag using a PRF. The benefit of the

Carter-Wegman MAC was that, as we said, the hashing of the bulk message is done using

a fast one time MAC. And then in this module, we're gonna construct MAC-s from

this new concept called collision resistance. And so the first thing we're

gonna do is construct collision resistant hash functions. So let's first of all

start by defining what does it mean for a hash function to be collision resistant.

So think of a hash function from some message space to a tag space T, and you

should be thinking of the message space as much, much bigger than the tag space. So

the messages could be gigabytes long, but the tags would only be like 160 bits. Now

a collision for the function H is a pair of messages m0, m1, that happen to be

distinct. However, when you apply the function H to them, you end up with the

same output. So the image you should have in your mind is essentially there are two

inputs, m0 and m1, and they belong to this gigantic message space. However, when we

apply the hash function to them, they happen to collide and they both map to the

same output in the tag space. Now we say that the function is collision resistant

if it's hard to find collisions for this function. Now this should seem a little

bit counterintuitive because. We know that the output space is tiny compared to the

input space. So, by the pigeonhole principle, there must be lots and lots and

lots of messages that map to the same output. Just because there isn't enough

space in the output space to accommodate all the messages without collisions. And

so, we know that there are lots of collisions, and the question is, is there

an efficient algorithm that finds any such collisions explicitly. So we say the, the

function is collision resistant, if, for all explicit efficient algorithms A. And

these algorithms are not able to print the collision for the function H, okay? And as

usual, we'll define the advantage as the probability that the algorithm A is able

to output a collision. Now I'm not gonna formalize a term explicit here. All I'll

say is that it's not enough to just say that an algorithm exists, and algorithm

that simply prints a collision. Cause we know many collisions exist. What we

actually want is an explicit algorithm that we can actually write down and run on

a computer to generate these collisions. There are ways to formalize that, but I'm

not gonna do that here. Now a classic example of a collision resistant hash

function is SHA-256 which happens to output at 256 bits but can take

arbitrary large input. For example, it can take gigabytes and gigabytes of data and

it will map it all to 256 bits. And yet nobody knows how to find collisions for

this particular function. So just to show you that this concept of collision

resistance is very useful, let's see a very quick application for it. In

particular, let's see how we can trivially build a MAC given a collision resistant

hash function. So, suppose we have a MAC for short messages. So you should be

thinking something like AES, which can MAC sixteen byte messages. And then, suppose

we have a hash function, a collision resistant hash function from a large message

space, that contains gigabyte messages into our small message space. Say, into

sixteen byte outputs. Then, basically, we can define a new MAC, let's call it Ibig,

which happens to be MAC-ing large messages. And we'll define it simply by

applying the small MAC to the output of the hash function. And how do we verify a

MAC? Well, basically, given a tag we verify it by rehashing the given message

and then checking that small MAC actually verifies under the given tag. Okay, so

this is a very simple way to show you how collision resistance can take a primitive

that's built for small inputs and expand it into a primitive that's built for very

large inputs. And in fact we're going to see this not just for MAC-s. Later on when

we talk about digital signatures, we're going to do the same thing. We're going to

build a digital signature scheme for small inputs and then we're going to use

collision resistance to expand the input space and make it as large as we want. So

the security theorem basically isn't something that's trivial here. Basically

it says if the underlying MAC is secure and H is collision resistant, then the

combination which can actually MAC large messages, is also a secure MAC. And as

a quick example, let's apply this to AES. So let's use the one example that we

mentioned, SHA-256. So in this case SHA-256 outputs 256 bit outputs,

which is 32 bytes. So we have to build a MAC that can MAC 32 byte messages. And the

way we could do that is basically by applying the sixteen byte AES, plugging it

into a two block CBC. A two block CBC would expand AES from a PRF on sixteen

bytes to a PRF on 32 bytes. And then take the output of SHA-256 and plug it into

this two block CBC based on AES. And then we get a very, very simple, MAC which is

secure assuming AES is a PRF and SHA-256 is collision resistant. So one thing I

wanted to point out is that in fact collision resistance is necessary for this

construction to work. So in fact, collision resistance is not just a made-up

term. Collision resistance really is kind of the essence of why this combined MAC is

secure. And so let's just assume for a minute that the function H, the hash

function we're using, is not collision resistant. In other words, there is an

algorithm that can find two distinct messages that happen to map to the same

output. In this case, the combined MAC. Is not going to be secure because what the

adversary can do is simply use a chosen message attack to get a tag for m0. And

then output m1 comma that tag as a forgery and indeed T is a valid MAC for m1 because

H(m1) happens to be equal to H(m0). And so in doing so just with a one chosen

message attack the attacker was able to break this combined MAC simply because the

hash function was not collision resistant. So it should be, again I want to emphasize

that if someone advertised even one collision for SHA-256, you know two

messages, just one pair of messages that happen to have the same output under

SHA-256 that would already make this construction insecure cause an attacker

could then ask for the tag on one message and in doing so he would obtain the tag on

the other message as well, and that would count as an existential forgery. Okay, so

already, we see the collision resistance is a very useful primitive, in that it

lets us expand the input space of cryptographic primitives. I wanna show you

one more application where a collision resistance is directly used for message

integrity. Imagine again, we have files that we wanna protect. Let's imagine these files

are actually software packages that, we wanna install on our system. So here are

three different software packages. You know, maybe one is GCC, on is Emacs, and

another one is, I don't know, VI. Now the user wants to download the software

package and he wants to make sure that he really did get a version of the package

that he downloaded and it's not some version that the attacker tampered with

and modified its contents. So what he could do is basically refer to a read-only

public space that's relatively small. All it has to do is hold small hashes of these

software packages. So there isn't a lot of space needed here. The only requirement is

that this space is read-only. In other words, the attacker cannot modify the

hashes stored in this space. And then, once he consults this public space, he can

very easily compute the hash of a package that he downloaded. Compare it to the

value in the public space. And if the two match. Then he knows that the version of

the package he downloaded is, in fact, a correct one. Why is that true? Well,

because the function H is collision resistant. The attacker cannot find an F1

path, they would have the same hash as F1. And as a result, the attacker cannot

modify F1 without being detected because there's no way that the hash of his F1

[inaudible] would map to the value that's stored in the public space. So, the reason

I brought up this example is, I wanted to contrast this with the MAC example. We

kinda saw a similar situation with MAC-s. In the MAC example, we needed a secret key

to verify the individual file tags. But we didn't need a resource that was a read

only public space. With collision-resistant hashes, we kind of get

the exact compliment where in fact we don't need a key to verify. Anyone can

verify. You don't need a secret key for it. However, now all of a sudden we need

this extra resource which is some space that the attacker cannot modify. And, in

fact, later on, we're gonna see that with digital signatures, we can kind get to the

best of both worlds, where we have both public verifiability, and we don't need a

read only space. But so far, with either MAC-s or collision resistance, we can have

one, but not the other. So, I'll tell you that, in fact, this kind of scheme is very

popular. In fact, Linux distributions often use public space where they

advertise hashes of their software packages. And anyone can make sure that

they downloaded the right software package before installing it on their computer. So

this is, in fact, something that's used quite extensively in the real world. Okay,

so in the next segment we'll talk about generic attack on collision resistance and

then we'll go ahead and build collision resistant hash functions.