Let's begin by building up some intuition about what we would want from a hash function, now that we know how hash functions are usually implemented. So let's start with a hash table that's implemented with chaining. What's going to be the running time of our lookup, insert, and delete operations in a hash table with chaining? Well, the happy operation in a hash table with chaining is insertion. Insertion, we can just say without any qualifications, is constant time. This requires the sort of obvious optimization that when you do an insert, you insert the new object at the front of the list in its bucket. There's no reason to insert at the end; that would be silly. So the plot thickens when we think about the other two operations, deletion and lookup. Let's just think about lookup; deletion is basically the same. So how do we implement lookup? Well, remember, when we get some key x, we invoke the hash function. We call h(x), and that tells us what bucket to look in. So if it tells us 17, we know that x may or may not be in the hash table, but at this point we know that if it is in the hash table, it's got to be in the linked list in the seventeenth bucket. So now we descend into this bucket, we find ourselves a linked list, and we have to resort to an exhaustive search through the list in the seventeenth bucket to see whether or not x is there. We know how long it takes to search a regular list for some element: it's just linear in the list length. And now we're starting to see why the hash function might matter. Suppose we insert 100 objects into a hash table with 100 buckets. If we have a super lucky hash function, then perhaps each bucket will get its own object. There'll be one object in each of the lists, in each of the buckets, so Theta of the list length is just Theta of 1, and we're doing great. Constant-length lists mean constant-time lookup and delete. A really stupid hash function would map every single object to bucket number zero. Then if you insert 100 objects, they're all in bucket number zero, the other 99 buckets are empty, and every time you do a lookup or delete, you're just resorting to the naive linked list solution, and the running time is going to be linear in the number of objects in the hash table. So the longest list length could vary anywhere from m/n, where m is the number of objects and n is the number of buckets (that's when you have equal-length linked lists), up to m, all the objects in the same list, if you use this ridiculous constant hash function. And so the main point I'm trying to make here is that, first of all, at least with chaining, where the running time is governed by the list length, the running time depends crucially on the hash function: on how well the hash function distributes the data across the different buckets. And something analogous is true for hash tables that use open addressing. Here there aren't any lists, there are no linked lists to keep track of, so the running time is going to be governed by the length of the probe sequence. The question is how many times you have to look at different buckets in the hash table before you either find the thing you're looking for, or, if you're doing an insertion, before you find an empty bucket in which to insert the new object. So the performance is governed by the length of the probe sequence, and again, the probe sequence is going to depend on the hash function.
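To make the chaining picture concrete, here is a minimal sketch in Python of a hash table with chaining. The class and method names are mine, not from the lecture, and plain Python lists stand in for the per-bucket linked lists.

```python
class ChainedHashTable:
    """Minimal hash table with chaining; the bucket count n is fixed for simplicity."""

    def __init__(self, n=100):
        self.n = n
        self.buckets = [[] for _ in range(n)]  # one list per bucket

    def _h(self, x):
        # Placeholder hash function; choosing a good one is the subject of this lecture.
        return hash(x) % self.n

    def insert(self, x):
        # Insert at the front of the bucket's list (a real linked list makes this O(1)).
        self.buckets[self._h(x)].insert(0, x)

    def lookup(self, x):
        # Exhaustive search of the one relevant bucket: time is linear in that list's length.
        return x in self.buckets[self._h(x)]

    def delete(self, x):
        # Also linear in the list length: find the element, then remove it.
        bucket = self.buckets[self._h(x)]
        if x in bucket:
            bucket.remove(x)
```

The point to notice is that insert only touches the front of one bucket, while lookup and delete scan the whole list in that bucket, which is why the list lengths govern the running time.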
For really good hash functions, in some sense, ones that spread the data out evenly, you expect probe sequences to be not too bad, at least intuitively. And for, say, the constant hash function, you'd expect these probe sequences to grow linearly with the number of objects you insert into the table. So again this point remains true: the performance of a hash table, in either implementation, really depends on what hash function you use. So, having built up this intuition, we can now say what it is we want from a hash function. First, we want it to lead to good performance. Using the chaining implementation as a guiding example, we see that if the size of the hash table, n, is comparable to the number of objects, m, it would be really cool if all of the lists had a length that was basically constant, so that we got our constant-time operations. So equal-length lists are way better than unequal-length lists in a hash table with chaining. What we want the hash function to do is spread the data out as evenly as possible amongst the different buckets. And something similar is true with open addressing: in some sense you want the hash function to spread the data uniformly across the possible places you might probe, as much as possible. In hashing, the usual gold standard for spreading data out is the performance of a completely random function. You can imagine that for each object that shows up, you flip some coins, and with each of the n buckets equally likely, you put the object in one of the n buckets, flipping independent coins for every different object. Because you're just throwing darts at the buckets independently, you'd expect this to spread the data out quite well. But, of course, it's not good enough just to spread data out. It's also important that we don't have to work too hard to remember what our hash function is and to evaluate it. Remember, every time we do any of these operations, an insert or a delete or a lookup, we're going to be applying our hash function to some key x. So every operation includes a hash function evaluation, and if we want our operations to run in constant time, evaluating the hash function had also better run in constant time. This second property is why we can't actually implement completely random hashing. There's no way we can actually, each time we want to insert, say, Alice's phone number, flip a new batch of random coins. Suppose we did. Suppose we flipped some random coins and they tell us to put Alice's phone number into the 39th bucket. Later on, we might do a lookup for Alice's phone number, and we'd better remember the fact that we're supposed to look in the 39th bucket for it. But what does that mean? That means we have to explicitly remember what choice we made. We have to write down, somewhere, the fact that Alice is in bucket number 39. And since every single insertion comes with its own independent coin flips, you have to remember all of the different random choices separately. That really just devolves back to the naive list-based solution that we discussed before: evaluating the hash function is going to take us linear time, and that defeats the purpose of a hash table. So again we want the best of both worlds. We want a hash function which we can store in ideally constant space and evaluate in constant time, but which spreads the data out just as well as if we had this gold standard of completely random hashing.
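To see why completely random hashing is not actually implementable, here is a toy sketch, with made-up names, of what it would have to do: flip fresh coins for each new key and then remember every choice it ever made.

```python
import random

class CompletelyRandomHashing:
    """Illustration only: 'completely random' hashing must remember every coin flip."""

    def __init__(self, n=1000):
        self.n = n
        self.assignments = []  # list of (key, bucket) pairs we've already flipped coins for

    def h(self, x):
        # If we've seen x before, we must recall the exact bucket we chose for it...
        for key, bucket in self.assignments:   # ...which is itself a linear-time search,
            if key == x:                       # exactly the naive solution that hashing
                return bucket                  # was supposed to replace.
        bucket = random.randrange(self.n)      # fresh, independent coin flips for a new key
        self.assignments.append((x, bucket))
        return bucket
```

The table of remembered choices is itself the kind of lookup structure we were trying to build in the first place, so in this naive form every hash evaluation takes linear time.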
So I want to touch briefly on the topic of how you might design hash functions, and in particular good hash functions that have the two properties we identified on the previous slide. But I have to warn you: if you ask ten different serious, hardcore programmers about their approach to designing hash functions, you're likely to get ten somewhat different answers. The design of hash functions is a tricky topic, and it's as much art as science at this point, despite the fact that there's a ton of science, actually a very beautiful theory, about what makes good hash functions. We'll touch on a little bit of that in a different optional video. And if you only remember one thing from this video, or from these next couple of slides, the thing to remember is the following: it's really easy to inadvertently design bad hash functions, and bad hash functions lead to poor hash table performance, much poorer than you would expect given the other discussion we've had in this video. So if you have to design your own hash function, do your homework, get some examples, learn what other experts are doing, and use your best judgment. If you just do something without thinking about it, it's quite possible to wind up with quite poor performance, much poorer than you were expecting. So to drive home this point, suppose you're thinking about keys being phone numbers. Let's say, I'm going to be very United States centric here, I'm just going to focus on the ten-digit phone numbers inside the US. So the universe size here is ten to the ten, which is quite big; that's probably not something you really want to index explicitly. And let's consider an application where you're only keeping track of, at most, a few hundred or a thousand phone numbers or so. So we need to choose a number of buckets. Let's say we're expecting no more than 500 phone numbers or so; we double that, and we get the number of buckets n equal to 1,000. And now I've got to come up with a hash function. Remember, a hash function, by definition, just maps anything in the universe to a bucket number. So it has to take as input a ten-digit phone number and spit out as output some number between zero and 999, and beyond that we have flexibility in how to define this mapping. Now, when you're dealing with things that have all these digits, it's very tempting to just project onto a subset of the digits. And if you want a really terrible hash function, just use the most significant digits of a phone number to define a mapping from phone numbers to buckets. I hope it's clear why this is a terrible choice of a hash function. Maybe you're a company based in the San Francisco Bay Area. The area code for San Francisco is 415. So if you're storing phone numbers from customers in your area, maybe 20, 30, 40 percent of them are going to have area code 415. All of those are going to hash to exactly the same bucket, bucket number 415 in this hash table. So you're going to get an overwhelming percentage of the data mapping to just this one bucket. Meanwhile, not all 1,000 possibilities for these three digits are even legitimate area codes; not all three-digit numbers are area codes in the United States. So there'll be buckets of your hash table which are guaranteed to be totally empty.
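As an illustration of the clumping, here is a tiny experiment in Python; the phone numbers are invented for the example.

```python
from collections import Counter

def terrible_hash(phone):
    """Terrible: bucket a 10-digit phone number by its first three digits (the area code)."""
    return int(str(phone).zfill(10)[:3])

# Hypothetical Bay Area customer list: area codes cluster heavily around 415 and 650.
customers = ([4155550100 + i for i in range(40)]
             + [6505550100 + i for i in range(35)]
             + [2125550100 + i for i in range(25)])

loads = Counter(terrible_hash(p) for p in customers)
print(loads)  # Counter({415: 40, 650: 35, 212: 25})
```

Three buckets absorb all 100 customers and the other 997 stay empty, exactly the kind of skew that ruins lookup time.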
So you waste a ton of space in your hash table, you have a huge list in the bucket corresponding to 415, you have a huge list in the bucket corresponding to 650, the area code at Stanford, and you're going to have a very slow lookup time for everything that hashes to those two buckets, and there's going to be a lot of stuff that hashes to those two buckets. So, a terrible idea. A better, but still mediocre, hash function would be to do the same trick but using the last three digits instead of the first three. This is better than our terrible hash function because there aren't ridiculous clumps of phone numbers that have exactly the same last three digits. But still, using this hash function is tantamount to assuming that the last three digits of phone numbers are uniformly distributed among all 1,000 possibilities, and there's really no evidence that that's true. There are going to be patterns in phone numbers that are maybe a little subtle to see with the naked eye, but which will be exposed if you try to use a mediocre hash function like this. So let's look at another example. Perhaps you're keeping track of objects just by where they're laid out in memory; in other words, the key for an object is just going to be its memory location, and if objects are word-aligned, you're guaranteed that every memory location will be a multiple of four. So for a second example, let's think about a universe where the possible keys are the possible memory locations. Here you're just associating objects with where they're laid out in memory, and the hash function is responsible for taking as input the memory location of some object and spitting out some bucket number. Now generally, because of how objects are laid out, memory locations are going to be multiples of some power of two; in particular, memory locations are going to be even. And so here's a bad choice of a hash function. Remember, the hash function takes as input the memory location, which is some possibly really big number, and we want to compress it; we want to output a bucket number. Let's think of a hash table where we choose n equal to 10^3, or 1,000 buckets. So the question is, how is this hash function going to take this big number, the memory location, and squeeze it down to a small number, namely one of the buckets? Let's just use the same idea as in the mediocre hash function, which is to look at the least significant digits, and we can express that using the mod operator. So let's pick the hash function h(x), where x is the memory location, to be x mod 1,000. Again, the meaning of 1,000 is that it's the number of buckets we've chosen to put in our hash table, because we're going to remember roughly at most 500 different objects. Don't forget what the mod operation means: you essentially subtract multiples of 1,000 until you get down to a number less than 1,000. So in this case, if you write out x in base ten, you just take the last three digits. In that sense, this is the same hash function as our mediocre hash function from when we were talking about phone numbers. Now, we discussed how the keys here are all going to be memory locations, and in particular they'll be even numbers, and here we're taking them modulo an even number. And what does that mean?
That means every single output of this hash function will itself be an even number. Right? You take an even number, you subtract a bunch of multiples of 1,000, and you still have an even number. So this hash function is incapable of outputting an odd number. What does that mean? It means at least half of the buckets in the hash table will be completely empty, guaranteed, no matter what keys you're hashing. And that's ridiculous. It's ridiculous to have a hash table 50 percent of which is guaranteed to be empty. And again, what I want you to remember, hopefully long after this class is completed, is not so much these specific examples but the general point I'm making, which is that it's really easy to design bad hash functions, and bad hash functions lead to hash table performance much poorer than what you're probably counting on. Now that we're equipped with examples of bad hash functions, it's natural to ask what some good hash functions are. Well, it's actually quite tricky to answer that question, and I'm not really going to answer it on this slide. I don't promise that the hash functions I'm going to tell you about right now are good in a very strong sense of the word. I will say they're not obviously bad hash functions; they're, let's say, somewhat better hash functions. In particular, if you just need a hash function, a quick and dirty one, and you don't want to spend too much time on it, the method I'm going to talk about on this slide is a common way of doing it. On the other hand, if you're designing a hash function for some really mission-critical code, you should learn more than what I'm going to tell you about on this slide: do more research about what the best hash functions are and what the state of the art is. But if you just need a quick one, what we say on this slide will do in most situations. So the design of a hash function can be thought of as two separate parts. Remember, by definition a hash function takes as input something from the universe, an IP address, a name, whatever, and spits out a bucket number. But it can be useful to factor that into two separate steps. First you take an object which is not intrinsically numeric, something like a string or something more abstract, and you somehow turn it into a number, possibly a very big number. Then you take that possibly big number and map it to a much smaller number, namely the index of a bucket. In some places these two steps are given names: the first step is computing the hash code for an object, and the second step is applying a compression function. In some cases, you can skip the first step. For example, if your keys are social security numbers, they're already integers; if they're phone numbers, they're already integers. Of course, there are applications where the objects are not numeric. For example, maybe they're strings, maybe you're remembering names. And then the production of this hash code basically boils down to writing a subroutine that takes as input a string and outputs some possibly very big number. There are standard methods for doing that; it's easy to find resources that give you example code for converting strings to integers, and I'll just say one or two sentences about it.
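Here is a quick way to see the "half the buckets are guaranteed empty" phenomenon numerically; the addresses are made-up, word-aligned values.

```python
# Made-up word-aligned "memory locations": all multiples of 4, hence all even.
addresses = [0x1000 + 4 * i for i in range(500)]

n = 1000                       # even number of buckets
buckets_hit = {addr % n for addr in addresses}

print(len(buckets_hit))                         # at most 250 of the 1,000 buckets are used
print(any(b % 2 == 1 for b in buckets_hit))     # False: no odd bucket is ever hit
```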
Each character in a string is easy to regard as a number in various ways; for example, you can just take its ASCII code. Then you have to aggregate all of the different numbers, one number per character, into some overall number. One thing you can do is iterate over the characters one at a time and keep a running sum: with each character, you multiply the running sum by some constant, add the new character's code to it, and then, if you need to, take a modulus to prevent overflow. The point of giving you this one-to-two-sentence description of the subroutine is just to give you a flavor of what these are like, and to make sure that you're not scared of them at all. These are very simple programs you can write for doing things like converting from strings to integers. But again, I do encourage you to look it up on the Web or in a programming textbook, to actually look at those examples; there are standard methods for doing it. So that leaves the question of how to design the compression function. You take as input this huge integer; maybe your keys are already numeric, like Social Security numbers or IP addresses, or maybe you've already used some subroutine to convert a string, like your friend's name, into some big number. The point is you have a number in the millions or the billions, and you need to somehow take that and output one of these buckets, and again, think of there being maybe a thousand or so buckets. The easiest way to do that is something we already saw on a previous slide, which is just to take the modulus with respect to the number of buckets. Now, certainly one positive thing you can say about this compression function is that it's super simple, both in the sense that it's simple to code and in the sense that it's simple to evaluate. Remember, that was our second goal for a hash function: it should be simple to store, and here there's actually nothing to store, and it should be quick to evaluate, and this certainly fits the bill. The problem is, remember, the first property we wanted from a hash function is that it should spread the data out evenly. What we saw on the previous slide is that if you choose the number of buckets n poorly, you can fail to have that first property, and in that respect fail to have a good hash function. For example, if n is even and all of your objects are even, then it's a disaster: all of the odd buckets stay completely empty. And honestly, this is a pretty simplistic method. Like I said, this is a quick and dirty hash function, so no matter how you choose the number of buckets n, it's not going to be a perfect hash function in any sense of the word. That said, there are some rules of thumb for how to pick the number of buckets, how to pick this modulus, so that you don't get the problems we saw on the previous slide. So I'll conclude this video with some standard rules of thumb. If you just need a quick and dirty hash function and you're going to use the modulus compression function, how do you choose n? Well, the first thing is that we definitely don't want the problem we had on the last slide, where we're guaranteed to have empty buckets no matter what the data is. So what went wrong on the previous slide? The problem is that all of the data elements were divisible by two, and the hash function's modulus, the number of buckets, was also divisible by two.
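Here is a minimal sketch of the running-sum idea just described, plus the modulus compression function. The multiplier, overflow limit, and bucket count are arbitrary choices for illustration, not recommendations.

```python
def hash_code(s, multiplier=31, limit=2**32):
    """Turn a string into a (possibly big) number: multiply-and-add, one character at a time."""
    total = 0
    for ch in s:
        # Multiply the running sum by a constant, add the character's code,
        # and reduce mod a limit to keep the number from overflowing.
        total = (total * multiplier + ord(ch)) % limit
    return total

def compress(code, n):
    """Compression function: map a big hash code down to a bucket index in [0, n)."""
    return code % n

n = 997  # number of buckets; a prime, for the reasons discussed next
bucket = compress(hash_code("Alice"), n)
```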
So because they shared a common factor, namely two, all of the odd buckets were guaranteed to remain empty. And this is a problem more generally: if the data shares any common factor with n, the number of buckets, you're in trouble. In other words, if all of your data elements are multiples of three, and the number of buckets is also a multiple of three, you've got a big problem: everything is going to hash into bucket numbers which are themselves multiples of three, and two-thirds of your hash table will go unfilled. So the upshot is, you really want the number of buckets to not share any factors with the data that you're hashing. How do you reduce the number of common factors? You just make sure the number of buckets has very few factors, which means you should choose n to be a prime number, a number with no nontrivial factors. And let me remind you, the number of buckets should also be comparable to the size of the set you're planning on storing. At no point did we need n to be very closely connected to the number of elements you're storing, just within, say, some small constant factor, and you can always find a prime within a small constant factor of your target. If the number of buckets in your hash table isn't too big, if it's just in, say, the thousands or maybe the tens of thousands, then you can just look up a list of known primes and pick out a prime of about the magnitude you're looking for. If you're going to use a really huge number of buckets, in the millions or more, then there are primality-testing algorithms you can use to help you find a prime in about the range you're looking for. So that's the first-order rule of thumb you should remember if you're using the modulus compression function: set the number of buckets equal to a prime, so you're guaranteed not to have nontrivial common factors of the modulus shared by all of your data. There are also a couple of second-order optimizations which people often mention. You also want the prime to not be too close to patterns in your data. What does that mean, patterns in your data? Well, in the phone number example we saw that patterns emerged in the data when we expressed it base ten: there were crazy amounts of lumping in the first three digits of a phone number written base ten, because those correspond to the area code. And with memory locations, when we expressed them base two, there were crazy correlations in the low-order bits. These are the two most common examples: either there are digits in the base-ten representation, or bits in the base-two representation, where you have patterns, that is, non-uniformity. That suggests that the prime n you choose, all else being equal, shouldn't be too close to a power of two and shouldn't be too close to a power of ten. The thinking is that this will spread out more evenly data sets that do have these patterns in their base-two or base-ten representations. So in closing, this is the recipe I recommend for coding up hash functions if what you're looking to do is minimize programmer time, subject to not coming up with a hash function which is completely broken. But I want to reiterate, this is not the state of the art in hash function design.
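Here is one way you might follow these rules of thumb in code; the 5 percent "too close to a power of two or ten" threshold is an arbitrary choice of mine, not something from the lecture.

```python
def is_prime(k):
    """Simple trial-division primality check; fine for bucket counts in the thousands."""
    if k < 2:
        return False
    d = 2
    while d * d <= k:
        if k % d == 0:
            return False
        d += 1
    return True

def choose_num_buckets(expected_objects):
    """Pick a prime bucket count around 2x the expected number of objects,
    skipping candidates within 5% of a power of 2 or a power of 10."""
    def near_power(k, base, tolerance=0.05):
        p = base
        while p <= 2 * k:
            if abs(k - p) <= tolerance * p:
                return True
            p *= base
        return False

    n = 2 * expected_objects
    while not (is_prime(n) and not near_power(n, 2) and not near_power(n, 10)):
        n += 1
    return n

print(choose_num_buckets(500))  # 1087 under these arbitrary thresholds
```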
There are hash functions which are, in some sense, better than the ones I've explained on this slide. If you're responsible for some really mission-critical code that involves a hash function, you should really study the topic more deeply than we've been able to do here. We'll touch on some of these issues in a different optional video, but really, you should do additional homework. You should find out about the state of the art in hash function design. You should also look into implementations of open addressing and the various probing strategies. And above all, you really should consider coding up multiple prototypes and seeing which one works the best. There's no silver bullet, no panacea, in the design of hash tables. I've given you some high-level guidance about the different approaches, but ultimately it's going to be up to you to find the best implementation for your own application.