So in the next series of lectures, we will be discussing a variety of peer-to-peer systems that came out of academic research. Some of these systems have been deployed in the wild. The first system we will study in some amount of detail is called Chord, which arguably was one of the first few peer-to-peer systems to be designed in academia. It is also interesting because it has several concepts and techniques that are widely used in today's key-value stores and NoSQL storage systems, which are very popular in cloud computing today.

A lot of you might be familiar with the data structure called a hash table. A hash table is a data structure, typically maintained inside one running process on one machine, which allows you to insert, look up, and delete objects with unique keys, performing these operations in essentially order-1, or constant, time. A hash table stores these objects in buckets based on a hash of the key, and this allows you to look up, insert, and delete each key fairly quickly.

A distributed hash table, also known as a DHT, is a similar data structure, except that it runs in a distributed system rather than in a single process. Here the objects are files, and files have unique keys; maybe the file name is the key. However, instead of storing the objects in buckets, you store the objects at nodes, or hosts, or machines in a cluster, and the cluster might be distributed throughout the world. Similar to the regular hash table, you have some concerns in the distributed hash table. One is load balancing: you want each node or host to store about the same number of objects as everyone else. You don't want some hosts overloaded with objects while others have far fewer. However, unlike the hash table, where you can't lose one bucket but still have the others around,
in a distributed hash table you could lose one node but still have the others around, so fault tolerance becomes a concern here. This means essentially that even though some of the nodes might fail, or might just leave the system due to churn, you don't want to lose any objects. Just like with a regular hash table, you want to have efficient lookups and inserts, and perhaps even delete operations. And finally, you want locality, which essentially means that you want the messages transmitted among the cluster nodes to travel preferably among nodes that are close by in the underlying network topology, or in terms of the Internet distance underneath.

So Napster, Gnutella, and FastTrack, the systems that we have discussed so far, are all kind of distributed hash tables, but not really, because they don't really try to optimize the insert, lookup, and delete times. As we will see on the next slide, Chord is one of the first peer-to-peer systems that tries to directly address this problem and tries to bring the insert, delete, and lookup times down low enough that we can claim that it is in fact a DHT, or distributed hash table.

So here Napster and Gnutella are compared along three metrics: the memory at both the client and the server (if applicable), the lookup latency, and the number of messages for a lookup. Let's look at Napster first. Napster at the client essentially does not store much information other than the files that have been uploaded by the user, so we'll just count that; the only other information stored is the address of the server or servers it is talking to. However, at the server, if there are N clients in the system and each one is uploading some constant number of files, the server stores directory information that turns out to be order-N information; essentially, it is linear in the number of clients in the system.
The lookup latency is essentially just one round-trip time; ignoring the time for the servers to look up their internal ternary tree, it is just order 1. The number of messages for a lookup is also order 1, because it's just one round trip. However, the server load is fairly high: it's order N.

For Gnutella, however, where you don't have servers and you only have the overlay graph, the memory might be as high as order N: if a client does not have a limit on the number of neighbors, then you could have peers with order-N immediate one-hop neighbors in the underlying Gnutella overlay. The lookup latency could be as high as order N in the case of a degenerate Gnutella topology that essentially looks like a line. So, say you have a Gnutella topology where all the nodes are joined in one straight line, and a file that is located only at one end of the entire topology: the file F is located at one end, and the query for it starts from the other end. There's only one copy of the file, and because there are N - 1 links, you have a lookup latency that is order N, and the number of messages for the lookup is 2(N - 1), which is order N as well.

Order N is a fairly large number, especially when you're considering millions of nodes in the system, and you really want to do better than this. That's where Chord comes into play. It essentially makes all of these quantities (the memory, the lookup latency, and the number of messages for a lookup) order log N in expectation. Why is log N nice? Well, log N is nice because for practical purposes it's almost as good as constant, if we are taking, say, log base 2: log base 2 of a thousand is 10, log base 2 of a million is 20, log base 2 of a billion is only 30, and when you consider the number of IPv4 addresses, that's just 2^32, and log base 2 of that is just 32. So even though it's not really constant, for practical purposes
it's considered to be a constant. In any case, we won't brush that under the carpet; we'll still refer to it as log N.

So how does Chord achieve this? Let's look at that in a little bit of detail. A little bit of history: Chord was developed by researchers from both Berkeley and MIT. Essentially, Chord uses a technique where each of the nodes in the peer-to-peer overlay selects its neighbors in an intelligent fashion. Earlier, in Gnutella, nodes essentially selected their neighbors based on metrics like the number of files shared or the number of kilobytes shared, using the ping and pong messages. Here, there are certain rules that the nodes use to decide who their neighbors are, and we'll discuss those rules over the next few slides.

So Chord uses what is known as consistent hashing. Consistent hashing has different interpretations; I'll give one interpretation here and the more popular interpretation later on. Consistent hashing basically means that you take a peer's IP address and port number, which uniquely identify it, and you apply a well-known hash function such as SHA-1, which stands for Secure Hash Algorithm. This is a well-known cryptographic function. The output of this function is a 160-bit string; no matter how large its input is, the output is always 160 bits. The nice thing about this hash function is that if you run it on the same input, no matter where you're running it, no matter which host you're running it on, you're going to get the same 160-bit output. So you take this 160-bit output and truncate it to m bits, where m is a system parameter, and this gives you a peer ID, which is essentially an m-bit string, an integer between 0 and 2^m - 1, both inclusive. Now you can hash multiple peers' addresses (IP address and port number pairs) using this; if m is large enough, then conflicts become very unlikely.
You're not guaranteed to have no conflicts, but they are very unlikely as long as m is large enough. Essentially, you want 2^m to be much greater than the total number of nodes or hosts in your system. What this means is that you can draw a circle which consists of 2^m logical points running from 0 to 2^m - 1, and if you have a set of nodes, you hash each of these nodes and place it at its corresponding hashed peer ID on this circle. So when you hash a node's address and then truncate it to m bits, you get a number that is its peer ID. This node over here has a peer ID of 16, obtained when you hash its IP address and port number, and so we'll call it N16. Similarly, there is another node with a peer ID of 32, another node with a peer ID of 45, and so on and so forth. There are six nodes in this particular system. Here we are using m = 7, which means that these numbers run from 0 to 127, the last point before the 0. This is a ring, and it loops around: the point to the clockwise of 127 is 0, and then you have 1, and so on and so forth.

Now, what are the neighbors that peers maintain? The first type of neighbors that the peers maintain, as shown in the figure, are successors: every node knows its immediate clockwise successor in the ring. So N16 knows the IP address and port number of N32 and can send messages directly to it, N32 knows the IP address and port number of N45 and can send messages directly to it, and so on and so forth. Finally, N112 knows the IP address and port number of N16, so it can send messages directly to it, and that completes the ring. These peer pointers are just the successors of each node. Similarly, you can have predecessors, where each node knows its counter-clockwise (or anti-clockwise) neighbor in the ring, but for most practical purposes Chord only uses the successors.
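As a concrete sketch of the peer-ID computation just described (SHA-1 over an "IP:port" string, truncated to m bits; the IP, port, and the convention of keeping the low-order m bits via mod 2^m are illustrative choices, not mandated by the lecture):

```python
import hashlib

def peer_id(ip, port, m):
    """Hash an 'IP:port' pair onto the m-bit Chord ring.

    SHA-1 always yields 160 bits; truncating to m bits (here, keeping
    the low-order m bits via mod 2^m) gives an integer in [0, 2^m - 1].
    """
    digest = hashlib.sha1(f"{ip}:{port}".encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** m)

# The same input always yields the same ring position no matter which
# host computes it, which is what lets every node agree on the ring.
```

With m = 7 this gives a point between 0 and 127, just like the six nodes on the example ring.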
So that's the first kind of peer pointer. The second kind of peer pointer, which is somewhat more complex, is called the finger table, and it is really useful for routing your queries fairly quickly. In a system using the parameter m, each node has m finger table entries. So say we have m = 7; then the finger table entries run from i = 0 through 6, that is, from 0 to m - 1. Here is the rule for the finger table: the i-th finger table entry at the peer that has peer ID n is the first peer whose ID is immediately at or to the clockwise of n + 2^i, taken modulo 2^m.

So let's do an exercise for node N80. The value of n for N80 is just 80. Let's say i = 0; the 0-th finger table entry at node N80 is this one over here, N96. How did we get 96? Well, n + 2^i is 80 + 2^0, which is 81. That's a point somewhere here on the ring, and the node immediately to the clockwise of that point is N96, and that's why we say N96 is in fact the 0-th finger table entry at N80. So N80 knows N96's IP address and port number because of this 0-th finger table entry. Similarly, when you set i = 1, you get 80 + 2^1, which is 82; again, the node immediately to the clockwise of it is N96. You keep on going that way until you reach i = 4, when you have 80 + 2^4, which is 96, and again the entry is N96, because 96 sits exactly at n + 2^i. Now, when you have i = 5, that's 80 + 2^5, which is 80 + 32, or 112, and that gives you N112 as the fifth finger table entry at node N80. If, instead of 112, we had for instance a node at 113 over here, then N113 would have been the fifth finger table entry at node N80; you're searching at n + 2^i or immediately to the clockwise of it. Now, coming to i = 6: n + 2^6 is 80 + 64, which is 144.
We write this down here: that's 144, but you now need to take it modulo 2^m, because 144 doesn't appear on the ring. So you take 144 mod 128, and that is 16. So you have this point on the ring over here, and N16 becomes the sixth finger table entry at node N80. One of the things you'll notice here is that as you go along the ring, as you increase the value of i, the distance from one finger table entry to the next doubles, and there is a reason for this: we want to use these entries for our searches so that the searches are fast, so that they are log N, and you'll see that in just a little bit.

So that's how we place nodes on the ring. How do we place files? How do we decide where files get placed? Unlike Napster and Gnutella, where clients store their own files and don't upload them, here by default files are stored at specific nodes, based on the same rules that we used for placing the nodes on the ring. In other words, you take the file name, which we assume to be unique across the entire system, you apply the same hash function to it, the SHA-1 secure hash algorithm, and you get a 160-bit string. Then you truncate it, just like we did for peer IDs, and you map this file onto a point on the ring. So, going back to the previous figure, the file might map to, say, point 34, which is somewhere over here. Then the file is stored at the first peer that is immediately to the clockwise, or right, of that point. Again, you take values modulo 2^m so that you wrap around the ring. So, for instance, if I have a file named cnn.com/index.html, and it maps to a key of, say, 42 once you hash it and truncate it, then it is going to be stored at the first peer immediately to the right of 42, which in our previous example would be N45.
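The finger-table rule and the file-placement rule can be sketched together. This is a minimal sketch using the lecture's example ring (m = 7; nodes 16, 32, 45, 80, 96, 112); `successor` finds the first peer at or clockwise of a ring point:

```python
M = 7                              # the ring has 2^7 = 128 points
RING = [16, 32, 45, 80, 96, 112]   # the lecture's example nodes, sorted

def successor(point):
    """First peer at or immediately clockwise of a ring point."""
    point %= 2 ** M                # wrap around the ring, e.g. 144 -> 16
    for peer in RING:
        if peer >= point:
            return peer
    return RING[0]                 # past the largest peer: wrap to the smallest

def finger_table(n):
    """The i-th entry is successor((n + 2^i) mod 2^m), for i = 0..m-1."""
    return [successor(n + 2 ** i) for i in range(M)]

# N80's fingers: 96 for i = 0 through 4, 112 for i = 5, and 16 for i = 6
# (80 + 64 = 144, and 144 mod 128 = 16).
# A file whose name hashes to key 42 is stored at successor(42) = N45.
```

Note how the candidate points n + 2^i double in spacing, which is exactly what makes the searches below fast.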
But notice here that I'm assuming unique file names, and I'm using the URL as the unique file name. This is kind of intentional. Typically, peer-to-peer systems that have been deployed in the wild are used to exchange MP3 and MPEG files, but this is not a limiting factor. You can use peer-to-peer systems to develop other applications, such as cooperative web caching, where client browsers across a large population of clients share their web results with each other, so that browsing is faster because you're able to fetch pages that have already been fetched by another client near you. In that case, the URLs become the names of the objects, and the objects here are the web pages themselves. Essentially, what I'm trying to say is that peer-to-peer systems can be used for storing any kind of object, as long as the objects have unique names.

Now, the more popular notion of consistent hashing essentially says that with K keys in the system and N peers in the system, each peer will store order K/N keys. When I say order K/N, this means that the number of keys at a peer is less than c*K/N for some constant c, with high probability. Essentially, this means that you have good load balance across the different peers, or nodes, in your system. Remember that load balancing was one of our goals when we started out.

Now, we still haven't talked about how the rest of the system works. Once again, pictorially, here is where the file with key 42 would be stored: at N45, immediately to the clockwise of where it maps. How does search work? That's the next thing we need to discuss. Suppose N80 wants to search for cnn.com/index.html. The first thing it does is hash it and truncate it to m bits, getting 42.
Now it knows that it needs to route to the point 42 on the ring, or rather to the node that is immediately to the clockwise of 42. How does it do this? Here is the search algorithm, and it is applied recursively. At node n (right now, n is just 80), when you have a query destined for key K (in this case K is 42), you forward the query to the largest successor or finger table entry, essentially your largest neighbor; when I say largest, I mean the one that is the most to the right, wrapping around the ring, that is still to the left of K. If none exists, then you send the query to your successor. This second line ensures that even if the finger table entries are wrong, or if they are not present, then as long as the successors are correct, you end up routing the query to the correct server eventually. It might take a long time, it might take N hops in the worst case, but you end up routing it to the right node. Let's ignore that second part for now.

So, N80 (remember that it had the finger table entries 96, 112, and 16): the one among them that is the farthest to the right, the farthest away clockwise from N80, that is still to the left of 42, is 16, and that's why N80 will forward the query to N16. When N16 receives this, it does likewise: it calculates, among its neighbors (its successors and finger table entries), which is the most to the right but still to the left of 42. Notice that N16 will have N32 as a finger table entry, because 16 + 2^4 is 32, but 16 + 2^5 is 16 + 32, which is 48, meaning that the next finger table entry after 32 at N16 is N80. This means that N16 does not even know about N45's existence, and so N16 has only two choices, 32 or 80, to forward the query to. Since 32 is to the left, or counter-clockwise, of 42, it forwards the query to N32. N32 tries to do the same, but it doesn't have a neighbor that is to the left of 42, so it simply forwards the query to its successor.
And that's where the second line comes into play, and the query makes its way to N45. Whenever a node receives a query, in addition to running this algorithm, it also checks its local files to see if any of them match; in this case N45 matches, and it can respond directly back to N80. Now, in this case I've drawn three arrows here. Each of these is a hop that the query takes, and these hops are essentially RPCs, or remote procedure calls, a well-known abstraction in distributed systems.

So I claim that this algorithm takes order log N time. Why is that? Essentially, what happens is that if you consider the points on the ring, whenever a query takes a step, whenever it takes a hop from one peer to another, the distance between where the query is on the ring and the point where the key is (this dotted line) goes down by a factor of at least two. What I'm saying is that if this is where the query currently is, and this is where the key is on the ring, then if you consider the second half of this particular segment, that's where the query is going to jump to next. Why is this? You can prove it by contradiction: if this were not true, if the query jumped into the first half of the segment over here, then you can show, because of the doubling of the finger table entries, that there's going to be at least one finger table entry in the second half, which this node must know about, and you reach a contradiction. Essentially, because the finger table entries double, this node over here is going to have at least one finger table entry in the second half of the segment, to the left of key K, and it is going to forward the query to that one.
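The greedy routing just traced can be sketched as a small simulation on the lecture's example ring (m = 7; nodes 16, 32, 45, 80, 96, 112). This simplifies Chord's actual protocol to a loop over the successor and finger table entries:

```python
M = 7
RING = [16, 32, 45, 80, 96, 112]   # sorted; the lecture's example nodes

def successor(point):
    """First peer at or immediately clockwise of a ring point."""
    point %= 2 ** M
    return next((p for p in RING if p >= point), RING[0])

def dist(a, b):
    """Clockwise distance from point a to point b on the ring."""
    return (b - a) % (2 ** M)

def route(start, key):
    """Peers visited by a query for `key` starting at peer `start`."""
    path, node = [start], start
    while node != successor(key):          # stop at the key's owner
        fingers = {successor(node + 2 ** i) for i in range(M)}
        fingers.add(successor(node + 1))   # the immediate successor
        # farthest clockwise neighbor that does not pass the key ...
        cands = [p for p in fingers if 0 < dist(node, p) < dist(node, key)]
        # ... else fall back to the successor (the algorithm's second line)
        node = (max(cands, key=lambda p: dist(node, p))
                if cands else successor(node + 1))
        path.append(node)
    return path

# The lecture's example: N80's query for key 42 hops 80 -> 16 -> 32 -> 45.
```

The three arrows in the figure correspond to the three hops 80 to 16, 16 to 32, and 32 to 45.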
So, essentially, after log N forwardings, the distance to the key has decreased by a factor of at least 2^(log N), and that's just N, because we're taking log base 2 here. So the distance between where the query is and where the key is, in terms of points on the ring, is at most 2^m / N. This means that the query can now use just successors to cover this small section of the ring, which has 2^m / N points, to reach the eventual key. Using balls and bins, you can show that in this small segment of the ring there are only order log N peers with high probability, so once you have done these log N forwardings, even using only successors, you can only hit another order log N peers with high probability. That's log N plus log N, which is still order log N hops for the entire query to reach where the key is.

Now, the algorithm that I've described to you so far is not just for searching. So far we have assumed that it is for searching, but you can essentially use the same algorithm to route to any key, and the routing message is independent of what operation you're doing. So the routing message might say, hey, I want to insert this key, or I want to delete this key, or I want to update or look up this key; it doesn't matter. Routing is essentially the building block that we have discussed so far, and it can be used for any of these hash table, or DHT, operations.

So far we have said, and shown, that the time to route a message to any key is order log N, but this is true only if the finger tables and the successor entries are correct. When could they be wrong? Well, they could be wrong when you have peers leave the system, whether they fail or whether they just churn out, and the corresponding successor and finger table entries have not been updated.
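To see this logarithmic behavior concretely, here is a small experiment under illustrative assumptions (a 16-bit ring with randomly placed peers and random query keys, none of which come from the lecture): because each greedy hop at least roughly halves the remaining clockwise distance to the key, hop counts stay small even on a ring with tens of thousands of points.

```python
import random

M = 16                  # illustrative ring size: 2^16 = 65,536 points
SPACE = 2 ** M

def successor(point, ring):
    point %= SPACE
    return next((p for p in ring if p >= point), ring[0])

def dist(a, b):
    return (b - a) % SPACE

def route_hops(start, key, ring):
    """Hop count for greedy finger-table routing from `start` to `key`."""
    node, hops = start, 0
    while node != successor(key, ring):
        fingers = {successor(node + 2 ** i, ring) for i in range(M)}
        fingers.add(successor(node + 1, ring))
        cands = [p for p in fingers if 0 < dist(node, p) < dist(node, key)]
        node = (max(cands, key=lambda p: dist(node, p))
                if cands else successor(node + 1, ring))
        hops += 1
    return hops

def max_hops(num_peers=200, num_queries=100, seed=1):
    """Worst hop count over random queries on a random ring."""
    rng = random.Random(seed)
    ring = sorted(rng.sample(range(SPACE), num_peers))
    return max(route_hops(rng.choice(ring), rng.randrange(SPACE), ring)
               for _ in range(num_queries))
```

With 200 peers, log base 2 of 200 is about 7.6, so lookups should finish in a handful of hops even though the ring itself has 65,536 points.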