[MUSIC] So we showed how to implement loops in MapReduce and optimize them, and used that to implement the Datalog language and recursive programs in that language. And we showed different representations of graphs, and now we're gonna show PageRank in a couple of different ways, at scale. So remember the context here is that graphs are getting bigger, and bigger, and bigger. You can imagine a social-scale graph having about 1 billion vertices, one per person, with maybe 100 billion edges. A web-scale graph is bigger, because there are more web pages than people, so maybe 50 billion vertices and 1 trillion edges. But then there's the human connectome: the neural network in your brain has maybe 100 billion vertices and 100 trillion edges. So we need large-scale systems to process this, and MapReduce is one such system. There's more coming down the pipe, though, as MapReduce is perhaps on the tail end of its utility at these massive, massive scales, but we're gonna stick with it for now. So here's an implementation of PageRank in MapReduce. In the map function, we have a node id and a vertex object. That vertex object has a couple of methods that we use: you can get its current PageRank, N.PAGERANK, and you can get its adjacency list, N.ADJACENCYLIST. So N.PAGERANK divided by the length of N.ADJACENCYLIST gives us the fraction of the rank that we're gonna distribute to each of our neighbors, across each of our out-edges. Then we're gonna emit a special key-value pair with the node id and the vertex object, and I'll come back to that in a bit. And then for each of my outgoing neighbors, we send the appropriate fraction of my PageRank, which is the value p that we just computed. Okay, so now on the reduce side, we've got a node id m and a list of things. We initialize a couple of values: an object M to null and a new rank to 0. And then for every p in this list of ps, we're gonna check to see if it's a vertex.
Now, why is that check there? If it is a vertex, that means it corresponds to the key-value pair that we emitted up in the map function. All of that was a way to pass the complex vertex object through to the reducer, because remember, the key here is just the node id, so we need some way of passing this more complex object, with some internal structure, over to the reduce side. And now we have it, and we assign it to M. If it's not a vertex, then it's one of these PageRank contributions that we computed, and so we just add it up. And then finally, the new PageRank for that vertex M is given by the formula that we saw. Well, sorry, I guess I'm being a little glib there: this is the damping factor that we mentioned, and this is an expression equivalent to what we showed before. And finally, now that we've computed the new PageRank, we emit a key-value pair, the node id and the vertex object itself, which is the same key and value types as we need on the map side to repeat this. So we're gonna run this over, and over, and over again, across multiple iterations. So there are some problems with this implementation. The main one is that the entire state of the graph is shuffled on every iteration, right? That complex vertex object, which includes within it the adjacency list for that vertex, is sent across the network over to the reduce side and handled by the reducer. When all that really needs to be sent is just the new rank contributions, if the vertex state could stay in place. If you could address the neighboring vertices directly and just say, hey, here's your new rank contribution, that would save some communication traffic. And then also, we have to control the iteration outside of MapReduce, including the termination conditions and the loop logic itself. So for these reasons, and just to explore another programming model, Pregel was proposed, also at Google, in 2010.
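The map and reduce logic just described can be sketched in plain Python. This is a single-process simulation, not a real MapReduce job; the `Vertex` class, the graph size `NUM_NODES`, and the damping factor value are assumptions for illustration.

```python
# A single-process sketch of PageRank as a MapReduce job.
DAMPING = 0.85      # damping factor
NUM_NODES = 4       # hypothetical graph size, used in the rank formula

class Vertex:
    def __init__(self, pagerank, adjacency_list):
        self.pagerank = pagerank
        self.adjacency_list = adjacency_list

def map_fn(nid, vertex):
    # Emit the vertex object itself so it reaches the reducer intact...
    yield (nid, vertex)
    # ...then send each out-neighbor its fraction of this vertex's rank.
    p = vertex.pagerank / len(vertex.adjacency_list)
    for m in vertex.adjacency_list:
        yield (m, p)

def reduce_fn(mid, values):
    vertex = None
    new_rank = 0.0
    for p in values:
        if isinstance(p, Vertex):
            vertex = p          # the vertex object passed through the shuffle
        else:
            new_rank += p       # a rank contribution from an in-neighbor
    # New rank = (1 - d)/N + d * (sum of contributions).
    vertex.pagerank = (1 - DAMPING) / NUM_NODES + DAMPING * new_rank
    yield (mid, vertex)

def one_iteration(graph):
    # Simulate the shuffle: group every mapped value by its key.
    groups = {}
    for nid, vertex in graph.items():
        for key, value in map_fn(nid, vertex):
            groups.setdefault(key, []).append(value)
    # Reduce each group; the iteration itself lives outside MapReduce.
    return dict(pair for m, vs in groups.items() for pair in reduce_fn(m, vs))
```

Running `one_iteration` repeatedly converges the ranks, and it also makes the inefficiency visible: the whole `Vertex` object, adjacency list and all, is grouped and re-shuffled on every single pass.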
So since then, there have been open-source implementations, just like there were for MapReduce: Apache Giraph; at Stanford, a system called GPS; a system called JPregel; and Hama. And this is really designed for batch algorithms on large graphs in particular, so focused on graphs. The basic logic looks something like this: while any vertex is still active, or the max number of iterations has not been reached, for each vertex, process all the messages you got from your neighbors, update your own internal state, and then send messages back out to your neighbors, and maybe set the active flag appropriately. If you don't get any messages, then you're not active; or if your internal state doesn't change enough, you're not active, and you can control this logic. And so the point is, rather than writing a map function and a reduce function, you're essentially writing just one little function, which is the function that should run on each vertex at every time step. [MUSIC]
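That vertex-centric loop might be sketched like this in Python. The `Vertex` class, the message-passing simulation, and the convergence threshold are all assumptions for illustration; a real system like Apache Giraph would distribute the vertices across workers and run the supersteps in bulk-synchronous fashion.

```python
# A minimal single-process sketch of a Pregel-style vertex program.
DAMPING = 0.85      # standard PageRank damping factor
EPSILON = 1e-4      # hypothetical convergence threshold

class Vertex:
    def __init__(self, nid, out_edges, num_nodes):
        self.nid = nid
        self.out_edges = out_edges      # adjacency list stays in place
        self.num_nodes = num_nodes
        self.rank = 1.0 / num_nodes
        self.active = True

    def compute(self, messages):
        # The one function you write: process incoming messages,
        # update local state, then send messages back out to neighbors.
        if messages:
            new_rank = (1 - DAMPING) / self.num_nodes + DAMPING * sum(messages)
            # Deactivate if the rank barely changed.
            self.active = abs(new_rank - self.rank) > EPSILON
            self.rank = new_rank
        # Only the small rank contributions travel, not the whole vertex.
        share = self.rank / len(self.out_edges)
        return [(m, share) for m in self.out_edges]

def run(vertices, max_iters=50):
    # The system, not the user, drives iteration and termination.
    inbox = {nid: [] for nid in vertices}
    for _ in range(max_iters):
        if not any(v.active for v in vertices.values()):
            break
        outbox = {nid: [] for nid in vertices}
        for v in vertices.values():
            for dest, msg in v.compute(inbox[v.nid]):
                outbox[dest].append(msg)
        inbox = outbox
    return vertices
```

Compare this with the MapReduce version: the adjacency lists never move, the messages are just small numbers, and the while-loop with its termination condition belongs to the system rather than to driver code you write outside the framework.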