[MUSIC] So, now this top chunk gets assigned to Map Task 1 and it produces this little histogram of accounts of yellow words, red words, blue words and pink words. And, we can imagine you should think about how the code might look if you had to write this. Right, you would need to take the length, you know either go over all the words in the document just like we did before. But now instead of it needing a key being the word itself, you would count its length and put in a case statement and figure out what color it is in this notation and add that to account. So the output is these four key value pairs. And in chunk two we do the same thing and produce a different set of four key value pairs. Then in the shuffle step, all the yellow key value pairs will be grouped together, and there's two of them. All the red ones get grouped together and so on. And then in the reduce phase, you'll add these two numbers together to produce 37, okay. So the structure here is really identical to word count. It's just basically a change to the map function and in fact, in this case, you could actually literally use the exact same reduce function. All right, for every key, add up all the contributions of it. You wouldn't need to change your use at all. And this is something else you'll see is that, you know, certain functions are more general than others and you'll reuse them time and again. For example, counting things and adding things up is pretty common in these map reduce algorithms. And so reduce, general reduce functions that add things and count things come up time and again. Okay, fine. So word count, generally the canonical place to start when thinking about map reduce. Word length, very minor variation on that. So let's think of another sort of minor variation on that. And this one, inarguably is even simpler than word count. So here we want to build an inverted index. And what an inverted index is when you have a corpus of documents, you can presumably efficiently access any given document by it's name, right. So if you want to look up a URL on the web, you can do so. But for a search engine, a very primitive search engine, you might want to look up documents that contain a particular word. And so do building this index to support word lookup to provide a list of document IDs that contain that word is called an inverted index and it's one of the primitive steps in doing any kind of text retrieval kind of system. Okay. So now, you know, imagine we just had tweets instead of documents. And the input here is that the keys are these tweet ID's that I've invented, and the value is the text of the tweet itself. And so the desired output here is the word along with a collection of tweet IDs. Okay? So how do we do this? Well, again, the code is actually simpler than it was before. Because now on the reduce phase, instead of and so, well, think about it for a moment. The map phase, instead of producing each word, pancake, and 1, for an occurrence. We won't do that. Instead, we'll put out pancake, and the document ID itself. Right? And all of these guys will be produced. And so this tweet1, the map task that processes this tweet1 will put out tweet1, love, tweet1 and so on. Okay, further in optimization here is if you see the word, I guess I should've had an example of this in here. But if you see the word pancake twice in the same tweet, do you need to put it out, do you need to emit the key value pair twice? Probably not to build this index because all you're trying to record is that the tweet contains the word "pancake". Not that it,not how many times it appears. So fine, these key value pairs get shuffled across the network. And now the reduce task, what does it do? Well, it's going to get a key pancakes. This should have an s on it I guess. And it's gonna have an iterator over all the document IDs that contain pancakes, which in this case is tweet1 and tweet2. So do we need to do any processing on this key and group of values? The answer in this case is no. The reduce function is complete, is nonexistent. There's nothing to do. Right? The group is exactly what you want in this case. And this pattern actually shows up somewhat often, as well. Where you do want the map and you do want the shuffle face to do the grouping, but all you want to do is perform the grouping, you didn't actually care about the reduce function. So you're not counting these things, you're not adding them up in anyway, you're not doing any processing on the tweets, you just wanna omit that. And that's perfectly fine, because this group of values is perfectly serviceable as a value itself. Okay? So if you are used to say relational databases you know, nested structures, collections of values within a single cell in a table in a single row are generally disallowed, all right. And this is actually first normal form if you're familiar with that. But here we don't care, its any key and any value. A key can have sub-structure and a value can have sub-structure. Okay, fine. So, that is how to build and inverted index in map reduce. Let me stop there and next time we will talk about this relational join example. [MUSIC]