Well, we're now in a position to state the final algorithm for approximate point query, and hence the CountSketch algorithm. The algorithm proceeds by first choosing two parameters. We choose the parameter t, which is the number of repetitions of the hashing scheme that we'll perform to boost the confidence of our estimates; one should think of t as logarithmic in the length of the stream. We will set these parameters later in this lecture. The other parameter is b, which is the number of buckets that we will hash our universe of data items into, and b should be thought of as a large constant times k, the number of heavy hitters that we would like to approximately find.

So the algorithm starts by selecting t uniformly random hash functions that map our universe of data items into b buckets, and also t uniformly random sign functions that associate a uniformly random sign with every data element in our universe. This can be conveniently visualized as an array C, shown on the slide. This array contains t rows, corresponding to the t independent repetitions, which we will index by a parameter r, and b columns, where every column corresponds to a hash bucket.

Now the pseudocode for the algorithm is as follows. The algorithm will maintain this array C of counters, and every cell in the array C will correspond to an invocation of the basic approximate point query primitive that we just designed. Specifically, the update process looks like this. Upon seeing a data item i, we run over all the rows r from 1 to t. For each row r, we use the rth hash function to figure out the column that our element i has to contribute to, and we increment the corresponding counter, that is, the cell in row r and column h_r(i), by the sign that the rth sign function assigns to our element i. The pseudocode is given on the slide, and it should be noted that this simply amounts to running, for each row r, our simple approximate point query primitive on the stream of data items that hashed into the same bucket as i itself.

Okay, so this is the update process. Now the estimation process works as follows. At the end of the stream, if we are given an item i whose frequency we want to approximate, we run over all t rows of our hash table. For each row r, we take the product of the counter in row r and the column that i hashes to, with the sign of i under the rth sign function, which is exactly the estimator that we designed for our basic approximate point query primitive. Except that now we have t independent copies of these approximate primitives, and we take the median of their answers as the final result. So for example, the data item 5 hashes into three distinct buckets, one in each of the three rows of the array C shown on the slide. So it will contribute to these three buckets when we perform the update process, and then the answer will be computed as the median of the three values found in these buckets, each multiplied by the corresponding sign.

Okay, well, this is the algorithm. Now the question is why this algorithm provides precise estimates for every single item in the stream with high probability. Well, the first observation here is that, by our previous analysis of the basic estimation primitive, the basic version of approximate point query a few slides back, the mean of each of these t estimators that we take the median of in our new version is exactly correct.
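To make the update and estimation steps concrete, here is a minimal Python sketch of this data structure. It assumes seeded pseudo-random hashing in place of the uniformly random hash and sign functions, and the names CountSketchPointQuery, update, and estimate are illustrative choices, not part of any particular library.

```python
import random
from statistics import median

class CountSketchPointQuery:
    """Approximate point query: a t x b array of counters C,
    plus one hash function and one sign function per row."""

    def __init__(self, t, b, seed=0):
        rng = random.Random(seed)
        self.t, self.b = t, b
        self.C = [[0] * b for _ in range(t)]   # the array C of counters
        # One (hash, sign) seed pair per row; here plain seeded hashing
        # stands in for the random hash and sign functions on the slide.
        self.seeds = [(rng.getrandbits(64), rng.getrandbits(64)) for _ in range(t)]

    def _hash(self, r, i):
        # Bucket h_r(i) in {0, ..., b-1} for item i in row r.
        return hash((self.seeds[r][0], i)) % self.b

    def _sign(self, r, i):
        # Sign s_r(i) in {-1, +1} for item i in row r.
        return 1 if hash((self.seeds[r][1], i)) % 2 == 0 else -1

    def update(self, i):
        # On seeing item i: for every row r, add s_r(i) to cell C[r][h_r(i)].
        for r in range(self.t):
            self.C[r][self._hash(r, i)] += self._sign(r, i)

    def estimate(self, i):
        # Median over the t rows of the signed counters s_r(i) * C[r][h_r(i)].
        return median(self._sign(r, i) * self.C[r][self._hash(r, i)]
                      for r in range(self.t))
```

With t = 3, calling estimate(5) after processing the stream returns the median of the three signed counters that item 5 contributed to, mirroring the three-row example on the slide.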
The mean of each of these estimators, for every r between 1 and t, is exactly the frequency of the ith item. Okay, well, this is reassuring, but as of now this doesn't mean that our algorithm is any better than the basic primitive that we designed before. The distinguishing factor comes in the variance of our estimate. Recall that the variance of the basic primitive that we designed before was equal to the summation, over all data items j different from i, of the frequency of the jth item squared. Now it is possible to show that the variance of our new estimator is the summation, only over those data items j different from i that hashed to the same bucket as i, of their frequencies squared. Our hash functions map the data items to b buckets, so we should expect the variance to be a factor of 1/b smaller than before, and indeed it is possible to prove that this is the case.

Now what this means is that if we choose the number of buckets that we hash into to be sufficiently large, we can drive the standard deviation of each of those t estimates to below our precision parameter epsilon times the frequency of the kth most frequent item, f_k. This is exactly the additive precision that we're shooting for in our approximate point query primitive. So choosing the number of buckets large enough fixes the precision issue. We also have to ensure that our estimates work with high probability for all data items, and this is what we fix by taking medians of a logarithmic number of independent estimates.

Okay, so the formal statement about the precision and success probability of our scheme is given by the following lemma. This lemma says that we should choose the number of buckets to hash into to be at least the following quantity, which is the maximum of two numbers: first, the number of buckets has to be bigger than a large constant times k, the number of heavy hitters that we're shooting to approximate; at the same time, it should also be larger than a large constant times the following ratio, namely the l2 squared norm of the frequencies of the tail of the stream, divided by epsilon squared times the frequency of the kth largest element squared. Okay, so if the number of buckets b is chosen to be sufficiently large in this way, and if we choose the number of repetitions of our hashing scheme to be logarithmic in the length of the stream, then our median estimator is at most epsilon times f_k far from the true answer, for every data item i and at every position in the stream.

Okay, now this is exactly the additive precision that we wanted to get, and it remains to understand what our space complexity is. Recall that the algorithm stores this array C, which has t rows and b columns. The parameter t was chosen to be logarithmic in n; the parameter b, however, depends in non-trivial ways on the actual structure of the stream. So now the question is, how large is b? If it were possible to show that for every stream b is on the order of k over epsilon squared, for example, that would be very nice. Unfortunately, this is not true. Indeed, imagine a stream where every item appears exactly once. Then all of the f_j's are exactly 1, and we get that we need to set b to be Omega of n over epsilon squared. Okay, now this is really terrible, because in space on the order of n we could have stored the entire stream anyway and solved the problem exactly.
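In symbols, the discussion above can be summarized as follows; this is a LaTeX transcription of the statements described on the slide, with c standing for the unspecified large constant, C[r, h_r(i)] denoting the counter for item i in row r, and items indexed in decreasing order of frequency, so that f_1 >= f_2 >= ... .

```latex
% Single-row estimator for item i in row r:
\hat f_i^{(r)} = s_r(i)\, C\big[r,\, h_r(i)\big],
\qquad
\mathbb{E}\big[\hat f_i^{(r)}\big] = f_i,
\qquad
\mathrm{Var}\big[\hat f_i^{(r)}\big]
  = \sum_{\substack{j \ne i \\ h_r(j) = h_r(i)}} f_j^2
  \;\approx\; \frac{1}{b} \sum_{j \ne i} f_j^2 .

% Lemma (parameter choice): if
b \;\ge\; \max\!\left( c\, k,\;\; c\,\frac{\sum_{j > k} f_j^2}{\varepsilon^2 f_k^2} \right)
\quad\text{and}\quad
t = O(\log n),

% then, with high probability, simultaneously for every item i
% and every point in the stream,
\Big|\, \operatorname{median}_{1 \le r \le t} \hat f_i^{(r)} \;-\; f_i \,\Big|
  \;\le\; \varepsilon\, f_k .
```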
However, note that in the important case where the stream is actually dominated by the top k elements, b turns out to be very small. Specifically, suppose we sum the frequencies squared over all the elements j in the tail of the signal, or rather of the stream, and divide this by k. So think of summing the squared frequencies over the tail of the stream and then spreading this equally among the top k elements, the heavy hitters. If this turns out to be on the order of the frequency squared of the kth largest element, then hashing into b on the order of k over epsilon squared buckets turns out to be sufficient to find the heavy hitters. And this is exactly the sense in which we said that we were going to solve the problem for streams where heavy hitters dominate the stream in a certain sense; the actual formal sense is the l2 sense shown on this slide. Now of course, in practice we don't know what the stream looks like, so we cannot choose b according to this formula. What we can do, though, is choose b as large as possible subject to space constraints, because hashing into more buckets only improves the precision guarantees of our algorithm.

Okay, so now I would like to spend a little bit more time talking about the number of buckets that we have to hash into, and to show that this number is actually surprisingly small for some very interesting cases. For simplicity, let's think of k equal to 1. That is, we have a stream and we want to find the most frequent element in the stream. Suppose further that this most frequent element, which by our assumption is element 1, appears square root of N times in the stream, and all the other elements appear exactly once. So there are N minus root N elements in the tail of this stream. Okay, so this means that f_1 is equal to the square root of N, and for every i between 2 and N minus root N, f_i equals exactly 1. So all of the other elements appear once.

Now, the lemma shows that if we set b to be the following quantity, the sum of the frequencies squared over the tail of the stream, divided by epsilon squared times f_k squared, then we are able to find our heavy hitter. All the f_j's in the tail of the stream are 1, so the summation of their squares is bounded by N. On the other hand, f_1 is equal to root N, so f_1 squared is N as well. What this means is that setting b to be a constant divided by epsilon squared suffices to find our heavy hitter, the most frequent element in the stream. And I would like to stress that this is a very remarkable result, because if we think about it, element 1, that is, the heavy hitter, appears in only square root of N positions in the stream. That's a vanishingly small fraction of positions, so if we were to sample a constant number of locations, we would essentially never see this element. This means that the algorithm that we have just constructed is quite powerful.

Okay, well, to summarize, we started this lecture with the goal of designing a small-space data structure that can scan a stream of data items arriving one after another, maintain a small amount of space, and at the end of the stream output an approximation to the top k most frequent items seen in the stream. We showed that this problem can be reduced to the problem of designing a different data structure, which we called approximate point query. This data structure, again, uses a very small amount of space and scans the stream of data items one after another.
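Writing out the arithmetic of this k = 1 example, in the same notation as the LaTeX block above:

```latex
% k = 1, f_1 = \sqrt{N}, and f_j = 1 for the roughly N - \sqrt{N} items in the tail:
\sum_{j > 1} f_j^2 \;=\; N - \sqrt{N} \;\le\; N,
\qquad
f_1^2 = \big(\sqrt{N}\big)^2 = N,

% so the ratio appearing in the lemma's bound on b is
\frac{\sum_{j > 1} f_j^2}{\varepsilon^2 f_1^2}
\;\le\; \frac{N}{\varepsilon^2 \, N}
\;=\; \frac{1}{\varepsilon^2},

% and b = O(1/\varepsilon^2) buckets suffice, even though item 1 occupies only a
% 1/\sqrt{N} fraction of the roughly N positions in the stream.
```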
And at any point, this data structure can be stopped and asked to return a good approximation to the number of times any given data item has been seen so far. Now we gave a very powerful primitive for the approximate point query data structure: it gives very precise estimates with high probability for all data items. This primitive also uses a small amount of space, on the order of k log N, assuming that the top k elements actually dominate the stream in a certain l2 sense. Putting these two pieces together gives us the CountSketch algorithm.
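As a hedged illustration of how the two pieces could be combined, here is one possible way to carry out the reduction in code: feed each item through the point-query sketch from the earlier snippet and keep a k-element heap of the items with the largest estimated frequencies. The function top_k_heavy_hitters and the heap-based bookkeeping are one reasonable implementation choice, not the specific reduction given in the lecture.

```python
import heapq

def top_k_heavy_hitters(stream, k, t, b):
    """One possible reduction from heavy hitters to approximate point query:
    maintain the sketch plus a k-element min-heap of (estimate, item) pairs."""
    sketch = CountSketchPointQuery(t=t, b=b)   # from the earlier sketch
    heap, in_heap = [], set()
    for i in stream:
        sketch.update(i)
        est = sketch.estimate(i)
        if i in in_heap:
            # Refresh i's stored estimate and restore the heap invariant.
            heap = [(est, x) if x == i else (e, x) for (e, x) in heap]
            heapq.heapify(heap)
        elif len(heap) < k:
            heapq.heappush(heap, (est, i))
            in_heap.add(i)
        elif est > heap[0][0]:
            # i's estimate beats the smallest estimate currently kept.
            _, evicted = heapq.heapreplace(heap, (est, i))
            in_heap.discard(evicted)
            in_heap.add(i)
    return sorted(heap, reverse=True)          # approximate top-k as (estimate, item) pairs
```

For instance, on the k = 1 stream from the example above, with t and b set as in the lemma, one would expect this to report element 1 even though it occupies only a 1/sqrt(N) fraction of the stream.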