So we need to define the no-collisions, small-variance, and small-deviation events. Let's start with no collisions. We think of i and r as fixed, and we let the no-collisions event for r and i denote the event that, under the hash function h_r, none of the head elements j distinct from i collides with i, that is, hashes to the same bucket as i. Note that this definition makes sense for any data item i, whether it belongs to the head or the tail of the stream. Okay, so why does this event hold with rather high probability? Well, recall that our hash function was chosen at random, so for any two distinct elements i and j, the probability that they collide under the hash function is at most 1 over b, the number of buckets that we hash into. In fact, it is exactly 1 over b, but we only need the inequality in this instance. Good. By the assumption of the lemma, the number of buckets was chosen to be at least a large constant times k, the number of heavy hitters. So now, by a union bound over all j in the head of the stream, the no-collisions event occurs with probability at least 1 - k/b, which is at least 7/8 by our assumption. Great. So let me remind you that we wrote the variance of our estimator as a contribution from the head of the stream plus a contribution from the tail. Conditioned on the no-collisions event that we just defined, the first contribution is exactly zero, because the summation is over the empty set. So we now need to handle the second sum. To that effect, we define the small-variance event. Formally, we say that the small-variance event for r and i occurs if the following condition holds: the sum, over elements j in the tail of the signal that are distinct from i and hash into the same bucket as i under the r-th hash function, of their frequencies squared is at most 8/b (where b is the number of hash buckets) times the sum over all j in the tail of f_j squared.
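As a quick sanity check on that union-bound calculation, here is a small simulation (a hedged sketch, not from the lecture; the function name and the parameters k = 10, b = 80 are illustrative choices satisfying b >= 8k):

```python
import random

def no_collisions_prob(k=10, b=80, trials=20000, seed=0):
    """Empirically estimate the probability that none of k head items
    j != i hashes to the same bucket as item i, under a uniformly
    random hash function into b buckets."""
    rng = random.Random(seed)
    good = 0
    for _ in range(trials):
        bucket_i = rng.randrange(b)
        # Each of the k head items j != i hashes independently.
        if all(rng.randrange(b) != bucket_i for _ in range(k)):
            good += 1
    return good / trials

# The union bound predicts probability at least 1 - k/b = 1 - 10/80 = 7/8;
# the simulation should come out at or slightly above that.
```

The union bound is slightly pessimistic here (the exact probability is (1 - 1/b)^k), which is why the empirical value lands a little above 7/8.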
Well, why does this event hold with high probability? Again, recall that our hash function is uniformly random, so for every pair of distinct items i and j, the probability that i and j collide is at most 1/b. What this means is that we can take the expectation of the left-hand side of this inequality and write it as follows: the expectation of the left-hand side equals the sum, over all j in the tail of the signal distinct from i, of f_j squared times the probability that j and i collide, and that is at most 1/b times the entire sum. Okay, so that is the expectation of the left-hand side. Why is the left-hand side, with rather high probability, at most 8 times its expectation? Well, that's by Markov's inequality. Markov's inequality says that for every non-negative random variable X with mean mu, the probability that X exceeds its mean by a factor of k is at most 1/k. So now what we do is simply apply Markov's inequality to the left-hand side of our expression with k equal to 8. Good. We just proved that the expectation of the left-hand side of our inequality is at most 1/b times the sum, over all j in the tail of the signal, of their frequencies squared. By Markov's inequality, the probability that the small-variance event occurs is at least 1 - 1/8, that is, 7/8. So we wrote the variance of our estimator as the contribution from the head of the stream plus the contribution from the tail of the stream. Then we defined two events, the no-collisions event and the small-variance event, such that, conditioned on both of these events, the first contribution is just zero, because the summation is over an empty set, and the second contribution is at most 8/b times the sum, over all data items j in the tail of the stream, of their frequencies squared. At this point, we need the very last piece: the small-deviation event.
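The Markov step above can also be checked empirically. The sketch below (my own illustration; the tail frequencies, b = 32, and the function name are all made up for the example) draws a fresh random hash function each trial, computes the collision-weighted sum X from the left-hand side, and counts how often X exceeds 8 times its mean:

```python
import random

def markov_check(trials=20000, seed=1):
    """Empirically check Markov's inequality P(X >= 8*mu) <= 1/8 for
    X = sum of f_j^2 over tail items j that collide with item i."""
    rng = random.Random(seed)
    b = 32
    freqs = [rng.random() for _ in range(200)]   # illustrative tail f_j
    mu = sum(f * f for f in freqs) / b           # E[X] = (1/b) * sum f_j^2
    exceed = 0
    for _ in range(trials):
        bucket_i = rng.randrange(b)
        # Each tail item collides independently with probability 1/b.
        x = sum(f * f for f in freqs if rng.randrange(b) == bucket_i)
        exceed += (x >= 8 * mu)
    return exceed / trials
```

Markov's inequality only promises the exceedance fraction is at most 1/8; with many small tail items, as here, X concentrates around its mean and the observed fraction is far below that bound.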
The small-deviation event occurs, by definition, if the squared deviation of our estimator from its mean is at most 8 times its variance. By Chebyshev's inequality, this happens with probability at least 7/8. Okay, so at this point we actually already have a basic version of our lemma, from which the full version will follow by standard concentration inequalities. Now, to state this basic version of the lemma, it is convenient to introduce a parameter gamma, defined at the top of the slide. We let gamma be the square root of 1/b times the sum of the squared frequencies of all items in the tail of the stream. Now that we have proved that the three events we defined occur simultaneously with probability at least 5/8, we have the following claim, stated as the lemma on the slide. If the number of buckets b that we hash into is at least 8 times k, then for every data item i and for every r between 1 and t, the probability that the r-th estimate for the frequency f_i is 8*gamma-close to the true answer is at least 5/8. This can be verified by substituting gamma into our expressions. Okay, so this is very good, but there are two things that we still need to address. First, the success probability is only 5/8 for a fixed data item i and a fixed r, whereas we need our estimate to work with high probability for all data items i; we need to fix that. Second, we need to relate the precision guaranteed by this lemma to the precision that we want to achieve, namely an additive epsilon times f_k error term. So let's fix the success probability first. I claim that using standard concentration inequalities, namely Chernoff bounds, whose details I will not get into here, one can show, starting from the lemma on the previous slide, that if we take a sufficiently large, logarithmic number of independent copies of our estimator, then the following will be true.
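To make the parameter concrete, here is a small helper computing gamma from the tail frequencies (a sketch; the function name and the example numbers are mine, chosen so the arithmetic is easy to follow):

```python
import math

def gamma(tail_freqs, b):
    """gamma = sqrt( (1/b) * sum over tail items j of f_j^2 )."""
    return math.sqrt(sum(f * f for f in tail_freqs) / b)

# Example: a tail with frequencies [3, 4] and b = 25 buckets gives
# gamma = sqrt((9 + 16) / 25) = 1.0, so the basic lemma guarantees
# |estimate - f_i| <= 8 * gamma = 8 with probability at least 5/8.
```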
For every data item i, with all but, say, 1 over n to the 4th probability, the median of our estimators will be 8*gamma-close to the true answer. This follows by standard concentration inequalities, Chernoff bounds, but I will not give the formal proof here. Okay, well, this is very good, because we have just been able to show that our estimator succeeds, that is, gives an answer that is 8*gamma-close to the truth, for every fixed data item i with probability at least 1 - 1/n^4. So now we can take a union bound over all data items in the universe, of which at most n appear in the stream, and conclude that in fact our estimator is correct for all data items simultaneously with probability at least 1 - 1/n^3. So we know that for all data items, with high probability, the estimator is 8*gamma-close. To get a proof of our final lemma, all we need to understand is how this 8*gamma relates to the additive error term of epsilon times f_k, the frequency of the k-th largest element, that we wanted to achieve, and this can be verified by direct substitution. Recall that the number of buckets b was chosen to be appropriately large, where "large" depends on the actual statistics of the stream; this was the assumption of our lemma. One can verify, by direct substitution into the expression for gamma, that 8*gamma turns out to be less than epsilon times f_k, and this concludes the proof of the lemma.
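The median trick behind this amplification step can be illustrated with a simulation (a hedged sketch, not the lecture's formal Chernoff argument; the copy count 101 and the function name are my own illustrative choices). The key fact is that the median of the copies is 8*gamma-close whenever a strict majority of the copies is close, and each copy is close independently with probability at least 5/8 > 1/2:

```python
import random

def median_failure_rate(p_single=5/8, copies=101, trials=20000, seed=2):
    """Each independent copy of the estimator is 'close' with probability
    p_single > 1/2.  The median of all copies is close whenever a strict
    majority of copies is close, so we count how often that majority
    fails to materialize."""
    rng = random.Random(seed)
    fails = 0
    for _ in range(trials):
        close = sum(rng.random() < p_single for _ in range(copies))
        if close <= copies // 2:   # no strict majority of close copies
            fails += 1
    return fails / trials
```

With 101 copies the empirical failure rate already drops well below 1%, and by Chernoff bounds it decays exponentially in the number of copies, which is why a logarithmic number of copies suffices for the 1/n^4 failure probability.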