
In the last video, we talked about exponentially weighted averages. This will turn out to be a key component of several optimization algorithms that you'll use to train your neural networks. So, in this video, I want to delve a little bit deeper into intuitions for what this algorithm is really doing.

Recall that this is the key equation for implementing exponentially weighted averages. If beta equals 0.9, you get the red line. If it's much closer to one, say 0.98, you get the green line. And if it's much smaller, maybe 0.5, you get the yellow line.

Let's look a bit more at this to understand how it's computing averages of the daily temperature. So here's that equation again, and let's set beta equals 0.9 and write out a few of the equations that this corresponds to. Whereas when you're implementing it you have t going from zero to one, to two, to three, with increasing values of t, to analyze it I've written it with decreasing values of t. And this goes on. So let's take this first equation here, and understand what V100 really is.

V100 is going to be, let me reverse these two terms, 0.1 times theta 100, plus 0.9 times whatever the value was on the previous day. Now, what is V99? Well, we just plug it in from the next equation. It's 0.1 times theta 99, again with the two terms reversed, plus 0.9 times V98. But then what is V98? Well, you just get that from the equation below: plug in 0.1 times theta 98, plus 0.9 times V97, and so on. And if you multiply all of these terms out, you can show that V100 is 0.1 times theta 100, plus the following.

Now, let's look at the coefficient on theta 99: it's 0.1 times 0.9, times theta 99. Next, the coefficient on theta 98: there's a 0.1 times 0.9, times another 0.9. So if we expand out the algebra, this becomes 0.1 times 0.9 squared, times theta 98. And if you keep expanding this out, you find that this becomes 0.1 times 0.9 cubed, times theta 97, plus 0.1 times 0.9 to the fourth, times theta 96, plus dot dot dot.
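As a quick sanity check (a sketch, not part of the lecture, using made-up temperature values), the recursive update and this expanded weighted sum give the same number:

```python
# Verify that the recursive update v = beta*v + (1-beta)*theta
# expands into 0.1*theta_100 + 0.1*0.9*theta_99 + 0.1*0.9^2*theta_98 + ...
import random

beta = 0.9
thetas = [random.uniform(20, 40) for _ in range(100)]  # made-up daily temperatures

# Recursive form, starting from v = 0
v = 0.0
for theta in thetas:
    v = beta * v + (1 - beta) * theta

# Expanded form: sum of (1-beta) * beta^k * theta_{most recent - k}
expanded = sum((1 - beta) * beta**k * theta
               for k, theta in enumerate(reversed(thetas)))

assert abs(v - expanded) < 1e-9  # the two forms agree
```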

So this really is a weighted sum, a weighted average of theta 100, which is the current day's temperature, seen from the perspective of V100, which you calculate on the 100th day of the year. It's a sum over theta 100, theta 99, theta 98, theta 97, theta 96, and so on.

So one way to draw this in pictures would be as follows. Say we have some number of days of temperature, with theta on the vertical axis and t on the horizontal axis. Theta 100 will be some value, theta 99 will be some value, theta 98, and so on, for t equals 100, 99, 98, across some number of days of temperature. And then what we have is an exponentially decaying function: starting at 0.1, then 0.1 times 0.9, then 0.1 times 0.9 squared, and so on. So you have this exponentially decaying function.

And the way you compute V100 is to take the element-wise product between these two functions and sum it up. So you take theta 100 times 0.1, plus theta 99 times 0.1 times 0.9 (that's the second term), and so on. So it's really taking the daily temperatures, multiplying by this exponentially decaying function, and then summing it up. And this becomes your V100.

It turns out that all of these coefficients add up to one, or very close to one, up to a detail called bias correction, which we'll talk about in the next video. But because of that, this really is an exponentially weighted average.
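A small check (a sketch, not from the lecture) shows why the coefficients come close to one: they form a geometric series summing to 1 minus beta to the n:

```python
# The weights (1-beta)*beta^k over n days form a geometric series
# that sums to 1 - beta**n, which approaches 1 as n grows.
beta = 0.9
n = 100
weight_sum = sum((1 - beta) * beta**k for k in range(n))

assert abs(weight_sum - (1 - beta**n)) < 1e-12  # geometric series identity
assert weight_sum > 0.9999                      # very close to 1 for n = 100
```

The small gap below one for early days is exactly what bias correction addresses.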

And finally, you might wonder how many days' temperature this is averaging over. Well, it turns out that 0.9 to the power of 10 is about 0.35, and this turns out to be about one over e, where e is the base of natural logarithms. More generally, if you have one minus epsilon (so in this example epsilon would be 0.1, since beta is 0.9), then one minus epsilon, raised to the power of one over epsilon, is about one over e, about 0.34 to 0.35. And so, in other words, it takes about 10 days for the height of this curve to decay to around a third, really one over e, of the peak.
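The two numerical claims here are easy to check directly (a sketch, not part of the lecture):

```python
import math

# Rule of thumb: beta**(1/(1-beta)) is approximately 1/e, i.e. the
# weight decays to about a third after roughly 1/(1-beta) days.
assert abs(0.9**10 - 1 / math.e) < 0.02    # 0.9^10 ~ 0.349, 1/e ~ 0.368
assert abs(0.98**50 - 1 / math.e) < 0.01   # 0.98^50 ~ 0.364
```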

So it's because of this that when beta equals 0.9, we say it's as if you're computing an exponentially weighted average that focuses on just the last 10 days' temperature, because it's after 10 days that the weight decays to less than about a third of the weight of the current day.

Whereas, in contrast, if beta were equal to 0.98, then what power do you need to raise 0.98 to in order for it to be really small? It turns out that 0.98 to the power of 50 is approximately equal to one over e. So the weight will be pretty big, bigger than one over e, for the first 50 days or so, and then it decays quite rapidly after that. So intuitively, and this isn't a hard and fast thing, you can think of this as averaging over about 50 days' temperature. Because, in this example, to use the notation here on the left, it's as if epsilon is equal to 0.02, so one over epsilon is 50.

And this, by the way, is how we got the formula that we're averaging over roughly one over (one minus beta) days. Here, epsilon plays the role of one minus beta. It tells you, up to some constant, roughly how many days' temperature you should think of this as averaging over. But this is just a rule of thumb for how to think about it, and it isn't a formal mathematical statement.

Finally, let's talk about how you actually implement this. Recall that we start with V0 initialized to zero, then compute V1 on the first day, V2, and so on. Now, to explain the algorithm, it was useful to write down V0, V1, V2, and so on as distinct variables. But if you're implementing this in practice, here's what you do: you initialize V to zero, and then on day one you set V equal to beta times V, plus one minus beta, times theta 1. And then on the next day, you update V to beta times V, plus one minus beta, times theta 2, and so on.

And some implementations use the notation V subscript theta, to denote that V is computing this exponentially weighted average of the parameter theta. So just to say this again, but in a new format: you set V_theta equal to zero, and then, repeatedly on each day, you get the next theta t, and V_theta gets updated to beta times the old value of V_theta, plus one minus beta, times the current value of theta t.
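The update loop described above can be sketched in a few lines of Python (the temperature values are made up for illustration):

```python
# Exponentially weighted average with a single overwritten variable,
# as described in the lecture: v = beta*v + (1-beta)*theta each day.
def ewa(temperatures, beta=0.9):
    v = 0.0  # V_theta initialized to zero
    for theta in temperatures:
        v = beta * v + (1 - beta) * theta  # the one-line update
    return v

print(ewa([30, 32, 31, 29], beta=0.5))  # → 28.125
```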

So one of the advantages of this exponentially weighted average formula is that it takes very little memory. You just need to keep one real number in computer memory, and you keep overwriting it with this formula based on the latest value you got. And it's really for this reason, the efficiency, that it takes just one line of code, basically, and storage and memory for a single real number, to compute this exponentially weighted average. It's really not the best way, not the most accurate way, to compute an average.

If you were to compute a moving window, where you explicitly sum over the last 10 days' or last 50 days' temperatures and just divide by 10 or by 50, that usually gives you a better estimate. But the disadvantage of that, of explicitly keeping all the temperatures around and summing over the last 10 days, is that it requires more memory, and it's just more complicated to implement and is computationally more expensive.
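The trade-off can be sketched side by side (an illustration with made-up data, not from the lecture): the exponentially weighted average keeps one number, while an explicit 10-day window must store the last 10 readings.

```python
from collections import deque

temps = [25, 26, 24, 27, 30, 29, 28, 31, 33, 32, 30, 29]  # made-up data

# O(1) memory: one number, overwritten each day
beta = 0.9
v = 0.0
for theta in temps:
    v = beta * v + (1 - beta) * theta

# O(window) memory: explicitly keep the last 10 readings
window = deque(maxlen=10)
for theta in temps:
    window.append(theta)
moving_avg = sum(window) / len(window)
```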

So, as we'll see in some examples in the next few videos, where you need to compute averages of a lot of variables, this is a very efficient way to do so, both from a computation and a memory-efficiency point of view, which is why it's used in a lot of machine learning. Not to mention that it's just one line of code, which is maybe another advantage.

So, now, you know how to implement exponentially weighted averages. There's one more technical detail that's worth knowing about, called bias correction. Let's see that in the next video, and then after that, you will use this to build a better optimization algorithm than straightforward gradient descent.
Â