0:00

I want to show you a few optimization algorithms.

They are faster than gradient descent.

In order to understand those algorithms,

you need to be able to use something called exponentially weighted averages,

also called exponentially weighted moving averages in statistics.

Let's first talk about that,

and then we'll use this to build up to more sophisticated optimization algorithms.

So, even though I now live in the United States,

I was born in London.

So, for this example I got the daily temperature in London from last year.

So, on January 1,

the temperature was 40 degrees Fahrenheit.

Now, I know most of the world uses the Celsius system,

but I live in the United States, which uses Fahrenheit.

So that's four degrees Celsius.

And on January 2,

it was nine degrees Celsius, and so on.

And then about halfway through the year,

a year has 365 days, so day number 180 would be sometime in late May, I guess,

it was 60 degrees Fahrenheit, which is 15 degrees Celsius, and so on.

So, it starts to get warmer toward summer, and it was colder in January.

So, if you plot the data, you end up with this:

day one is sometime in January,

the middle of the plot is around the beginning of summer,

and the end of the plot is the end of the year,

around late December.

So, this data looks a little bit noisy, and if you want to compute the trends,

the local average or a moving average of the temperature,

here's what you can do.

Let's initialize V_0 equals zero.

And then, on every day,

we're going to take a weighted average: 0.9 times the previous value,

plus 0.1 times that day's temperature.

So, V_1 = 0.9 V_0 + 0.1 θ_1, where θ_1 is the temperature from the first day.

And on the second day, we're again going to take a weighted average,

0.9 times the previous value plus 0.1 times today's temperature:

V_2 = 0.9 V_1 + 0.1 θ_2, then V_3 = 0.9 V_2 + 0.1 θ_3, and so on.

And the more general formula is that V on a given day is 0.9 times V from the previous day,

plus 0.1 times the temperature of that day: V_t = 0.9 V_{t-1} + 0.1 θ_t.

So, if you compute this and plot it in red,

this is what you get.

You get a moving average, what's called an

exponentially weighted average of the daily temperature.
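A minimal sketch of this recurrence in Python; the temperature values here are made-up placeholders, not the actual London data:

```python
def ewma(temps, beta=0.9):
    """Exponentially weighted average: v_t = beta * v_{t-1} + (1 - beta) * theta_t,
    starting from v_0 = 0."""
    v = 0.0
    averages = []
    for theta in temps:
        v = beta * v + (1 - beta) * theta
        averages.append(v)
    return averages

# Hypothetical daily temperatures in Fahrenheit.
temps = [40, 49, 45, 44, 50, 55, 60]
smoothed = ewma(temps)
```

With beta = 0.9 the first value is 0.1 × 40 = 4.0, the second is 0.9 × 4.0 + 0.1 × 49 = 8.5, and so on; the smoothed sequence is what gets plotted as the red line.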

So, let's look at the equation we had from the previous slide:

V_t equals beta times V_{t-1}, plus one minus beta, times θ_t.

The 0.9 we had previously

now becomes a parameter beta,

and the 0.1 becomes one minus beta.

So, previously you had beta equals 0.9.

It turns out that, for reasons we'll go into later,

when you compute this you can think of V_t as approximately averaging over

something like 1/(1 - beta) days' temperature.

So, for example, when beta is 0.9 you can think of

this as averaging over the last 10 days' temperature.

And that was the red line.

Now, let's try something else.

Let's set beta to be very close to one,

let's say 0.98.

Then, if you look at 1/(1 - 0.98),

this is equal to 50.

So, think of this as averaging over roughly

the last 50 days' temperature.
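As a quick sanity check on this approximation, you can tabulate the effective window 1/(1 - beta) for a few values of beta:

```python
# Approximate number of days each beta averages over: 1 / (1 - beta).
for beta in (0.5, 0.9, 0.98):
    window = 1 / (1 - beta)
    print(f"beta = {beta}: averages over roughly {window:.0f} days")
```

So beta = 0.9 gives roughly a 10-day window and beta = 0.98 roughly a 50-day window, matching the figures above.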

And if you plot that, you get this green line.

So, notice a couple of things about this very high value of beta.

The plot you get is much smoother because you're now

averaging over more days of temperature.

So, the curve is, you know,

less wavy, it's smoother,

but on the flip side the curve has now shifted further to

the right because you're now averaging over a much larger window of temperatures.

And by averaging over a larger window,

this exponentially weighted average formula

adapts more slowly

when the temperature changes.

So, there's just a bit more latency.

And the reason for that is that when beta is 0.98, you're

giving a lot of weight to the previous value and a much smaller weight, just 0.02,

to whatever you're seeing right now.

So, when the temperature changes,

when the temperature goes up or down,

the exponentially weighted average

just adapts more slowly when beta is so large.

Now, let's try another value.

If you set beta to the other extreme,

let's say 0.5,

then, by the formula we have on the right,

this is something like averaging over just two days' temperature,

and if you plot that you get this yellow line.

And by averaging over only two days' temperature,

it's as if you're averaging over a much shorter window.

So, it's much more noisy,

much more susceptible to outliers,

but it adapts much more quickly when the temperature changes.
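You can see this latency/noise tradeoff numerically with a hypothetical sketch: hold the temperature steady, then make it jump, and check how far each average has moved a few days later (the 40°F/60°F values and the step change are invented for illustration):

```python
def ewma(temps, beta):
    """v_t = beta * v_{t-1} + (1 - beta) * theta_t, starting from v_0 = 0."""
    v, out = 0.0, []
    for theta in temps:
        v = beta * v + (1 - beta) * theta
        out.append(v)
    return out

# 200 days at 40 degrees F, then a sudden jump to 60 degrees F.
temps = [40.0] * 200 + [60.0] * 30

for beta in (0.5, 0.9, 0.98):
    smoothed = ewma(temps, beta)
    # How far has the average moved five days after the jump (day 205)?
    print(f"beta = {beta}: day-205 average = {smoothed[204]:.1f}")
```

Five days after the jump, beta = 0.5 has nearly reached 60 while beta = 0.98 is still close to 40: the small-beta average tracks changes quickly but jitters with every outlier, and the large-beta average is smooth but lags.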

So, this formula is how you implement an exponentially weighted average.

Again, it's called an exponentially weighted

moving average in the statistics literature.

We're going to call it an exponentially weighted average for short, and

by varying this parameter, or, as we'll see later,

this hyperparameter of your learning algorithm, you can get

slightly different effects, and there will usually be

some value in between that works best.

That gives you the red curve, which, you know, maybe gives

a better average of the temperature than either the green or the yellow curve.

You now know the basics of how to compute exponentially weighted averages.

In the next video, let's get a bit more intuition about what it's doing.
