So, welcome to the lecture about Gradient Descent Updater Strategies.

This is a very important lecture, because choosing

the correct Gradient Descent Updater Strategy can heavily influence your learning.

So remember, in Gradient Descent,

we are starting from one random value, because

that value is generated by the random initialization of the weights.

And from there, you have to find the next step.

So the next step, you find, of course,

by computing the first derivative of

the cost function, and you go in the direction of the steepest descent.

But there are some catches,

so let's start with Gradient Descent as it is.

So in Gradient Descent,

you update the weights,

theta, by subtracting a value from it.

So, the value which we are subtracting is this one.

So, let's have a look how this is computed.

Again, it starts with the cost function J.

The cost function J is,

of course, dependent on theta,

the weights, but also on the training data.

So the training data,

I have exemplified here using the matrix X

and the matrix Y.

So, this is basically the complete data set.

And once this cost function is computed on the complete data set,

we take the first derivative,

we get the gradient,

then we multiply by the learning rate.

so that the update becomes smaller,

and then we subtract it from the actual value of theta_t,

and that gives us theta_t plus one.

So, that's how gradient descent works.
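To make this update rule concrete, here is a minimal Python/NumPy sketch. The least-squares toy cost, the data, the learning rate, and all variable names are my own illustration, not from the slides:

```python
import numpy as np

# Hypothetical toy cost: least-squares J(theta) = mean((X @ theta - y)^2) / 2.
# Its gradient is X.T @ (X @ theta - y) / n, computed on the FULL data set.
def grad_J(theta, X, y):
    return X.T @ (X @ theta - y) / len(y)

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])      # generated with true theta = 2
eta = 0.1                          # learning rate
theta = np.zeros(1)                # random init replaced by zeros for reproducibility

for t in range(200):
    # theta_{t+1} = theta_t - eta * dJ/dtheta(theta_t; X, y)
    theta = theta - eta * grad_J(theta, X, y)
```

Note that every single update touches the whole data set, which is exactly what the variants below try to avoid.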

So, there's a variation.

This is called Stochastic Gradient Descent,

and the only difference is that you are not computing the gradient on the

complete matrix X and the complete matrix Y.

You just take one training example. This is X_i and Y_i.

So once you have computed the gradient,

you already update the parameters theta.

So, a variation of this is the so-called Mini Batch Gradient Descent,

so that's somehow in between.

You don't use the complete data set, but you also don't use just a single instance of your data;

you just use a batch.

So the batch size, you have to define.

Usually, you take values between 32 and 1024 or so,

and then you compute the gradient for that particular batch.

And once this gradient has been computed,

you update the parameter matrix theta.
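A hedged sketch of the mini-batch variant, reusing the same hypothetical least-squares toy cost as before (data, batch size, and learning rate are my own choices; setting the batch size to 1 gives plain Stochastic Gradient Descent):

```python
import numpy as np

# Same hypothetical least-squares toy cost as a stand-in for J.
def grad_J(theta, X, y):
    return X.T @ (X @ theta - y) / len(y)

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])     # generated with true theta = 2
eta, batch_size = 0.05, 2              # batch_size = 1 would be plain SGD
theta = np.zeros(1)
rng = np.random.default_rng(0)

for epoch in range(200):
    idx = rng.permutation(len(y))      # shuffle the examples once per epoch
    for start in range(0, len(y), batch_size):
        b = idx[start:start + batch_size]
        # update theta after every batch, not after the full pass
        theta = theta - eta * grad_J(theta, X[b], y[b])
```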

So now, another way of doing this is called momentum.

So momentum, the idea is that we take also an update of

a past timestep into consideration for computing the update of the current timestep.

So now, we have here a variable called nu.

So, nu_t is computed from nu_t minus one,

and usually, we take a gamma of 0.9.

And to this, we basically add the actual gradient, scaled by the learning rate.

Once you've computed nu_t,

we subtract nu_t from theta_t and we get the updated theta_t plus one.
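A minimal sketch of the momentum update (here the variable v plays the role of nu from the lecture; the least-squares toy cost, data, and step count are my own illustration):

```python
import numpy as np

def grad_J(theta, X, y):               # same hypothetical toy gradient as before
    return X.T @ (X @ theta - y) / len(y)

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])          # generated with true theta = 2
eta, gamma = 0.1, 0.9                  # gamma = 0.9 as mentioned in the lecture
theta, v = np.zeros(1), np.zeros(1)

for t in range(400):
    # v_t = gamma * v_{t-1} + eta * gradient  (the past update carries over)
    v = gamma * v + eta * grad_J(theta, X, y)
    # theta_{t+1} = theta_t - v_t
    theta = theta - v
```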

So, there's a variation of momentum.

This is called Nesterov Accelerated Gradient,

and the only addition here is that in the cost function,

we subtract this term here already from theta_t.

So the usual analogy: this is like a ball rolling down the hill, but it's a smart ball.

So whenever the slope starts to increase again,

the ball stops accelerating and brakes a bit.
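The only change compared to the momentum sketch above is where the gradient is evaluated; again, the toy cost and all constants are my own illustration:

```python
import numpy as np

def grad_J(theta, X, y):               # same hypothetical toy gradient as before
    return X.T @ (X @ theta - y) / len(y)

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])          # generated with true theta = 2
eta, gamma = 0.1, 0.9
theta, v = np.zeros(1), np.zeros(1)

for t in range(400):
    # The only addition vs. momentum: evaluate the gradient at the
    # look-ahead point theta - gamma * v (where the ball is about to be).
    v = gamma * v + eta * grad_J(theta - gamma * v, X, y)
    theta = theta - v
```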

So Adagrad, which stands for Adaptive Gradient,

tries to change the learning rate over time, and not only one rate for the complete batch.

It tries to change the learning rate for each individual parameter.

So you see here, this formula is dependent on i.

So, i stands for the i-th parameter.

You'll see here that the learning rate eta

is modified and it's modified by this term here.

So, G_t,ii is the matrix which contains

information on the past squared gradients per parameter.

And taking this into account,

it just reduces the learning rate.

And note that there is a little term, epsilon,

so that we avoid division by zero.

This whole thing cannot only be computed per

parameter, but also using matrix operations,

and therefore, we can omit the i,

because G_t is a diagonal matrix.

So, don't worry if you don't understand this,

so maybe you have to revisit the linear algebra basics,

but anyway, it's just the way to compute the whole math in just one go.
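A sketch of the per-parameter Adagrad update, again on the hypothetical least-squares toy cost from above; here the vector G holds the diagonal entries G_t,ii, and eta and the step count are my own choices:

```python
import numpy as np

def grad_J(theta, X, y):               # same hypothetical toy gradient as before
    return X.T @ (X @ theta - y) / len(y)

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])          # generated with true theta = 2
eta, eps = 0.5, 1e-8
theta = np.zeros(1)
G = np.zeros(1)                        # diagonal of G_t: per-parameter sum of squared gradients

for t in range(500):
    g = grad_J(theta, X, y)
    G = G + g * g                      # accumulate the squared gradient history
    # per-parameter learning rate eta / sqrt(G_t,ii + eps) shrinks over time;
    # eps avoids division by zero
    theta = theta - eta / np.sqrt(G + eps) * g
```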

So, AdaDelta is only a variation of Adagrad,

and the difference is that it doesn't use the full matrix G,

but it continuously computes a decaying average of the historic squared gradients.
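A sketch of the AdaDelta recurrences, assuming the standard formulation with two decaying averages (one of squared gradients, one of squared updates, so no global learning rate is needed); the toy cost and constants are my own choice:

```python
import numpy as np

def grad_J(theta, X, y):               # same hypothetical toy gradient as before
    return X.T @ (X @ theta - y) / len(y)

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])          # generated with true theta = 2
rho, eps = 0.95, 1e-6                  # decay rate; note: no eta here
theta = np.zeros(1)
Eg2 = np.zeros(1)                      # decaying average of squared gradients
Edx2 = np.zeros(1)                     # decaying average of squared updates

for t in range(3000):
    g = grad_J(theta, X, y)
    Eg2 = rho * Eg2 + (1 - rho) * g * g
    # the update scale adapts from past updates instead of a fixed learning rate
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g
    Edx2 = rho * Edx2 + (1 - rho) * dx * dx
    theta = theta + dx
```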

So there are many, many variations of gradient descent updates,

and I recommend you to have a look at Sebastian's blog later,

where everything is described really nicely.

But the key take-home point here is that

the Gradient Descent Updater strategy is a very important knob you have to tune,

and as usual, tuning a neural network is considered black magic, or trial and error.

So, you just have to try a couple of those.

So, let's actually have a look at Sebastian's blog.

So, Sebastian is this funny guy here.

And you see here,

those are the trajectories of different

learning runs, using different gradient descent updater strategies.

So, it's pretty interesting because it's a very important hyperparameter you can tune.

So, we have covered most of those.

But what I wanted to show you is these two figures here.

So, those two are really interesting.

So you see here, different trajectories in

optimization space using different gradient descent updaters.

So, the red one here is stochastic gradient descent and you see here in both problems,

it performs really poorly.

And it even gets stuck.

So maybe in the first image,

it won't get stuck, but it won't converge for ages; and in the second,

there is a saddle point, so it even gets stuck.

And you can see here that for example,

AdaDelta is performing best among all of those, okay?

I think that's enough for now.

I hope that you now understood that this is

an important hyperparameter that you can tune, and that's basically it.