So welcome to the chat about gradient descent updater strategies. This is a very important lecture, because choosing the right gradient descent updater strategy can heavily influence your learning. So remember, in gradient descent we are starting from one random point, because that point is generated by the random initialization of the weights. And from there, you have to find the next step. The next step you find, of course, by computing the first derivative of the cost function and going in the direction of the steepest descent, but there are some catches.

So let's start with gradient descent as it is. In gradient descent, you update the weights theta by subtracting a value from them. So let's have a look at how the value which you are subtracting is computed. Again, it starts with a cost function J. The cost function J is of course dependent on theta, the weights, but also on the training data. The training data I have exemplified here using a matrix X and a matrix Y, so this is basically the complete data set. Once this cost function is computed on the complete data set, we take the first derivative and get the gradient. Then we multiply by the learning rate, so the update will be smaller, and subtract the result from the actual value theta_t, which gives us theta_{t+1}:

θ_{t+1} = θ_t − η · ∇_θ J(θ_t; X, Y)

So that's how gradient descent works. There is a variation called stochastic gradient descent, and the only difference is that you are not computing the gradient on the complete matrix X and the complete matrix Y; you just take one training example, x(i) and y(i). And once you've computed the gradient, you already update the parameters theta. A variation of this is the so-called mini-batch gradient descent, which is somehow in between: you don't use the complete data set, but you don't use just a single instance of your data either. You just use a batch.
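The three variants differ only in how much data feeds each gradient computation. Here is a minimal NumPy sketch, assuming a simple least-squares cost; the data, learning rate, batch size, and iteration counts are illustrative choices, not from the lecture.

```python
import numpy as np

# Toy least-squares problem: J(theta) = ||X @ theta - y||^2 / (2m).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta

def gradient(theta, Xb, yb):
    # Gradient of the mean-squared-error cost on the (mini-)batch (Xb, yb).
    return Xb.T @ (Xb @ theta - yb) / len(yb)

eta = 0.1  # learning rate (illustrative)

# 1) Batch gradient descent: one update per pass over the full data set.
theta = np.zeros(3)
for _ in range(200):
    theta -= eta * gradient(theta, X, y)
theta_batch = theta

# 2) Stochastic gradient descent: update after every single example.
theta = np.zeros(3)
for _ in range(20):
    for i in rng.permutation(len(y)):
        theta -= eta * gradient(theta, X[i:i+1], y[i:i+1])
theta_sgd = theta

# 3) Mini-batch gradient descent: update after each batch of 32 examples.
theta = np.zeros(3)
batch_size = 32
for _ in range(50):
    idx = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        b = idx[start:start + batch_size]
        theta -= eta * gradient(theta, X[b], y[b])
theta_minibatch = theta
```

All three recover the true parameters on this noiseless toy problem; they differ in how noisy each individual update is and how often the parameters get updated per pass over the data.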
The batch size we have to define; usually you take values between 32 and 1024 or so, and then you compute the gradient for that particular batch. And once this gradient has been computed, you update the parameter matrix theta.

Now, another way of doing this is called momentum. The idea of momentum is that we also take the update of a past time step into consideration when computing the update of the current time step. So now we have a variable called nu. nu_t is computed from nu_{t-1}, scaled by a factor gamma, where usually we take a gamma of 0.9, and to this we add the learning rate times the actual gradient:

ν_t = γ · ν_{t−1} + η · ∇_θ J(θ_t)

Once you've computed nu_t, we subtract nu_t from theta_t and we get the updated theta_{t+1}. There is a variation of momentum called the Nesterov accelerated gradient. The only addition here is that, inside the cost function, we already subtract the term γ·ν_{t−1} from theta_t, so the gradient is evaluated at the looked-ahead position. You could say this is like a ball rolling down the hill, and it's a smart ball: whenever the slope starts to increase, the ball stops accelerating and brakes a bit.

So Adagrad, which stands for adaptive gradient, tries to change the learning rate over time, and not with one value for the complete batch: it tries to adapt the learning rate for each individual parameter. You see here that this formula is dependent on i, where i stands for the individual parameter. We see that the learning rate eta is modified by this term here: G_t is the matrix whose diagonal entries G_{t,ii} contain information on the past squared gradients per parameter, and taking this into account reduces the learning rate over time. And note that there is a little term epsilon, so that we avoid division by zero:

θ_{t+1,i} = θ_{t,i} − η / √(G_{t,ii} + ε) · g_{t,i}

This whole thing cannot only be computed per parameter, but also using matrix operations, and therefore we can omit the i, because G_t is a diagonal matrix. So don't worry if you don't understand this; maybe you have to revisit the linear algebra basics. But anyway, it's just a way to compute the whole math in one go.
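To make the momentum, NAG, and Adagrad update rules concrete, here is a hedged NumPy sketch on the same kind of least-squares cost; γ = 0.9 follows the value mentioned in the lecture, while η, ε, the data, and the iteration counts are illustrative assumptions.

```python
import numpy as np

# Same toy least-squares cost as before.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta

def gradient(theta):
    return X.T @ (X @ theta - y) / len(y)

eta, gamma, eps = 0.1, 0.9, 1e-8

# Momentum: nu_t = gamma * nu_{t-1} + eta * grad(theta_t); theta_{t+1} = theta_t - nu_t
theta, nu = np.zeros(3), np.zeros(3)
for _ in range(200):
    nu = gamma * nu + eta * gradient(theta)
    theta -= nu
theta_momentum = theta

# NAG: evaluate the gradient at the looked-ahead point theta - gamma * nu
theta, nu = np.zeros(3), np.zeros(3)
for _ in range(200):
    nu = gamma * nu + eta * gradient(theta - gamma * nu)
    theta -= nu
theta_nag = theta

# Adagrad: per-parameter learning rate eta / sqrt(G + eps), where G holds
# the accumulated squared past gradients (the diagonal of G_t in the lecture).
theta, G = np.zeros(3), np.zeros(3)
for _ in range(2000):
    g = gradient(theta)
    G += g * g
    theta -= eta / np.sqrt(G + eps) * g
theta_adagrad = theta
```

Note how Adagrad's accumulator G only ever grows, so its effective learning rate shrinks monotonically; this is exactly the weakness that Adadelta, discussed next, tries to fix.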
So Adadelta is only a variation of Adagrad, and the difference is that it doesn't use the ever-growing matrix G, but continuously computes a decaying mean of the historic squared gradients. There are many, many other variations of gradient descent updaters, and I recommend you have a look at Sebastian's blog later, where everything is described really nicely. But the key take-home point here is that the gradient descent updater strategy is a very important knob you have to tune. And as usual, tuning a neural network is considered a bit of black magic, or trial and error, so you just have to try a couple of those.

So let's actually have a look at Sebastian's blog. The best thing is this funny animation here: you see the trajectories of different learning curves using different gradient descent updater strategies. It's pretty interesting, because it's a very important type of parameter you can tune. We have covered most of those, but what I wanted to show you are two little figures here. Those two are really interesting: you see different trajectories in optimization space using different gradient descent updaters. The red one here is stochastic gradient descent, and you see that in both problems it performs really poorly and even gets stuck. Maybe in the first image it won't get stuck, but it won't converge for ages; in the second, which is a saddle point, it even gets stuck. And you can see that, for example, Adadelta is performing best among all of those. Okay, I think that's enough for now. I hope that you now understood that this is an important type of parameter which you can tune, and that's basically it.
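As a sketch of the Adadelta idea, here is a minimal NumPy version, assuming the common formulation with two decaying averages, one of squared gradients and one of squared updates; ρ, ε, and the toy problem are illustrative assumptions, not from the lecture.

```python
import numpy as np

# Hedged Adadelta sketch: instead of Adagrad's ever-growing sum G_t, keep a
# decaying average E[g^2] of past squared gradients (and one of past squared
# updates), so the effective learning rate does not shrink toward zero.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta

def gradient(theta):
    return X.T @ (X @ theta - y) / len(y)

rho, eps = 0.9, 1e-6   # decay rate and division-by-zero guard (illustrative)
theta = np.zeros(3)
Eg2 = np.zeros(3)      # running average of squared gradients
Edx2 = np.zeros(3)     # running average of squared updates
for _ in range(5000):
    g = gradient(theta)
    Eg2 = rho * Eg2 + (1 - rho) * g * g
    # Update scaled by the ratio of the two running RMS values; note that no
    # global learning rate eta appears in this rule.
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g
    Edx2 = rho * Edx2 + (1 - rho) * dx * dx
    theta += dx
```

A design point worth noticing: because the step is a ratio of two RMS quantities, Adadelta needs no hand-tuned global learning rate, which is one reason it shows up favorably in comparisons like the ones in the blog figures.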