Now, let's talk about optimizers. In this video, we're going to discuss the different optimizers available to us when learning the appropriate weights for our given dataset and our neural net model. So far we've discussed different approaches to gradient descent that vary the number of data points involved in each of our gradient descent steps: a single data point in stochastic gradient descent, a subset of data points with mini-batch gradient descent, and the entire set with full batch gradient descent. Now, no matter which we use, they all share the same update formula for finding the optimal weights: the weight at the next iteration equals the prior weight minus the gradient times some learning rate alpha. But there are actually several variants of this weight-update step that can give us better performance, and these tweaks to the updating step will all be built as improvements on the original formulation we see here. These different methods of updating, or optimizing, the weights are called optimizers.

So let's start with the concept of momentum. With regular gradient descent, you'll generally move slowly toward your optimum, and you can change direction fairly frequently. With momentum, you smooth out this process, and you do so by taking something like a running average of the steps, smoothing out the variation between the individual steps of regular gradient descent. So if we look at our formula, we see that rather than just updating our weights with the gradient, we also look back at prior values to smooth out the steps: the value V at step t incorporates some amount of V at step t-1 (scaled by η), as well as the gradient at the current step (scaled by the learning rate α). With this in mind, the η value here denotes our momentum hyperparameter.
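For anyone following along in code, the momentum update just described can be sketched in a few lines of Python. The function name, the toy quadratic problem, and the specific hyperparameter values here are my own illustration, not from the slides:

```python
def momentum_step(w, v, grad, lr=0.1, eta=0.9):
    """One gradient descent step with momentum.

    v is the running "velocity" term: v_t = eta * v_{t-1} + lr * grad,
    so each step is a decaying blend of all prior steps.
    """
    v = eta * v + lr * grad   # incorporate prior steps via eta
    w = w - v                 # update weights with the smoothed step
    return w, v

# Toy example: minimize f(w) = w^2 (gradient is 2w), starting at w = 5.0.
w, v = 5.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, grad=2 * w)
```

Note how the step taken is `v`, not the raw gradient, which is exactly the smoothing described above.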
And the larger the value of that momentum hyperparameter, the more we smooth out our values; in other words, the more we incorporate past values into our running average. We'll generally use values less than 1, and a common choice here is 0.9. But again, if you want smoother steps, use a higher value; otherwise use a lower value. Also worth noting, if you do further reading on momentum on your own, that term η is often replaced by β. So β is the common nomenclature for that value, and the α that we're used to using as the learning rate is replaced by 1 - β. When we choose our η and α in practice, we may want to keep this relationship in mind: if we choose η equal to 0.9, you'll probably want to use an α around 0.1.

So just to show this in terms of a picture: with gradient descent, we take small steps that can fluctuate quite often. With momentum, we tend to smooth out those steps; the fluctuations aren't as dramatic, and the steps get much larger as momentum is gained. Also worth noting is that momentum can cause you to actually overshoot your optimal value, but the momentum will shrink at that point, and you should be able to come back to the optimum, as we see here in the picture. So the idea with Nesterov momentum, which builds off of the momentum we just learned, is that it looks to control this problem of overshooting, and it does so by looking one step ahead. So now, rather than taking the momentum and the gradient at the current step, we take the momentum and the gradient at the step with that momentum already accounted for.
So you see, rather than just taking the gradient of the cost function as we did before, we take the gradient of the cost function with η times V at t-1, the momentum from the prior step, already accounted for. And this works because, generally speaking, the momentum vector will be pointing in the right direction, so it will be a bit more accurate to use the gradient with the momentum accounted for than the gradient at the original position. If we think of standard momentum steps, we see that by using the past steps, we can take larger steps that are closer to the correct direction. And if we separate out just the momentum term in our last equation, this is the direction it actually takes; then, taking the gradient with the momentum accounted for as we do with Nesterov momentum, we get an extra correction in the right direction, and the Nesterov steps move even more smoothly toward our optimal value.

Now, let's move a bit away from this concept of momentum and talk about the AdaGrad optimizer, which is short for adaptive gradient algorithm. The idea here is to scale the update for each weight separately as we do our gradient descent and update our weights. So what will this do? It will update frequently updated weights a bit less. While updating, it keeps a running sum of the squares of each of the prior gradients, and any new update is scaled down by a factor based on that previous sum, so that the steps continuously decrease. So let's look at what this actually means. The key difference in AdaGrad compared to our normal gradient descent is this term G. This term G will continually increase: it starts at 0 and keeps adding squares of the derivative that we see here, and since squares are always positive, G continuously increases. Then, in order to update W, rather than just using the learning rate, we use the learning rate divided by the square root of this G value.
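Again for those following along in code, here's a minimal sketch of the AdaGrad update just described. The deliberately large base learning rate and the toy problem are illustrative choices of mine, not values from the video:

```python
def adagrad_step(w, G, grad, lr=1.0, eps=1e-8):
    """One AdaGrad step: accumulate squared gradients in G, then
    divide the learning rate by sqrt(G) so steps keep shrinking."""
    G = G + grad ** 2                       # running sum of squared gradients
    w = w - lr / (G ** 0.5 + eps) * grad    # scaled-down update
    return w, G

# Toy example: minimize f(w) = w^2 starting at w = 5.0.
# As G grows, each step gets smaller and smaller.
w, G = 5.0, 0.0
for _ in range(200):
    w, G = adagrad_step(w, G, grad=2 * w)
```

The small `eps` term is a standard guard against dividing by zero on the very first update, before G has accumulated anything.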
And since G is continuously increasing, we know that the effective learning rate will continuously decrease, and this will lead to smaller and smaller updates at each iteration. So as we get closer and closer to the optimal value, that learning rate will shrink, which will help us avoid that overshooting.

Now, I'd like to move on to another optimization method, namely RMSProp, which is short for root mean square propagation. We're working with very similar functionality to the AdaGrad that we just discussed, except that rather than just using the sum of our prior squared gradients, we're going to decay older gradients and give more weight to more recent gradients. And this is similar to the functionality that we used for momentum; now we're just using that weighting we discussed for momentum, except applied to the learning rates. This will allow updates to be more adaptive to recent gradients, and it's usually much more efficient than working with just AdaGrad.

And then finally, we have this optimizer Adam, which is short for adaptive moment estimation. Don't worry too much about what it's short for, but it will combine both the concept of momentum and this RMSProp that we just discussed, putting them both together. So on the left side here we have values similar to momentum: if you recall our discussion of momentum, we're just replacing our η with β1 and our α with 1 - β1, which can be used for the momentum in our past formula as well, as we discussed. Now, we didn't get into the math of RMSProp, but I did mention that it works similarly to the formula for momentum, which is what we see here on the left. On the right, for RMSProp, our Vt value, which serves as the RMSProp portion, is specific to our learning rate and has a very similar update that gives the most weight to the most recent values.
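Since we didn't write out the RMSProp math on a slide, here's a hedged sketch of the usual form of that update, with a decaying average of squared gradients replacing AdaGrad's ever-growing sum. The hyperparameter values are common conventions, not values quoted in the video:

```python
def rmsprop_step(w, s, grad, lr=0.05, beta=0.9, eps=1e-8):
    """One RMSProp step: s is a *decaying* average of squared
    gradients, so old gradients fade out instead of piling up
    forever as they do in AdaGrad's G."""
    s = beta * s + (1 - beta) * grad ** 2   # momentum-style weighting
    w = w - lr / (s ** 0.5 + eps) * grad    # adapt the learning rate
    return w, s

# Toy example: minimize f(w) = w^2 starting at w = 5.0.
w, s = 5.0, 0.0
for _ in range(500):
    w, s = rmsprop_step(w, s, grad=2 * w)
```

Compare this to the AdaGrad sketch: the only change is that `G = G + grad**2` became a weighted average, which is exactly the momentum-style weighting applied to the learning rate.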
Now, I'd like to note here, if you're trying to figure out how to default each of these values, β1 and β2: by default, β1 will be 0.9 and β2 will be 0.999. They generally do not need to be played around with too much, but you can play around with them if you find that you're not getting to the optimal model. Now, there's going to be a bit of bias built into each of these terms. So for mt, we correct that bias by dividing by 1 - β1 to the power t, and this correction matters more toward the beginning: as you can imagine, the larger t is, the smaller β1 to the t will be, and it continues to shrink as t grows. And then we do the same for Vt, which again is the RMSProp portion. And finally, we update our weights using our special learning rate, scaled by the Vt we just calculated, multiplied by our momentum term mt. And there we have it: our Adam optimizer, combining both RMSProp and this concept of momentum.

Now, which one should we choose among the optimizers available to us? RMSProp and Adam have become quite popular, and from 2012 to 2017, approximately 23% of deep learning papers submitted to this popular platform for research in deep learning mentioned using the Adam approach. It can be difficult, though, to predict in advance which of these approaches will work best for a particular problem, and this is actually still an active area of inquiry in deep learning research. I would say it's important to note that while Adam speeds up the optimization process tremendously and usually does a fairly good job of finding optimal solutions, there will be times when it has trouble converging. And there are actually even different versions of Adam that have been introduced recently.
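Putting the pieces together, here's an illustrative sketch of the full Adam update with the bias corrections just described. The variable names and toy problem are mine; the defaults follow the β1 = 0.9, β2 = 0.999 values mentioned above:

```python
def adam_step(w, m, v, t, grad, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: a momentum-style first moment (m) combined with
    an RMSProp-style second moment (v), each bias-corrected by
    dividing by 1 - beta**t (t is the 1-based step count)."""
    m = beta1 * m + (1 - beta1) * grad        # momentum portion
    v = beta2 * v + (1 - beta2) * grad ** 2   # RMSProp portion
    m_hat = m / (1 - beta1 ** t)              # bias correction for m
    v_hat = v / (1 - beta2 ** t)              # bias correction for v
    w = w - lr * m_hat / (v_hat ** 0.5 + eps) # scaled, smoothed update
    return w, m, v

# Toy example: minimize f(w) = w^2 starting at w = 5.0.
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 301):
    w, m, v = adam_step(w, m, v, t, grad=2 * w)
```

Notice that the bias corrections matter most when t is small (1 - beta**t is close to 0 there), and fade away as t grows, just as described above.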
With that, I would say, whether you're using different iterations of Adam or the other optimizers we just discussed that may speed up training: if you're still having trouble with convergence, I would note to at least try using just regular mini-batch gradient descent, or full batch, or stochastic gradient descent as well.

So, just to recap, in this section we went over why it's so important to have regularization with deep learning models, as these complex models are powerful enough to fit almost exactly to our training data. With that in mind, we went over different regularization techniques, such as what we've seen with ridge, adding a penalization term for higher weights within the cost function; as well as, as we see here in the next bullet, using dropout so that our models aren't over-relying on a particular pathway through the network; as well as early stopping, where we may be checking against a validation set as we train to prevent our overfitting. And finally, we discussed the different optimizers available to us beyond regular gradient descent, including using momentum, RMSProp, or combining the two using Adam. Now that closes out this set of videos. In the next set of videos, we'll review some of the extra pieces to keep in mind when building out our actual neural networks, and that will close out all we need to know to get started tuning our own neural networks in Python. All right, I look forward to seeing you there.