Let's now take a brief journey through the history of machine learning to see how it has evolved over time into the deep learning neural networks that are so popular today. You'll notice that despite neural networks coming in and out of style over the last several decades, tricks and techniques developed for other algorithms can be applied to deep learning neural networks, making them very powerful.

Linear regression was invented for predicting the movement of planets and the size of pea pods based on their parents. Sir Francis Galton pioneered the application of statistical methods to measurements of natural phenomena. He was looking at data on the relative sizes of parents and children in various species, including sweet peas, and he observed something that is not immediately obvious, something really strange. Sure, a larger than average parent tends to produce larger than average children. But how much larger is the child relative to the average of the other children in its generation? It turned out that this ratio for the child is less than the corresponding ratio for the parent. For example, if the parent's size is 1.5 standard deviations from the mean within its own generation, then you would predict the child's size will be less than 1.5 standard deviations from the mean within its cohort. We say that, generation by generation, things in nature regress, or go back, to the mean, hence the name linear regression. This chart here, from 1877, is the first-ever linear regression. Pretty cool. Compute power in the 1800s was somewhat limited, so they didn't even realize how well this would continue to work once we had large data sets. There is actually a closed form solution for solving linear regression, but gradient descent methods can also be used, each with their pros and cons depending on your data set. Let's look under the hood and go a little more into detail to understand how linear regression works.
We begin with a linear equation that we hypothesize describes our system: we multiply various weights with our observed feature values and then sum it all up. We can represent this with the equation y = w0x0 + w1x1 + w2x2, and so on, for each feature in our model. In other words, we apply this equation to every row in our data set, where the weight values are fixed and the feature values come from each associated column in our machine learning data set. This can be nicely packaged into the matrix equation y = Xw. This hypothesis equation is very important, not just for linear regression, but for other machine learning models, such as deep neural networks, which we will discuss later.

But how can I determine whether the weights I have chosen are making good or bad guesses? The answer is that we need to create a loss function, which is essentially just an objective function we want to optimize. As explained earlier, for regression problems the loss function is typically mean squared error, which in matrix form is (y - Xw)ᵀ(y - Xw). I dropped the constant factor since it will disappear later in the derivation. We are first finding the difference between the actual label values, y, and our predicted label values, ŷ, which is just Xw. Remember, though, that my goal is to reduce the loss as much as possible, so I need to find a way to minimize it with respect to the weights. To do this, I take the derivative with respect to the weights in the one-dimensional case, or, more generally, the gradient when I have multiple features. I can then use this to find the global minimum. Setting the gradient to zero, skipping the derivation, provides a closed form analytical solution for linear regression: w = (XᵀX)⁻¹Xᵀy. Meaning that, if you plug the X and y values into this formula, you will get the values for the weights. But this is not very practical; there are issues with the inverse.
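To make this concrete, here is a minimal sketch of the closed form solution in NumPy. The data set is synthetic and purely illustrative, and the column of ones is an assumed bias feature, not something from the lesson:

```python
import numpy as np

# Toy data set: 100 examples, a bias column of ones plus 2 features.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.01, size=100)  # labels with small noise

# Closed form solution: w = (X^T X)^-1 X^T y.
w = np.linalg.inv(X.T @ X) @ X.T @ y

# Mean squared error of the fitted model's predictions.
mse = np.mean((X @ w - y) ** 2)
```

On this small, well-conditioned example the recovered weights land very close to the true ones; the next section looks at why this explicit inverse breaks down on messier data.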
We are first assuming that the Gram matrix, XᵀX, is non-singular, meaning that all the columns of our feature matrix X are linearly independent. But in real world data sets, you do have duplicate or nearly duplicate data: the same customer buying the same product again, two photos of the same sunrise taken seconds apart. Even if the Gram matrix is technically non-singular, it could still be ill-conditioned, making it computationally singular and still causing problems for us. The inverse also has a time complexity of O(n³) using the naive algorithm, and it doesn't get much better using fancier algorithms, which come with numerical issues of their own. The same goes for the matrix multiplication needed to create the Gram matrix in the first place. We might instead solve the normal equations using something called a Cholesky or QR decomposition, but at O(n³), or even roughly O(n^2.5), when n is 10,000 or more the algorithm can be very slow. So yes, you can solve exactly for the weights using the normal equation, but how well that works is very dependent on your data, your model, which linear algebra matrix algorithms you are using, and so on.

Thankfully, there is the gradient descent optimization algorithm, which is, one, less expensive computationally in time and/or memory; two, more amenable to model generalization; and three, generic enough to work on most problems. In gradient descent, we have our loss function, or more generally our objective function, that is parameterized by the weights of our model. Within this space there are hills and valleys, just like the Earth has. However, in many machine learning problems, there will be many more dimensions than the 3D spatial world that we live in. Since this is gradient descent, minimization, and not gradient ascent, which instead would be maximization, we want to traverse the loss hypersurface searching for the global minimum.
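Here is a small sketch of the conditioning problem on made-up data: one feature column nearly duplicates another, so the Gram matrix is technically non-singular but badly ill-conditioned, and the normal equations are solved via a Cholesky factorization rather than an explicit inverse. The specific numbers are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
# Make the third column a near duplicate of the first, like two
# photos of the same sunrise taken seconds apart.
X[:, 2] = X[:, 0] + 1e-6 * rng.normal(size=50)
y = X @ np.array([1.0, 2.0, 3.0])

# The Gram matrix is invertible in theory, but its condition
# number is enormous, so inverting it directly is asking for trouble.
gram = X.T @ X
condition_number = np.linalg.cond(gram)

# Solve the normal equations (X^T X) w = X^T y via a Cholesky
# factorization: gram = L L^T, then two triangular solves.
L = np.linalg.cholesky(gram)
w = np.linalg.solve(L.T, np.linalg.solve(L, X.T @ y))
```

Even here the individual weights can wobble, because the near-duplicate columns can trade weight between themselves; the predictions Xw stay accurate, which is exactly the ill-conditioning story above.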
In other words, we hope to find the lowest valley regardless of where we start on the hypersurface. This can be done by finding the gradient of the loss function, multiplying it by a hyperparameter called the learning rate, and then subtracting that value from the current weights. This process iterates until convergence. Choosing the optimal learning rate and waiting through many iterations could push you to use the normal equation instead, assuming the number of features is small and there are no collinearity issues, or to add a gradient descent optimizer such as momentum, or to use a decaying learning rate. We'll talk much more about the details of gradient descent in the next module.

What is a hyperparameter that helps determine gradient descent's step size along the hypersurface, to hopefully speed up convergence? The correct answer is the learning rate. The learning rate, along with some other hyperparameters that you will learn about in future modules, helps decide the step size in gradient descent. If it is set too low, then gradient descent takes a very long time to converge. If it is set too high, gradient descent might even diverge and increase the loss more and more. The other three answers all have to do with collinearity and conditioning, which we don't have to worry about with gradient descent like we would using the normal equation.
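The update rule described above can be sketched as plain batch gradient descent on the MSE loss. The toy data, the learning rate of 0.1, and the iteration count are all illustrative choices, not values from the lesson:

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
true_w = np.array([1.0, 2.0, -3.0])
y = X @ true_w

learning_rate = 0.1  # too low: slow convergence; too high: divergence
w = np.zeros(3)      # arbitrary starting point on the loss surface
n = len(y)
for _ in range(1000):
    # Gradient of the MSE loss L(w) = mean((Xw - y)^2).
    gradient = (2 / n) * X.T @ (X @ w - y)
    # Step downhill: subtract learning rate times the gradient.
    w = w - learning_rate * gradient
```

Note there is no inverse and no Gram matrix here, which is why the collinearity and conditioning concerns from the normal equation largely disappear; the trade-off is tuning the learning rate and waiting for the loop to converge.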