So in this lecture, we're going to introduce one of the first supervised learning concepts we'll see, which is linear regression, and in doing so we'll also explain and try to understand the role of parameters in a model. So regression is one of the simplest supervised learning approaches; it helps us capture relationships between input variables, which we call features, and output variables, which are the predictions we'd like to make. So to motivate this problem and understand what those concepts mean, imagine we wanted to understand the relationship between height and weight, or equivalently, we wanted to come up with a predictor: can we predict a person's weight if we know their height? So that problem might look something like this. We're given some data which contains height and weight observations. Here we might have height on the x-axis, that's the variable we observe, and weight on the y-axis, that's the variable we're trying to predict. What we really have is essentially a scatter plot, a bunch of points representing height-weight observations. What we want from that is to come up with what's called a line of best fit: a line that explains the relationship between height and weight. Okay. So if we can find that line, we can then use it as a predictor. In other words, we can estimate a person's weight given their height. So how do we actually formulate this problem? How do we formulate the problem of finding a line given a bunch of observations in the scatter plot? Now, obviously, there's going to be some deviation of the real data from this line. So if we can't find a line that fits the data exactly, how do we approximate this problem? In other words, what is the best line? There are many different lines we could find that approximately fit the data. How do we characterize what is a good line versus a bad line, and then how do we find the best one? Okay.
So to recap some basic mathematics, how do we come up with a formula that describes this line in the first place? We have a line describing the relationship between height and weight here. In general terms, a line in two dimensions is given by the formula y = mx + b, where y is your y-axis here, or weight, and x is your x-axis, or height. m is then what's called a slope or gradient term, which measures how much the y value changes as the x value changes, and b is what's called an intercept. Essentially that measures what the value of y will be when x is equal to zero; in other words, where the line intercepts the y-axis. Okay. So that's a line in general terms. In our specific case, we would be saying that weight, the y-axis, is equal to m times height plus b. So m measures how much weight changes for each unit change in height, and b, the intercept, essentially measures what the weight would be if the height were zero. Of course that's impossible, but the model still assumes we can extrapolate these values right down to zero and right up to infinity. Okay. So that's just a line in two dimensions. Of course, we can do this more generally for as many dimensions as we'd like. Say we have a model of height, weight, and age, and our observations are actually triples of these three values, so we now have a 3D scatter plot, which is harder to draw. Well, we can still imagine fitting a line that describes the relationship between these points in three dimensions. Okay. So here we would say that weight is equal to m1 times height plus m2 times age plus b. We still have a single intercept, but we now have two slope or gradient parameters saying how weight changes as a function of height and how it changes as a function of age. And again, in more general terms, you can do this in any number of dimensions; we would say that y is equal to m1 times x1 plus m2 times x2 plus m3 times x3, and so on and so forth, plus b. Okay.
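To make that concrete, here's a minimal sketch of such a linear model in Python. The slope and intercept values here are made-up for illustration, not fitted to any data:

```python
# A linear model in two features: weight = m1*height + m2*age + b.
# The parameter values below are illustrative assumptions, not fitted ones.

def predict_weight(height_cm, age_years, m1=0.9, m2=0.3, b=-100.0):
    """Predicted weight (kg) as a linear function of height and age."""
    return m1 * height_cm + m2 * age_years + b

print(predict_weight(175, 30))  # 0.9*175 + 0.3*30 - 100 = 66.5
```

Changing m1, m2, or b changes the predictions; finding good values for these parameters is exactly the fitting problem discussed next.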
What will become more useful in a minute is to try and write out this expression by separating what we call features and parameters; in other words, knowns and unknowns. Rather than writing out weight equals m1 times height plus m2 times age plus b, we can instead write this as an inner product. So all I've done here is taken the known observations, height and age, and put them on the left-hand side, and taken the unknowns, the parameters, the gradients and the intercept term, and put them on the right-hand side. I also have this number one here which is going to be multiplied by b to correspond to the intercept term. And now all I've done is written this out as an inner product between some features and some parameters, which will become useful as I try and solve this equation in a minute. All right. So very generally speaking, no matter how many dimensions you have, we can say that y, the value we're trying to predict, can be written as an inner product between x, the features we observe, and Theta, the parameters we're using to fit that model, which includes the gradient and intercept terms. Okay. So we have y, what we're trying to predict, x, the features we're using to predict it, and Theta, the parameters that we can choose to make that prediction a better or worse approximation. Okay. But in practice, we have many different observations. We don't just have one value of y and one set of features x; we have many observations y and many different feature vectors x. We have the heights and weights of many different people. So we can try and rewrite this using a matrix. We would say the i-th person's weight is going to be a function of the i-th person's height. So now we have the predictions we want to make stored in each yi, and the features we're using to make each prediction stored in each row of a matrix X. Okay.
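This inner-product view can be sketched in NumPy as follows; the parameter values are again made-up for illustration:

```python
import numpy as np

# Features with a leading 1 for the intercept: x = [1, height, age].
# Parameters theta = [b, m1, m2] (illustrative values, not fitted).
x = np.array([1.0, 175.0, 30.0])
theta = np.array([-100.0, 0.9, 0.3])

# A single prediction is the inner product of features and parameters.
y = x @ theta
print(y)  # 66.5

# Stacking many observations as rows of a matrix X gives all
# predictions at once: the i-th prediction is X[i] . theta.
X = np.array([[1.0, 175.0, 30.0],
              [1.0, 160.0, 45.0]])
print(X @ theta)  # [66.5, 57.5]
```

The column of ones is the standard trick that folds the intercept b into the same inner product as the slopes.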
Writing that out fully, I would say all of the predictions I want to make, y, can be written as a matrix product of my entire feature matrix, one row for each observation, multiplied by these parameters. Okay. That's essentially the concept of linear regression: we're assuming a predictor of this form, where the outputs we're trying to predict, y, are given by the matrix product of our feature matrix X multiplied by our parameter vector Theta. You may also have seen this written as Ax equals b in some textbooks. Okay. So that's the general form of our predictor: X times Theta equals y. How do we go about actually solving this for Theta? In other words, we want to come up with a solution Theta that best describes this line of best fit. So to solve this for Theta, we need to do some matrix operations. Let's go ahead and try and do that. We have this equation X Theta equals y, and we'd like to solve it for Theta. The most obvious thing we might try is to say, okay, let's multiply both sides by the inverse of X: X inverse times X on the left-hand side and X inverse times y on the right-hand side. Seems like a reasonable idea; unfortunately it doesn't work, because there's no reason X needs to be a square matrix. We can have thousands and thousands of observations and any number of features for each observation. So its inverse is not going to be defined; the inverse is only defined for square matrices. A trick to get around that is to instead multiply both sides by X transpose. Okay. Now, X transpose times X is certainly going to be a square matrix, so we can go ahead and invert it. So let's multiply both sides by X transpose X inverse, and everything on the left-hand side will now cancel out, which gives us a solution for Theta. Okay. Here's just the same thing written out a bit more neatly: we find that Theta is equal to X transpose X inverse multiplied by X transpose y.
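The derivation above, Theta = (X^T X)^(-1) X^T y, can be sketched in NumPy on synthetic data. The true parameter values and noise level here are made-up so we can check that the formula recovers them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: weight = 0.9*height + 0.3*age - 100, plus noise.
# These "true" parameters are assumptions for this demonstration.
n = 200
height = rng.uniform(150, 190, n)
age = rng.uniform(20, 60, n)
y = 0.9 * height + 0.3 * age - 100 + rng.normal(0, 1.0, n)

# Feature matrix with a column of ones for the intercept.
X = np.column_stack([np.ones(n), height, age])

# Normal-equations solution: theta = (X^T X)^{-1} X^T y.
theta = np.linalg.inv(X.T @ X) @ X.T @ y
print(theta)  # approximately [-100, 0.9, 0.3]
```

Note that in practice you would usually call a least-squares solver such as `np.linalg.lstsq(X, y)` rather than forming the explicit inverse, which can be numerically unstable; the explicit form is shown here only to mirror the derivation.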
Later on, we'll see how we can solve this using library functions and how we can solve it manually using linear algebra, but essentially that's how the process looks. Okay. So just to summarize, in this lecture we've introduced the concept of regression, one of the simplest forms of supervised learning. In particular we looked at linear regression, which assumes there exists some line of best fit that explains the relationship between the features and the data. We expressed finding this line of best fit as solving a system of matrix equations, and we've seen a solution to that which later on we'll explore using various libraries. So on your own, I would suggest going ahead and trying to write down the same type of line-of-best-fit equation for a different problem, for example, trying to predict a person's income from various demographic features.