0:00

Hi, and welcome to the least squares estimation lecture, part of the regression course in the Data Science Specialization. My name is Brian Caffo, and the course is co-taught by myself, Jeff Leek, and Roger Peng, and run out of the Department of Biostatistics in the Johns Hopkins Bloomberg School of Public Health.

0:18

Consider again the scatter plot of the parents' heights by the children's heights from the Galton data. Remember, the size of each circle represents the frequency of that particular (x, y) combination.

0:47

So let's let Yi be the ith child's height and Xi be the ith parent's height. Now we want to find the best line, where we want the line to look like: the child's height is an intercept, which we'll label beta nought, plus the parent's height times a slope, which we'll label beta one. So beta nought and beta one are parameters that we would like to know but don't.

Well, we need a criterion for the term "best"; we need to figure out what we mean by the best line that fits the data. One such criterion is the famous least squares criterion. The basic gist is that we want to minimize the sum of the squared vertical distances between the data points, the children's heights, and the corresponding points on the fitted line. We can write this as the sum over i of Yi, the child's height, minus the quantity beta nought plus beta one Xi, all squared, where beta nought plus beta one Xi is where that particular parent's height would put the child on the fitted line. We'll go through this a lot and, hopefully, you'll get the hang of it. Then, at the end of the lecture, I'll show the math of how we arrive at the solution, for those who are interested.
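For readers following along in code, here is a minimal sketch of the criterion in Python with NumPy, using made-up heights rather than the actual Galton data; `np.polyfit` with degree 1 minimizes exactly this sum of squares:

```python
import numpy as np

# Hypothetical parent/child heights in inches (not the actual Galton data).
x = np.array([64.0, 66.0, 68.0, 70.0, 72.0])  # parents' heights (Xi)
y = np.array([65.0, 66.5, 67.0, 69.5, 71.0])  # children's heights (Yi)

def criterion(beta0, beta1):
    """Sum of squared vertical distances between the Yi and the line."""
    return np.sum((y - (beta0 + beta1 * x)) ** 2)

# polyfit with degree 1 performs a least squares fit, returning [slope, intercept].
beta1_hat, beta0_hat = np.polyfit(x, y, 1)

# Perturbing either coefficient away from the fit can only increase the criterion.
best = criterion(beta0_hat, beta1_hat)
print(best <= criterion(beta0_hat + 0.5, beta1_hat))  # True
print(best <= criterion(beta0_hat, beta1_hat + 0.1))  # True
```

Trying other candidate values for the intercept and slope in `criterion` is a good way to convince yourself the fitted line really is the minimizer.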

Let's talk about what our equations mean with a picture. Here's a dataset. What our least squares criterion for finding the best regression line is going to do is take each point. So, for example, take this point right here: that's the point (x1, y1), this might be the point (x2, y2), and this might be the point (xn, yn). It's going to take all these points and fit a line through the data.

2:43

Let's draw our line right here; it's going to fit a line through the data. The way it's going to pick this line is, for example, it's going to take our (x1, y1) point and calculate that distance, and that distance is y1, the height of the point, minus the height of the point on the line. So let's assume this line has intercept beta nought, the height at which it crosses the vertical axis, and slope beta one. That's the line we're interested in, so the point on the line right there is beta nought plus beta one x1. This distance, then, is the difference of the two. Because that difference can be either positive or negative (it would be negative in this case), we square it, and then we do that for every data point and add them up. So each data point contributes equally to the error between the line and the fitted points.
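To make the picture concrete, here is a small sketch with hypothetical numbers (an assumed line and point, not read off the actual figure) showing the vertical distance for one point and why squaring removes the sign:

```python
# Hypothetical line and point; in the figure, the residual happened to be negative.
beta0, beta1 = 20.0, 0.7   # assumed intercept and slope
x1, y1 = 68.0, 66.0        # the point (x1, y1)

fitted = beta0 + beta1 * x1     # the point on the line: beta0 + beta1 * x1
residual = y1 - fitted          # vertical distance; sign can be + or -
contribution = residual ** 2    # squaring makes every contribution nonnegative

print(round(fitted, 2))    # 67.6
print(round(residual, 2))  # -1.6 (negative, as in the picture)
print(contribution > 0)    # True
```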

4:07

So, just to reiterate, we want the line y equals beta nought plus beta one x, and we're going to fit it through a scatter plot of points (Xi, Yi), where Yi is the outcome. We put little hats over beta nought and beta one to indicate the estimated values.

4:46

So let's go through a couple of consequences of this result, the result being that beta one hat equals the correlation of Y and X times the standard deviation of Y divided by the standard deviation of X, and beta nought hat equals Y bar minus beta one hat times X bar. First of all, beta one hat has the units of Y divided by the units of X; those are its units. We can see this because the correlation is a unitless quantity, the standard deviation of Y has the units of Y, and the standard deviation of X has the units of X.

5:08

And because the slope of a line is the change in Y divided by the change in X, it has to have units of Y divided by units of X.
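A quick numerical check of the slope result, again using made-up data rather than the Galton data: the correlation times Sd(Y)/Sd(X) matches the slope from a direct least squares fit.

```python
import numpy as np

# Hypothetical data in inches; any units would do, the point is the formula.
x = np.array([64.0, 66.0, 68.0, 70.0, 72.0])
y = np.array([65.0, 66.5, 67.0, 69.5, 71.0])

cor = np.corrcoef(x, y)[0, 1]                             # unitless
beta1_hat = cor * np.std(y, ddof=1) / np.std(x, ddof=1)   # units of Y per unit of X

# Compare with the slope from an ordinary least squares fit.
slope, intercept = np.polyfit(x, y, 1)
print(np.isclose(beta1_hat, slope))  # True
```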

Another important point is that the line always passes through the point (X bar, Y bar). We can see that from the equation for the intercept: if we rearrange it, we simply get Y bar equals beta nought hat plus beta one hat times X bar. So, if we plug X bar into the equation for our fitted line, we get Y bar out, which means the line has to pass through the point (X bar, Y bar).
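A one-line numerical check, again with hypothetical data, that the fitted line evaluated at X bar returns Y bar:

```python
import numpy as np

# Hypothetical data; the property holds for any least squares fit with an intercept.
x = np.array([64.0, 66.0, 68.0, 70.0, 72.0])
y = np.array([65.0, 66.5, 67.0, 69.5, 71.0])

beta1_hat, beta0_hat = np.polyfit(x, y, 1)

# Plug the mean of X into the fitted line: we should get the mean of Y back.
print(np.isclose(beta0_hat + beta1_hat * x.mean(), y.mean()))  # True
```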

If we reverse the roles of Y and X and treat X as the outcome and Y as the predictor, then we simply get that the slope of this line is the correlation times the standard deviation of X divided by the standard deviation of Y.
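A sketch with hypothetical data: the reversed regression slope is Cor times Sd(X)/Sd(Y), and note that it is not simply the reciprocal of the original slope.

```python
import numpy as np

# Hypothetical data; the point is the formula, not the numbers.
x = np.array([64.0, 66.0, 68.0, 70.0, 72.0])
y = np.array([65.0, 66.5, 67.0, 69.5, 71.0])

cor = np.corrcoef(x, y)[0, 1]
slope_y_on_x = np.polyfit(x, y, 1)[0]   # Y as outcome: cor * sd(y) / sd(x)
slope_x_on_y = np.polyfit(y, x, 1)[0]   # X as outcome: cor * sd(x) / sd(y)

print(np.isclose(slope_x_on_y, cor * np.std(x, ddof=1) / np.std(y, ddof=1)))  # True
# Note: slope_x_on_y is not 1 / slope_y_on_x unless the correlation is exactly +-1.
```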

6:02

So you get a different answer, of course, when you fit X as the outcome and Y as the predictor than when you fit Y as the outcome and X as the predictor. The slope is the same, though, if you were to center the data first.

6:17

In other words, take each Xi and subtract off its average, and take each Yi and subtract off its average, so that the origin is now exactly the mean of the data. If you were then to do regression forcing the line through the origin, you would get the same slope.
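A sketch, again with hypothetical data: centering both variables and then regressing through the origin (for a no-intercept fit, the least squares slope is sum(x*y) / sum(x^2)) reproduces the ordinary slope.

```python
import numpy as np

# Hypothetical data.
x = np.array([64.0, 66.0, 68.0, 70.0, 72.0])
y = np.array([65.0, 66.5, 67.0, 69.5, 71.0])

slope_with_intercept = np.polyfit(x, y, 1)[0]

# Center both variables, then regress through the origin (no intercept term).
xc, yc = x - x.mean(), y - y.mean()
slope_through_origin = np.sum(xc * yc) / np.sum(xc ** 2)

print(np.isclose(slope_with_intercept, slope_through_origin))  # True
```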

6:35

I would also note that if you normalize the data (don't just center it, but center it and scale it), the slope is exactly the correlation, because the standard deviation of the X variable is one and the standard deviation of the Y variable is one. And so we can see, just from the equation for the slope, that it will simply be the correlation.
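And a final check with hypothetical data: after centering and scaling both variables, the fitted slope equals the correlation.

```python
import numpy as np

# Hypothetical data.
x = np.array([64.0, 66.0, 68.0, 70.0, 72.0])
y = np.array([65.0, 66.5, 67.0, 69.5, 71.0])

# Normalize: center and scale so each variable has standard deviation one.
xn = (x - x.mean()) / np.std(x, ddof=1)
yn = (y - y.mean()) / np.std(y, ddof=1)

slope_normalized = np.polyfit(xn, yn, 1)[0]
print(np.isclose(slope_normalized, np.corrcoef(x, y)[0, 1]))  # True
```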
