Hi, and welcome to the lecture on some basic notation and background as part of the regression class in the Coursera Data Science Specialization. My name is Brian Caffo. I'm a professor in the Department of Biostatistics at the Bloomberg School of Public Health here at Johns Hopkins. The class is taught with my co-instructors and colleagues here, Jeff Leek and Roger Peng.

In this lecture, we're just going to cover some basic definitions. These are things you probably already saw in Statistical Inference, the prerequisite for this class and part of the Coursera Data Science Specialization. However, because they're so fundamental to regression, we're going to cover them again anyway, so they're fresh in our minds. I'm going to try to minimize the amount of mathematics required for this class. However, for those who are interested, I'm going to cover some of the mathematics, with the appropriate caveats before going through it. I'll also put some more advanced lectures on YouTube and some other outlets for people who are interested in a more mathematical treatment. The lectures will require neither calculus nor linear algebra, and when things do get a little more mathematical, I'll let you know when you can skip over those sections, and I'll try to be clear about what's needed for the evaluations.

Okay, so let's go over some of the important notation relevant for this class. The biggest bit of notation you're going to have to get used to is indexing a variable with subscripts to denote an ordered collection. So, we might write X1, X2, up to Xn to describe n data points. As an example, consider the data set 1, 2, 5. Then X1 is 1, X2 is 2, X3 is 5, and n in this case is 3. There's nothing special about the letter X; we could just as easily have written Y1 to Yn. Another important convention is that we typically use Greek letters for population quantities we don't know, such as mu for a population mean, and regular Roman letters for things we can observe. So, x-bar is something we can observe; mu is something we can't observe and would like to estimate.

So, let's make use of our new notation and find the sample mean of a data set. Well, we all know what the sample mean is: we take our data, add them up, and divide by the number of observations. We usually denote a sample mean by a letter with a bar over it, so x-bar is the sample mean of my collection of X's. That's just 1/n times the sum of the X's. Notationally, we write this with the summation symbol as x-bar = (1/n) * sum_{i=1}^n Xi, which is simply notation for saying, add up all the Xi's and divide by n.

Notice that if we subtract the mean from every data point, we wind up with a new data set of n observations, and this new data set has mean 0. So notationally, if I define X-tilde_i = Xi - x-bar, then the new data set, the X-tilde_i's, has mean 0. This process is called centering the random variables. You can check it empirically by generating a vector, subtracting the mean from each element, and taking the mean of the result; you'll see that it's always 0, as in the first sketch below.

Also, recall from the previous lecture that the sample mean is the least squares solution: if we want to find the value of mu that minimizes sum_{i=1}^n (Xi - mu)^2, that mu works out to be x-bar.
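Here's a minimal R sketch of that centering check, using the small example data set from above (the variable names are just for illustration):

```r
# Centering: subtracting the sample mean leaves a data set with mean 0
x <- c(1, 2, 5)          # the example data set from above
x_tilde <- x - mean(x)   # X-tilde_i = Xi - x-bar
mean(x_tilde)            # 0, up to floating point error
```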
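And here's one way to see the least squares fact empirically. This is just a sketch; the grid of candidate mu values is an arbitrary choice for illustration:

```r
# The sample mean minimizes the sum of squared deviations
x <- c(1, 2, 5)
mu_grid <- seq(0, 5, by = 0.001)                      # candidate values of mu
sse <- sapply(mu_grid, function(mu) sum((x - mu)^2))  # sum of squares at each mu
mu_grid[which.min(sse)]                               # about 2.667, i.e. mean(x)
```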
Since we've talked about means, let's talk about variances. The sample variance is usually denoted by S squared, and the formula is S^2 = (1/(n-1)) * sum_{i=1}^n (Xi - x-bar)^2. This is basically the average squared deviation of the observations around the mean, with n-1 rather than n in the denominator. There's also a shortcut formula, S^2 = (sum_{i=1}^n Xi^2 - n * x-bar^2) / (n-1). The empirical standard deviation S is simply the square root of the variance. It's nice to work with standard deviations because the variance is expressed in whatever units X has, squared, whereas the standard deviation is expressed in the ordinary units of X.

Another interesting fact, not unlike centering, is scaling. If we subtract the mean from every observation, the resulting data set has mean 0; if we divide every observation by the standard deviation, the resulting data set has standard deviation 1. This is called scaling the data. If we take our original data, subtract off x-bar, and then divide the resulting centered data by S, we get a new data set, Zi = (Xi - x-bar) / S, with empirical mean 0 and empirical standard deviation 1. This process of centering and then scaling is called normalizing the data. So, normalized data are centered at 0 and have units equal to standard deviations away from the mean. As an example, if a normalized data point has the value 2, the original observation was 2 standard deviations larger than the mean. As its name suggests, normalization is an attempt to make otherwise non-comparable data sets comparable.

Now, let's talk about the most central quantity in regression, the empirical covariance, and really the correlation, which is maybe a little easier to think about. So, imagine I have two vectors, X and Y, and they're lined up: X1 might be the BMI for subject 1 and Y1 the blood pressure for subject 1, X2 the BMI for subject 2 and Y2 the blood pressure for subject 2, and so on, so you could meaningfully draw a scatter plot. Then we define the covariance between X and Y as Cov(X, Y) = (1/(n-1)) * sum_{i=1}^n (Xi - x-bar)(Yi - y-bar), the X deviations around their mean times the Y deviations around their mean. There's a shortcut formula here too: Cov(X, Y) = (sum_{i=1}^n Xi * Yi - n * x-bar * y-bar) / (n-1). The correlation is simply the covariance standardized into a unitless quantity. The covariance of X and Y has units of X times units of Y, so we divide by the standard deviation of X and the standard deviation of Y, Cor(X, Y) = Cov(X, Y) / (Sx * Sy), and get a unit-free quantity.

We'll end our whirlwind tour of notation with some basic facts about correlation. The correlation has to be between -1 and +1, and it only achieves these bounds when the X and Y observations fall perfectly on a sloped line: a positively sloped line for a correlation of +1 and a negatively sloped line for a correlation of -1. The correlation measures the strength of the linear relationship between X and Y, and we'll see that it's central in linear regression. The closer the correlation is to the extremes, -1 or +1, the stronger the estimated linear relationship, and a correlation of 0 implies no linear relationship.
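Going back to scaling and normalizing for a moment, here's a minimal R sketch of the centering-then-scaling process; the rnorm parameters are arbitrary, just to generate some example data:

```r
# Normalizing: center the data, then scale it
x <- rnorm(100, mean = 10, sd = 4)  # arbitrary simulated data
z <- (x - mean(x)) / sd(x)          # Zi = (Xi - x-bar) / S
mean(z)                             # 0, up to floating point error
sd(z)                               # 1
```

R's built-in scale() function performs the same centering and scaling in one step.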
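Finally, here's a sketch checking the correlation facts; the simulated relationship between x and y is made up purely for illustration:

```r
# Correlation: unit free, bounded by -1 and +1, measures linear association
x <- rnorm(100)
y <- 2 * x + rnorm(100, sd = 0.5)  # y linearly related to x, plus noise
cov(x, y) / (sd(x) * sd(y))        # the definition; identical to cor(x, y)
cor(x, y)                          # close to +1: strong positive linear relationship
cor(x, 3 * x + 2)                  # exactly 1: points fall on a positively sloped line
cor(x, -x)                         # exactly -1: a perfectly negatively sloped line
```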