I'm going to start this lesson with an example that, as you can probably tell, doesn't make sense. I'm doing it to make a point, so stay with me. Let's say a student thinks her test score is influenced by the temperature outside. She collects some data, and since she has just learned about regression, she develops a regression equation and is happy that she can now predict her test scores in advance. Like I said, you already know this will not work. Here is the scatter plot of the data she has; it shows no visible pattern. However, if she runs the regression in Excel, she will get the following output. Focusing on the part of the table that gives the slope and the intercept, the regression equation is y-hat = 47.896 - 0.0342x, where x is the temperature outside. The weather forecast says tomorrow's high will be 53 degrees, so the student can expect a test score of about 46.

How can I convince her that she has a bad model on her hands and that it should not be used? That is the point I want you to take away from this lesson. In real life there are instances where the analysis is worthless, but they are just not as easy to detect as my overly simple and ridiculous example here. First, you should know that no matter what data you feed into a regression analysis, Excel will always find the line that fits those observations using the least squares method. In this case, the line will be this one. But just because we have an equation for a line doesn't mean we have found something meaningful. So in general, how do we tell a poor model from a good one?

We now turn to the question of how useful a particular regression model is. After all, a model that explains 90% of the variation is more useful than one that explains only 10%. One measure of the usefulness of a regression model is the simple coefficient of determination, represented by R-square.
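To see that least squares will always hand you a line, and to check the prediction arithmetic, here is a minimal Python sketch. The lesson itself uses Excel; the data points below are hypothetical, and only the coefficients 47.896 and -0.0342 come from the output shown in the lecture.

```python
import numpy as np

# Hypothetical temperature/test-score pairs; the student's actual
# dataset is not shown in the lesson, so these are only illustrative.
temps  = np.array([40.0, 55.0, 62.0, 48.0, 70.0, 53.0, 66.0, 45.0])
scores = np.array([46.0, 49.0, 44.0, 51.0, 47.0, 50.0, 45.0, 48.0])

# Least squares always returns a slope and an intercept,
# no matter how unrelated the two variables are.
slope, intercept = np.polyfit(temps, scores, 1)

# Using the coefficients from the lecture's Excel output instead:
#   y-hat = 47.896 - 0.0342 * x
predicted = 47.896 - 0.0342 * 53   # forecast high of 53 degrees
print(round(predicted, 2))         # about 46.08
```

The point is that `polyfit` succeeds even on patternless data; getting a line back tells you nothing about whether the line means anything.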
Most of the time it is just called R-square, rather than by its proper name, the simple coefficient of determination. It is important to know how R-square is calculated so that its meaning becomes clear. The variation we observe when performing a regression comes from two sources. One is the regression variable we have identified; we call this the explained variation. The other source, which is unexplained, is called error, or the residuals, and this is everything else that causes variation. So think about it this way: total variation is the sum of these two sources. If we have identified a variable with strong predictive ability, the majority of the variation we see in the response variable should be due to that regression variable. But if we have identified a variable that has very little influence on the response variable, then the majority of the variation we observe in the response variable will be due to other sources, what we call unexplained sources.

Here is the notation for these variations. You see SS in all of them; SS stands for sum of squares. SST, the total sum of squares, is the sum of the squared differences between each observed y value and the average y value. SSR, the sum of squares due to regression, is the variation that is explained by the regression model. And SSE, the sum of squares due to error, is the variation that is unexplained by the regression model we have. R-square, the simple coefficient of determination, is the ratio of the explained variation to the total variation: R-square = SSR / SST. Being a ratio of a part to the whole, R-square is always a value between 0 and 1. The closer R-square is to 1, the stronger the regression model. A regression model with an R-square of 90% means that 90% of the variation in the response variable y can be explained by the independent variable x.
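These definitions are easy to verify numerically. Below is a minimal Python sketch with a small made-up dataset (none of these numbers come from the lesson's Excel output); it computes SST, SSR, and SSE directly from their definitions and checks that the total is the sum of the two sources.

```python
import numpy as np

# Small illustrative dataset with a strong linear pattern.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept
y_bar = y.mean()

sst = np.sum((y - y_bar) ** 2)      # total variation
ssr = np.sum((y_hat - y_bar) ** 2)  # explained by the regression
sse = np.sum((y - y_hat) ** 2)      # unexplained (error/residuals)

r_squared = ssr / sst
print(round(sst, 4), round(ssr + sse, 4))  # SST equals SSR + SSE
print(round(r_squared, 4))                 # close to 1 for this data
```

Because the made-up data lie almost exactly on a line, R-square comes out very close to 1, which is what "a strong model" looks like in this measure.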
Let's look back at the Excel output from our example of the effect of temperature on test score. Focus on the highlighted field in the top table; that is the R-square for our example. It is 0.0006, extremely small and close to zero. This alone is a good reason to throw this model away.

Now let's practice. This is the output for the example we used in the previous lesson: the effect of population on annual sales. What is your opinion of this regression model? How much of the difference in annual sales between stores is explained by the size of the population living near the stores, and how much by other things? R-square here is 0.7131, so about 71% of the variation in annual sales can be explained by the population living near the store's location. This is a very good model. However, there are other sources of variation, and collectively they explain the remaining 29% or so.

You often hear, and may even use, the word correlation. In linear regression, the simple correlation coefficient measures the strength of the linear relationship between y and x. It is denoted by r, and it is calculated by taking the square root of R-square. The correlation r is positive if the slope is positive, which means x has a positive coefficient, and we say x and y are positively correlated. If the coefficient of x is negative, then we say they are negatively correlated. Correlation can take on any value between negative one and one; the closer it is to one in magnitude, the stronger the relationship. In Excel, Multiple R shows this correlation, and it appears in the top table. In our analysis of the effect of temperature on test scores, not surprisingly, r is very close to zero: about 0.026, which is, allowing for rounding, the square root of our R-square of 0.0006. The sign comes from the coefficient of temperature, which in this case is negative. The output of the regression model for the effect of population on annual sales is shown here.
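The relationship between r and R-square can be sketched in Python. The data below are made up to show a clear negative trend; the point is only that r carries the sign of the slope and that squaring r gives the R-square value Excel would report.

```python
import numpy as np

# Illustrative data with an obvious downward trend (not from the lesson).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([9.8, 8.1, 5.9, 4.2, 2.0])

slope, intercept = np.polyfit(x, y, 1)
r = np.corrcoef(x, y)[0, 1]   # simple correlation coefficient

# r is negative here, matching the negative slope,
# and r**2 is the R-square from the regression.
print(r < 0, slope < 0)
print(round(r ** 2, 4))
```

Taking the square root of R-square recovers the magnitude of r, but not its sign; the sign has to come from the slope, which is exactly the rule stated above.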
What are the direction and magnitude of the correlation between these two variables? The two variables are positively correlated, because the coefficient for population is positive, and the magnitude is 0.844, suggesting a fairly strong positive correlation between population and annual sales at a store.

So we started this lesson with an obvious example, picking two variables that had no relationship, so I could explain how to assess whether we have a good model. Now you know that if R-square is small and the variables are not correlated, the model is useless. But we can find the answer to this question in other parts of the output as well. Look closely at everything displayed in the third table, especially the values for the independent variable, Temp. When you run a regression analysis, you are building on the topic of hypothesis testing. The null hypothesis in regression is that the independent variable you have identified has no linear relationship with the response; in other words, that its true coefficient is zero. So if you get a small p-value, smaller than your significance level alpha, you reject that null hypothesis. In the case of temperature and test score, the p-value for Temp is quite large, 0.9361, so we do not reject the hypothesis that temperature and test scores are uncorrelated.

You also see that Excel provides a confidence interval, 95% in this case, for the coefficient of temperature; focus on that row. The interval runs from -0.964 to 0.895. Zero is a value within that interval, so zero could really be the coefficient for temperature. And if you put zero in as the coefficient of temperature, you eliminate this variable from the equation altogether. So now you see that we can tell this was a bad model through many different measures.

The last part of the output, which I have not mentioned, is the table in the middle, for ANOVA. ANOVA stands for analysis of variance, and it is not within the scope of this class, so I will not be using this part of the output.
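The p-value and confidence-interval checks can be sketched in Python with SciPy. The tiny dataset below is made up and deliberately constructed so the fitted slope is essentially zero; the lesson's real data and Excel numbers are not reproduced here.

```python
import numpy as np
from scipy import stats

# Made-up "temperature vs. test score" data with no linear pattern.
temp  = np.array([45.0, 50.0, 55.0, 60.0, 65.0])
score = np.array([48.0, 46.0, 49.0, 46.0, 48.0])

res = stats.linregress(temp, score)

# Null hypothesis: the true coefficient of temperature is zero.
# A large p-value means we fail to reject it.
print(res.pvalue > 0.05)

# 95% confidence interval for the slope: estimate +/- t * standard error.
t_crit = stats.t.ppf(0.975, df=len(temp) - 2)
lo = res.slope - t_crit * res.stderr
hi = res.slope + t_crit * res.stderr
print(lo < 0 < hi)   # zero sits inside the interval
```

Both checks point the same way as in the lecture's Excel output: a large p-value and an interval containing zero both say the variable could be dropped from the model.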
However, I just want you to focus on one part of it, and then we will move on. We learned that the total variation is the sum of two sources: the regression, and the other sources we collectively call error. Here I have taken a small part of the Excel output to show you something from the ANOVA section. Focus on the SS column; these values correspond to the notations we saw earlier. Using the formula for R-square and taking the ratio of SSR to SST, you get the same value that you see in the very first table for R-square. Now that you understand how to build the model and then assess its significance, we can move on to practicing this and learning how to build effective predictive models.
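For the temperature example, the link between the ANOVA table's SS column and the R-square in the top table can be shown with a tiny sketch. The SS values below are made up; only the resulting ratio, 0.0006, matches the lecture's Excel output.

```python
# Hypothetical SS values for the temperature example.
ssr = 0.06        # SS for the Regression row (explained variation)
sse = 99.94       # SS for the Residual/Error row (unexplained)
sst = ssr + sse   # SS for the Total row

# The same R-square that Excel reports in the top table.
r_squared = ssr / sst
print(round(r_squared, 6))   # 0.0006
```

Reading R-square straight off the ANOVA SS column is a quick sanity check: the Regression SS divided by the Total SS must match the R-square in the summary table.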