We're going to wrap up this unit on introduction to linear regression with a discussion of variability partitioning. So far, within the framework of regression, we've used a t-test to evaluate the strength of evidence in a hypothesis test for the slope of the relationship between x and y. Alternatively, we can consider the variability in y explained by x, compared to the unexplained variability. Remember that the percentage of variability in y explained by x was our R-squared. Remember also that we like large R-squareds, so we wonder: could we use that notion to do this hypothesis test from another point of view? This idea of partitioning the variability in y into explained and unexplained variability should not be new to you; remember that we saw it when we first discussed analysis of variance, or ANOVA. We can actually get an ANOVA-type output for our regression model as well, so this type of output should be familiar to you from before. Let's go through it one more time and see how these numbers relate to what we know about ANOVA from before, where we used it to compare the means of multiple groups to each other. One of the important columns here is the sum of squares, where we have data on the total variability. Total variability in y is the sum of squares total, and this looks very much like the variance of y, except that we don't divide by the sample size minus one. What we mean by unexplained variability in y within the context of regression is the sum of squares of the residuals. So imagine that you have the residual for every single data point in your data set, and you square those and add them up. Then the explained variability simply becomes the difference between these two numbers, because remember that the explained plus the unexplained variability gets you to the total variability. Next, let's look at the degrees of freedom column.
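Before moving to that column, here is a minimal sketch of the sum of squares partition just described. The data are made-up toy values, not the twin data, and the variable names are chosen only for illustration.

```python
# Hypothetical toy data (NOT the twin data from the lecture).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept for the line y = intercept + slope * x.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residual = observed y minus the value predicted by the line.
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Total variability: squared deviations of y from its mean, added up.
ss_total = sum((yi - y_bar) ** 2 for yi in y)

# Unexplained variability: squared residuals, added up.
ss_resid = sum(e ** 2 for e in residuals)

# Explained variability: the balance of the two.
ss_reg = ss_total - ss_resid

print(ss_total, ss_resid, ss_reg)
```

With these toy values the total sum of squares is 6, the residual sum of squares is roughly 2.4, and the regression sum of squares is the balance, roughly 3.6.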
The total degrees of freedom is simply your sample size minus one. These are the twin data that we're still working with, so 27 twins minus 1 gives us 26. Next we consider the degrees of freedom associated with the regression. Since we only have one predictor here, this degrees of freedom is simply going to be 1. Then the residual degrees of freedom is the difference between these two: 26 minus 1, which is 25. In the next step, we want a measure of the average variability, which we call the mean square, and remember that to get to a mean square we take the sum of squares and divide it by the associated degrees of freedom. So the mean square regression is the sum of squares regression divided by the degrees of freedom of the regression, and the mean square of the residuals is the sum of squares of the residuals divided by the degrees of freedom associated with the residuals. Finally we get to the F statistic. Remember that the F statistic is the ratio of explained to unexplained variability. In this case, that's going to be the mean square of regression divided by the mean square of residuals. Now that we've had a refresher on the ANOVA table, we can move on to doing our hypothesis test. Remember that our goal was to see whether the explanatory variable is a significant predictor of the response variable. We had set our hypotheses as the slope equals 0 for the null hypothesis, and the slope does not equal 0 for the alternative hypothesis. We have a pretty small p-value, meaning we would reject the null hypothesis, and in this case rejecting the null hypothesis means that the data provide convincing evidence that the slope is significantly different from 0. In other words, the explanatory variable is a significant predictor of the response variable. Now that we've been talking about the variability in the response variable and partitioning this variability into explained and unexplained variability, let's revisit this notion of R-squared one more time.
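As a rough sketch of how the rest of the ANOVA table is filled in, the code below computes the degrees of freedom, the mean squares, and the F statistic. It uses made-up toy data rather than the twin data, and every variable name is an assumption for illustration. It also checks a standard fact for regression with a single predictor: the F statistic equals the square of the t statistic for the slope, which is why the two tests agree. The p-value itself is the upper tail area of an F distribution with 1 and n minus 2 degrees of freedom; that step is left out here because it needs a statistics library.

```python
import math

# Hypothetical toy data (NOT the twin data from the lecture).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares fit.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Sum of squares column.
ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_resid = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
ss_reg = ss_total - ss_resid

# Degrees of freedom column.
df_total = n - 1               # sample size minus one
df_reg = 1                     # one predictor
df_resid = df_total - df_reg   # the balance

# Mean squares: sum of squares divided by its degrees of freedom.
ms_reg = ss_reg / df_reg
ms_resid = ss_resid / df_resid

# F statistic: explained relative to unexplained variability.
f_stat = ms_reg / ms_resid

# t statistic for the slope, for comparison with F.
se_slope = math.sqrt(ms_resid / sxx)
t_stat = slope / se_slope

print(df_total, df_reg, df_resid)
print(f_stat, t_stat ** 2)
```

With five toy observations the degrees of freedom are 4, 1, and 3; with the 27 twin pairs the same arithmetic gives 26, 1, and 25.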
Remember that R-squared is the proportion of variability in y explained by the model. If this value is large, we say that there's likely a linear relationship between x and y. If the value, on the other hand, is small, we say that the evidence provided by the data may not be very convincing. There are actually two ways to calculate R-squared. We've already seen one of them, using the correlation coefficient: we simply take the square of the correlation coefficient. The other comes directly from the definition of R-squared: we can calculate it as the proportion of explained to total variability. Now that we've seen the ANOVA table, and we know the measures of total variability and explained variability, taking the ratio should be a simple task. So let's go ahead and quickly check whether these two methods actually yield the same result. If we square the correlation coefficient, we get an R-squared of roughly 78%. To do the calculation using the definition of R-squared, we need to take the ratio of explained variability to total variability. Remember that this means the sum of squares of regression divided by the sum of squares total. Doing the math, we get the same result: 78% of the variability in foster twins' IQs can be explained by the model, or in other words, by the biological twins' IQs.
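The agreement between the two methods can be checked with a short sketch. The data below are made-up toy values, not the twin IQ scores, so the result will not be 78%; the point is only that both routes land on the same number. All names are assumptions for illustration.

```python
import math

# Hypothetical toy data (NOT the twin data from the lecture).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)  # same as the sum of squares total

# Method 1: square the correlation coefficient.
r = sxy / math.sqrt(sxx * syy)
r_squared_from_r = r ** 2

# Method 2: explained variability divided by total variability.
slope = sxy / sxx
intercept = y_bar - slope * x_bar
ss_resid = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
ss_reg = syy - ss_resid
r_squared_from_ss = ss_reg / syy

print(r_squared_from_r, r_squared_from_ss)  # both approximately 0.6
```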