Next, we introduce a new measure, adjusted R squared. We're going to talk about how to calculate this value, as well as what it means and how to use it. As an example in this video, we're going to use data from the US states on poverty; remember, we had data from 50 states plus the District of Columbia. The variables are the percentage living in poverty in each state, the percentage of residents living in a metropolitan area, the percentage white, the percentage of high school graduates, and the percentage of female householders.

Here we can see a bunch of scatter plots and a bunch of numbers, so let's first pause for a moment and check what is going on. In the first plot, for example, the y axis is the percentage living in poverty and the x axis is the percentage of metropolitan residents. The correlation between these two variables is -0.20. Another plot we can take a look at is at the intersection of metropolitan residents and female householders. Here the y axis is the percentage of residents living in a metropolitan area and the x axis is the percentage of female householders, and the correlation coefficient between these two variables, shown on the opposite side of the matrix, is 0.30. So we can see scatter plots between each pair of variables in our dataset, as well as the correlation coefficients on the lower half of the matrix. The font sizes of the correlation coefficients vary with their magnitude: pairs that are highly correlated, either negatively or positively, are printed in larger font sizes, and those that are not highly correlated are printed in smaller font sizes. We call such plots pairwise scatter plots, and they're very useful for an initial exploratory analysis of our data, especially if all of the variables involved are numerical.

We're going to start with a simple linear regression for this dataset, where we have only one predictor. The first step is to load the dataset, so if you would like to follow along, you can do so by loading the dataset at this address. Next, we fit our model. We're going to call our model poverty simple linear regression, or pov_slr; you can call it whatever you want. Remember, we're using the linear model function, lm. The first argument is our response variable, then a tilde, which we read as "versus," then the explanatory variable, and finally the data argument, which points to the states data file that we loaded earlier. Taking a look at the summary output for this model, we can see the estimates for the intercept and the slope, as well as many other statistics that might be useful in evaluating the model.

On this slide, we can more easily see the scatter plot for the relationship between percentage living in poverty and percentage of female householders. The correlation coefficient between these variables is 0.53, and R squared, the percentage of variability in poverty explained by female householder, is simply the square of that number, or 28%. We also have our regression output cleaned up a little bit and rounded, with some of the values we don't need for the time being removed, so we can simply see the estimates, the standard errors of these estimates, the t scores, and the p values. With a small p value, we can see that female householder is indeed a significant predictor of percentage living in poverty. We also mentioned that with linear models we can take a look at an ANOVA output, which allows us to partition the variability in our response variable.
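If you'd like to follow along, here is a minimal sketch of the steps just described. The URL is a placeholder (the actual address was shown on the slide, not spoken in the video), and the data frame and variable names (states, poverty, female_house) are assumptions based on the description, not confirmed by the transcript.

```r
# Hypothetical address and names; the real ones appeared on the slide.
load(url("http://www.example.com/states.RData"))

# Pairwise scatter plots for an initial exploratory look at the data
pairs(states)

# Simple linear regression: % living in poverty vs. % female householder
pov_slr <- lm(poverty ~ female_house, data = states)
summary(pov_slr)   # estimates, standard errors, t scores, p values

# ANOVA output: partition the variability in the response variable
anova(pov_slr)
```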
The total measure of the variability in the response variable is the sum of squares total, 480.25. Remember, this is very similar to the variance of that variable, except not scaled by the sample size. We can also see how much of this variability can be attributed to our explanatory variable, percentage of female householders, versus how much of it is left unexplained by the model; that's the sum of squares error, or we can think about it as the variability that's still left over in the residuals. Using the definition of R squared, we can also confirm that this is simply the ratio of the explained variability to the total variability. Remember, the explained variability is the sum of squares of the regression, 132.57, and the total variability is the sum of squares total, 480.25. And we actually get the same value for R squared as expected, 28%.

Now that we have our baseline model, we can add another variable to it; let's start with percentage white. All we need to do in R is use the same linear model function and add white as an additional predictor to our model. Then we can take a look at the summary output for this model, as well as the ANOVA output, and for that we use the function anova, wrapped around the regression model that we specified earlier. The result is very similar to what we saw before, except with an additional line in both of our tables for the new variable that we've added as a predictor. Note that the total variability, the sum of squares total, has not changed, because this is the inherent variability in our response variable, percentage living in poverty. So regardless of how many variables you're using in your model, the total variability should not change. However, what has changed is how this variability is being partitioned: in this case, part of it can be attributed to female householder, and a much smaller part of it can now be attributed to percentage white.

If we wanted to calculate R squared based on this output, keeping in mind that R squared is the percentage of variability in the response variable that is explained by the model, and that our model is now comprised of two explanatory variables, we would calculate it as 132.57 + 8.21, the total explained variability in the model, divided by 480.25, the total variability in our response variable, which comes out to be roughly 29%. So adding another variable to our model now explains one more percent of the variability in our response variable: R squared used to be 0.28, and now it's 0.29.

The R squared value is going to go up each time you add a new predictor to your model. However, we need a more honest measure of whether the added variable is actually a useful one, and for that we introduce adjusted R squared. This measure applies a penalty to R squared for the number of predictors included in the model, and the magnitude of this penalty depends on how k, the number of predictors, compares to n, our sample size. The larger the sample size, the more predictors the model can handle, and therefore the smaller the penalty for additional predictors being added to the model. While R squared always increases with the addition of each variable to the model, regardless of whether that variable is useful or not, adjusted R squared is only going to increase if the added variable is actually of value.
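Before moving on, here is a sketch in R of the two-predictor fit described above, under the same assumed data frame and variable names as before; the sums of squares are the rounded values read off the ANOVA table.

```r
# Add percentage white as a second predictor (assumed variable names)
pov_mlr <- lm(poverty ~ female_house + white, data = states)
summary(pov_mlr)
anova(pov_mlr)

# R squared = explained variability / total variability, using the
# rounded sums of squares from the ANOVA table:
(132.57 + 8.21) / 480.25   # roughly 0.29
```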
In other words, adjusted R squared only increases if the additional percentage of variability in the response variable explained by the new variable can offset the penalty for the additional number of predictors in the model. First, let's take a look at how we can calculate adjusted R squared. Here we have the multiple linear regression model predicting percentage living in poverty from percentage of female householders and percentage white, and remember that our sample size n was 51; that's the 50 states plus DC. To calculate adjusted R squared, we simply find the ratio of the unexplained variability to the total variability, apply the penalty to that ratio, and subtract the result from 1. That is, 1 - (339.47 / 480.25) x ((51 - 1) / (51 - 2 - 1)), where 51 is our sample size and k, the number of predictors, is 2: female householder and white. This comes out to be 26%. Remember, our R squared was 29%; however, our adjusted R squared, with the penalty for the additional predictor, is only 26%.

So, in summary: for the first model, where we simply had female householder as our only predictor, our R squared was 28%. For the second model, where we have the additional variable white, our R squared has increased to 29%, while our adjusted R squared, which applies the penalty for this additional variable, stayed at 26%. Remember, when any variable is added to the model, R squared increases. However, if the added variable doesn't really provide any new information, or is completely unrelated, adjusted R squared does not increase.

Finally, some properties of adjusted R squared. First, since k, the number of predictors, can never be negative, adjusted R squared is always going to be less than R squared. Second, adjusted R squared applies a penalty for the number of predictors included in the model. And third, we choose models with higher adjusted R squared over others. The decision criterion is based on adjusted R squared, as opposed to R squared, because R squared is always going to be higher for models with a larger number of predictors, but those may not always be the preferable ones.
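As a check, here is a sketch of this calculation in R, again assuming the hypothetical pov_mlr fit from the earlier sketch; the summary output reports the same adjusted R squared directly.

```r
# Adjusted R squared = 1 - (SSE / SST) * ((n - 1) / (n - k - 1))
n <- 51   # 50 states plus DC
k <- 2    # predictors: female householder and white
1 - (339.47 / 480.25) * ((n - 1) / (n - k - 1))   # roughly 0.26

# summary() reports the same quantity directly:
summary(pov_mlr)$adj.r.squared
```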