At this point, we've learned how to test a multiple linear regression model,

and how to evaluate the fit of the model based on the significance of the

regression coefficients and their confidence intervals,

and by the R-squared, which is the amount of variability in the response variable

that is explained by our explanatory variables.

However, we should further evaluate our regression models for

evidence of misspecification.

Specification is the process of developing a regression model.

If a model is correctly specified, then the residuals, or

error terms, are not correlated with the explanatory variables.

If the data fail to meet the regression assumptions, or if our model is missing

important explanatory variables, then we have model specification error.

We perform regression diagnostics to try to understand the cause of

the misspecification, so that we can try to address it.

We can assess violations of the assumptions of the linear regression analysis

by examining model residuals.

That is, we can take a closer look at the e in our regression formula,

which is the error, or residual estimate.

There are many regression diagnostic procedures to choose from.

In this course, we will focus on examining residual plots,

in order to visually evaluate specification error.

First, let's add another centered explanatory variable, internetuserate,

to our regression equation.

Internet use can be considered an indicator

of a country's level of modernization.

Here's the regression equation for this model and the python code.

This is the same GapMinder model that we tested previously, with the exception that

we have added the centered internetuserate explanatory variable.
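A minimal sketch of this model in Python, using the statsmodels formula API. The column names (femaleemployrate, urbanrate, internetuserate) follow the lecture, but the data below are simulated stand-ins for the GapMinder file, so the fitted coefficients are for illustration only:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data standing in for the GapMinder variables; only the
# column names come from the lecture, the values are made up.
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "urbanrate": rng.uniform(10, 100, n),
    "internetuserate": rng.uniform(0, 95, n),
})

# Center each explanatory variable so its mean is zero
data["urbanrate_c"] = data["urbanrate"] - data["urbanrate"].mean()
data["internetuserate_c"] = data["internetuserate"] - data["internetuserate"].mean()

# Simulated response with a curvilinear urban rate effect
data["femaleemployrate"] = (
    44
    + 0.2 * data["urbanrate_c"]
    - 0.01 * data["urbanrate_c"] ** 2
    + 0.1 * data["internetuserate_c"]
    + rng.normal(0, 8, n)
)

# Linear + quadratic centered urban rate terms, plus centered internet use rate
reg3 = smf.ols(
    "femaleemployrate ~ urbanrate_c + I(urbanrate_c**2) + internetuserate_c",
    data=data,
).fit()
print(reg3.summary())
```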

And here are the results.

We haven't yet discussed the interpretation of the intercept in detail.

The intercept is the value of the response variable,

when all explanatory variables are held constant at zero.

Because we centered our two explanatory variables so

that the mean of each variable was equal to zero, the intercept

is the female employment rate at the mean of urban rate and Internet use rate.

So the female employment rate, when urban rate and

Internet use rate are at their means, is 44 out of every 100 women.
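To illustrate why centering gives the intercept this interpretation, here is a small numpy sketch with synthetic data (not the GapMinder values): with centered predictors, the least-squares intercept equals the mean of the response, which is the fitted value at the predictor means.

```python
import numpy as np

# Synthetic data, for illustration only
rng = np.random.default_rng(1)
x1 = rng.normal(50, 20, 100)
x2 = rng.normal(30, 15, 100)
y = 5 + 0.3 * x1 + 0.2 * x2 + rng.normal(0, 2, 100)

# Center each explanatory variable at its mean
x1c, x2c = x1 - x1.mean(), x2 - x2.mean()
X = np.column_stack([np.ones_like(x1c), x1c, x2c])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# With centered predictors, the intercept is the fitted response at the
# predictor means -- i.e., the mean of y.
print(coef[0], y.mean())
```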

The results also show that the coefficients for the linear and quadratic

urban rate variables remain significant after adjusting for Internet use rate.

Internet use rate is also statistically significant.

The positive regression coefficient indicates that countries

with a higher rate of internet usage tend to have a higher female employment rate.

Each observation has an estimated response value,

which is also referred to as a predicted or

fitted response value, based on the regression equation.

But we know that this equation does not estimate the observed response value for

that observation perfectly.

In fact, urban rate and Internet use rate together,

explain only about 18% of the variability in female employment rate.

So, there's clearly some error in estimating the response value with

this model.

In this regression model, the residual is the difference between the expected or

predicted female employment rate, and the actual observed female employment rate for

each country.
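As a toy numeric illustration (the numbers below are made up, not actual GapMinder values), the residual for each country is simply the observed value minus the predicted value:

```python
import numpy as np

# Made-up observed and model-predicted female employment rates
observed = np.array([47.9, 55.0, 42.0])
fitted = np.array([45.1, 52.3, 44.7])

# residual = observed - fitted, one value per country
residuals = observed - fitted
print(residuals)
```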

We can take a look at this residual variability,

which not only helps us to see how large the residuals are, but

also allows us to see whether our regression assumptions are met.

And whether there are any outlying observations

that might be unduly influencing the estimation of the regression coefficients.

The easiest way to evaluate residuals is to graph them.

First, we can use a qq-plot to evaluate the assumption that the residuals from our

regression model are normally distributed.

A qq-plot plots the quantiles of the residuals that we would theoretically see

if the residuals followed a normal distribution, against the quantiles of

the residuals estimated from our regression model.

The python code to generate a qq-plot is here.

First we create an object called fig1, that will be our qq-plot.

Then an equal sign, followed by sm.qqplot, which

calls the qqplot function from the statsmodels library.

In parentheses we include reg3.resid where reg3 is the object that contains our

multiple regression results, and resid, spelled r-e-s-i-d, contains the model residuals.

Then, we add a comma, and line equals r with the r in quotes

which tells Python to generate a red linear regression line on the plot.

What we're looking for is to see whether the points follow a straight line,

meaning that the model's estimated residuals are what we would expect

if the residuals were normally distributed.

The qqplot for our regression model shows that the residuals

generally follow a straight line, but deviate at the lower and higher quantiles.

This indicates that our residuals do not follow a perfectly normal distribution.

This could mean that the curvilinear association that we observed in our

scatter plot may not be fully estimated by the quadratic urban rate term.

There might be other explanatory variables that we could consider including in

our model that would improve estimation of the observed curvilinearity.