In a linear regression analysis,

the accuracy of the model is assessed by the mean square error, or MSE,

which is based on the difference between the model-estimated value of a response

variable, denoted as Y hat, and the observed value of the Y response variable.

This difference is computed for each observation, where each

observation is denoted by the i subscript, and then the difference is squared.

Finally, the squared errors for all the observations are summed, then divided by

n, the number of observations, to get the mean square error.
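
The calculation just described can be sketched in a few lines of Python. This is only an illustration: the function name and the four observations are made up for the example and are not data from the lecture.

```python
def mean_squared_error(observed, predicted):
    """Average of the squared differences between observed Y and estimated Y hat."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("at least one observation is required")
    n = len(observed)
    # Square each observation's error (Y_i minus Y hat_i), sum them, divide by n.
    return sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)) / n


# Made-up illustrative values (assumption, not lecture data).
y = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.0, 5.0, 8.0, 11.0]
mse = mean_squared_error(y, y_hat)
print(mse)  # errors are 1, 0, -1, -2, so the squared errors sum to 6 and MSE = 1.5
```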

There are two characteristics that we need to consider when selecting an accurate

statistical model: its variance and its bias.

Variance refers to the amount by which the model parameter estimates would change

if we estimated them using a different training dataset.

If a method has high variance, then small changes in the training data can

result in large changes in the parameter estimates.

Ideally, the parameter estimates are stable across data sets,

meaning that the method has low variance.
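
One way to see what this means is to refit the same model on many different training data sets and look at how much the estimate moves around. The sketch below does this for the slope of a one-predictor linear regression. The data generator (a true slope of 2 plus random noise), the sample size, and the number of repetitions are all assumptions chosen for illustration.

```python
import random
import statistics


def fit_slope(xs, ys):
    """Ordinary least squares slope for a one-predictor linear regression."""
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


def simulate_training_set(rng, n):
    """Hypothetical training data: y = 2x plus Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0, 3.0) for x in xs]
    return xs, ys


rng = random.Random(0)
# Estimate the slope on 200 different training data sets of 30 observations each.
slopes = [fit_slope(*simulate_training_set(rng, 30)) for _ in range(200)]

mean_slope = statistics.mean(slopes)
slope_var = statistics.variance(slopes)  # spread of the estimate across data sets
print(f"mean slope: {mean_slope:.3f}, variance of slope: {slope_var:.4f}")
```

A small value for the variance of the slope means the estimate is stable from one training data set to the next.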

Bias refers to the error that is introduced by using a statistical model

to approximate a real life phenomenon.

It is a measure of how far off the model-estimated values are from the true values.

For example, we might use a statistical model to try to predict whether or

not a person will become nicotine dependent.

In reality, there are lots of factors that lead to nicotine dependence.

Whether or not a person becomes nicotine dependent depends on a complex

interplay of multiple factors like genetics, behavior, attitudes, and so on.

By comparison, the statistical models we develop to estimate nicotine dependence

are very simple, and as a result they don't fully capture this complexity.

This in turn leads to biased parameter estimates.

What we would like to do is find a statistical model that

has both low variance and low bias.

The problem is that these two properties are negatively associated.

Increases in one result in decreases in the other, hence the

bias-variance trade-off.

Generally, as model complexity increases,

the variance tends to increase, and bias tends to decrease.

Simpler models will be more stable across samples, meaning that they will have low

variance, but they are also likely to be more biased.
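
The trade-off is often written as a decomposition of the expected test error. The lecture does not state this formula, so treat it as a standard textbook result added here for reference. The symbols are assumptions: x_0 is a new observation, y_0 its observed response, f hat the fitted model, and epsilon the random error that no model can explain.

```latex
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \operatorname{Var}\left(\hat{f}(x_0)\right)
  + \left[\operatorname{Bias}\left(\hat{f}(x_0)\right)\right]^2
  + \operatorname{Var}(\varepsilon)
```

Because the last term is fixed, the expected test error can only be lowered by reducing variance, squared bias, or both.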

In this figure, you can see an example of a simple model and

a complex model fit on a training data set.

In the complex model on the right,

the observed values are all very close to the estimated regression line.

So the training error rate is low, meaning that bias is low.

However, more complex models attempt to capture every pattern in the training

data set, even those that occur by chance.

The chance patterns are specific to the sample on which the model is fit and

are not likely to exist in the test data set.

As a result, the model will not fit as well in the test data set,

which means that the test error rate will be high.

When a model has a small training mean square error but

a large test mean square error, the model is said to be overfitted.

When we overfit the training data,

the test mean square error will be very large because the patterns

that the method found in the training data set simply won't exist in the test data.
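
Overfitting can be sketched with a deliberately extreme model. Instead of the regression models in the figure, the example below uses a stand-in that simply memorizes the training data: it predicts the response of the single closest training observation. The data generator and sample sizes are assumptions made for illustration.

```python
import random


def mse(observed, predicted):
    """Mean square error between observed and predicted responses."""
    return sum((y - p) ** 2 for y, p in zip(observed, predicted)) / len(observed)


def make_data(rng, n):
    """Hypothetical data: a linear signal plus random noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0, 3.0) for x in xs]
    return xs, ys


def memorizer_predict(train_x, train_y, x):
    """Return the training response of the single closest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


rng = random.Random(42)
train_x, train_y = make_data(rng, 50)
test_x, test_y = make_data(rng, 50)

train_pred = [memorizer_predict(train_x, train_y, x) for x in train_x]
test_pred = [memorizer_predict(train_x, train_y, x) for x in test_x]

train_mse = mse(train_y, train_pred)  # zero: every training point matches itself
test_mse = mse(test_y, test_pred)     # larger: the memorized noise does not carry over
print(f"training MSE: {train_mse:.2f}, test MSE: {test_mse:.2f}")
```

The memorizing model reproduces the chance patterns in the training data exactly, which is why its training error is zero while its test error is not.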

On the other hand, the simpler model on the left doesn't predict the observed

values as well, and as a result it has a high training mean square error.

This simpler model may be considered underfitted.

Basically, the simpler model ignores many of the patterns in the training data set,

which leads to increased bias.

The high training mean square error means that it is not taking into account

many patterns that are likely to be real, so

it is also likely to result in a high test error rate.

On the other hand, the simpler model is also likely to overlook random

sample-specific patterns, which means that variance will be low.

This figure shows how model complexity impacts training and test error rate.

In a really simple model, there's a lot of prediction error.

Bias is high, but variance is low.

In this case the model is underfitted.

As model complexity increases, you can see that the prediction error, or

bias, decreases in the training sample.

In the test sample, similar to the training sample, prediction error, or bias,

decreases, and variance increases as the model becomes more complex.

However, you can see that there's a point at which an overfitted or

increasingly complex model will actually increase the test error rate.

In this situation, a model that is overfitted on the training sample

leads to a low error rate in the training sample

at the cost of fitting poorly in the test data set.

The ideal model complexity is where the test error rate bottoms out.

The model at this point will have relatively low bias and low variance,

which together will provide the lowest possible test error rate.
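
The idea of picking the complexity where the test error bottoms out can be sketched as follows. The lecture does not name a method for this, so the example swaps in k-nearest-neighbor averaging, where a large k gives a very simple model and k = 1 gives the most complex one. The data generator, the list of k values, and the function names are assumptions made for illustration.

```python
import math
import random


def mse(observed, predicted):
    """Mean square error between observed and predicted responses."""
    return sum((y - p) ** 2 for y, p in zip(observed, predicted)) / len(observed)


def make_data(rng, n):
    """Hypothetical data: a curved signal plus random noise."""
    xs = [rng.uniform(0, 6) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys


def knn_predict(train_x, train_y, x, k):
    """Average the responses of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k


def error_curves(train, test, ks):
    """Return {k: (training MSE, test MSE)} for each complexity setting."""
    train_x, train_y = train
    test_x, test_y = test
    curves = {}
    for k in ks:
        train_pred = [knn_predict(train_x, train_y, x, k) for x in train_x]
        test_pred = [knn_predict(train_x, train_y, x, k) for x in test_x]
        curves[k] = (mse(train_y, train_pred), mse(test_y, test_pred))
    return curves


rng = random.Random(7)
train = make_data(rng, 60)
test = make_data(rng, 60)
ks = [60, 30, 15, 8, 4, 2, 1]  # ordered from simplest (k=60) to most complex (k=1)
curves = error_curves(train, test, ks)

# The preferred complexity is the one with the lowest test error.
best_k = min(ks, key=lambda k: curves[k][1])
for k in ks:
    print(f"k={k:2d}  training MSE={curves[k][0]:.3f}  test MSE={curves[k][1]:.3f}")
print("lowest test MSE at k =", best_k)
```

Note that the choice is made on the test error, not the training error: the training error is smallest for the most complex model no matter how badly it overfits.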

Assessing model accuracy and the bias-variance trade-off also apply to

situations in which we develop statistical models to classify observations

into different levels of a categorical outcome variable.

Logistic regression is an example of a classification model.

In logistic regression, model accuracy is determined by how well a logistic

regression model developed on a training data set correctly classifies

observations on a categorical outcome variable in a test data set.

The same bias-variance principle applies in this case, such that a logistic

regression model with low bias and low variance will have a low prediction or

classification error rate in the test data set.

For example, we might develop a logistic regression model to predict whether or

not a person is nicotine dependent.

The statistical model developed on the training sample

can be applied to observations in the test sample to see how accurately the model

classifies observations by comparing model-predicted nicotine dependence diagnoses

to actual diagnoses for observations in the test data set.

A confusion matrix,

like the one shown here, can be used to estimate prediction accuracy.

A model with low prediction error will have a high percentage of correctly

classified observations, and a low percentage of misclassified observations.
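
A confusion matrix for a two-level outcome can be built by counting each combination of actual and predicted class. The sketch below assumes the outcome is coded 1 for nicotine dependent and 0 for not dependent. The ten labels are made up for the example and are not the lecture's data.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs for a binary outcome coded 0/1."""
    counts = Counter(zip(actual, predicted))
    return {
        "true_negative": counts[(0, 0)],
        "false_positive": counts[(0, 1)],
        "false_negative": counts[(1, 0)],
        "true_positive": counts[(1, 1)],
    }


def misclassification_rate(matrix):
    """Share of observations whose predicted class differs from the actual class."""
    wrong = matrix["false_positive"] + matrix["false_negative"]
    return wrong / sum(matrix.values())


# Made-up test-sample labels (assumption): 1 = nicotine dependent, 0 = not.
actual = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]

matrix = confusion_matrix(actual, predicted)
rate = misclassification_rate(matrix)
print(matrix)
print(rate)  # 2 of the 10 observations are misclassified, so the rate is 0.2
```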

In this example, the statistical model developed on the training data incorrectly classified

a total of 123 of the 992 observations in the test sample, meaning that

the statistical model misclassified about 12% of the observations in the test data set.
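
Written out, the arithmetic behind that percentage uses only the two counts given above. The remaining 869 observations were classified correctly.

```latex
\text{misclassification rate} = \frac{123}{992} \approx 0.124 \approx 12\%
\qquad
\text{accuracy} = \frac{992 - 123}{992} = \frac{869}{992} \approx 0.876 \approx 88\%
```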