In this video, we discuss how to assess the predictive accuracy of a linear regression model using cross-validation. A predictive model is ultimately judged by the accuracy of its predictions, so how we evaluate predictive accuracy is crucial. In classical statistics, we evaluate models with goodness-of-fit measures, which quantify how well the model fits the original dataset; these usually rely on statistics such as the p-value or R-squared, as we will discuss in another video. When evaluating predictive accuracy, however, we are primarily interested in how the model will perform on data it has not seen. After all, for the data we use to build a model, we already know the values of the target variable. A model is useful for prediction only if it can make predictions on data not used to build it. It is reasonable to argue that predictive accuracy is a fairer assessment of a model, because we have to reserve some data for model evaluation and cannot use it when building the model. This means we need more data overall; that is not an issue when we have plenty of data, but it can be tricky when our dataset is relatively small. Using predictive accuracy to evaluate models is not an entirely new idea: professionals working on various forecasting tasks have used it for a long time, and it has become even more prevalent with recent developments in data science. As we shall see throughout the course, it is used extensively in predictive modeling. Before assessing predictive accuracy, we randomly divide the data into two parts; this process is called data partitioning. One portion of the data is designated as the training set, which is used to fit the models. The other portion, which is not used to fit the models, is used to evaluate the different models we build; this portion is called the validation set. How exactly do we measure predictive accuracy? The answer depends on the type of predictive modeling.
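The partitioning step described above can be sketched in code. This is a minimal illustration assuming Python with scikit-learn (the video does not name a tool), and the data here are simulated as a stand-in for the housing dataset, which is not shown:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the housing data: 313 rows (so that 60% is
# about 188 rows, as in the video), one predictor, linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(500, 4000, size=(313, 1))            # e.g., house size
y = 0.5 * X[:, 0] - 100 + rng.normal(0, 500, 313)    # e.g., house price

# Data partitioning: 60% of the rows go to the training set,
# the remaining 40% form the validation set.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.6, random_state=1)

print(len(X_train), "training rows,", len(X_valid), "validation rows")
```

The split is random, so `random_state` is fixed here only to make the example reproducible; different seeds give different partitions.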
It is different for regression as opposed to classification. We will discuss classification in a different video. Here, we briefly discuss how to measure predictive accuracy for regression. Recall that one way to measure the error for a particular observation is the residual: the observed value minus the predicted value. We can extend that idea a little to construct a measure of predictive accuracy. We apply the concept to the validation data by calculating the residual for every row in the validation set, squaring each residual, and summing them all up into a single measure. Note that the squaring is needed because otherwise positive and negative errors would partly cancel out. The measure just described is called the sum of squared errors, and it is an appropriate way to measure predictive accuracy. There are other ways of representing the same information. One alternative is the RMS error, or root mean squared error, which is the square root of the sum of squared errors divided by the total number of rows in the validation data. Now let's look at the cross-validation results for the housing data. We first divide the data into training and validation sets, where we take 60% of the data, or 188 rows, as the training set, and the rest belongs to the validation set. We use the training set to fit the model, and we use the validation set to calculate measures of predictive accuracy. In this case, the sum of squared errors is 39,326,517 and the RMS error is 558. Either of these errors can be used to compare different models, where smaller values are preferred. It is interesting to point out that the fitted line is slightly different from the one that we saw before. In particular, the intercept b0 is now -105.95 and the slope b1 is 0.4619. This is because here we use only the training set to fit the line. In general, when we use different subsets of the data to fit the model, we will get slightly different coefficient estimates.
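The evaluation procedure above can be sketched as follows. Again this assumes Python with scikit-learn and uses simulated data in place of the actual housing dataset, so the SSE and RMSE printed will not match the video's numbers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Simulated stand-in for the housing data (the real dataset is not shown).
rng = np.random.default_rng(0)
X = rng.uniform(500, 4000, size=(313, 1))
y = 0.5 * X[:, 0] - 100 + rng.normal(0, 500, 313)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.6, random_state=1)

# Fit the line using the training set only.
model = LinearRegression().fit(X_train, y_train)

# Residual = observed value minus predicted value, on the validation set.
residuals = y_valid - model.predict(X_valid)
sse = np.sum(residuals ** 2)            # sum of squared errors
rmse = np.sqrt(sse / len(y_valid))      # root mean squared error

print(f"b0 = {model.intercept_:.2f}, b1 = {model.coef_[0]:.4f}")
print(f"SSE = {sse:.0f}, RMSE = {rmse:.1f}")
```

Because the model is fitted on the training rows only, `model.intercept_` and `model.coef_` will differ slightly from a fit on the full dataset, mirroring the coefficient shift noted above.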
Therefore, the cross-validation result depends on how we split the data into training and validation sets. One way to deal with this issue is to perform n-fold cross-validation. In n-fold cross-validation, we divide the dataset by row into n equally sized sets and run the estimation-and-validation process n times. For each of the n runs, we use one set of the data as the validation data and all other sets as the training data. The overall performance measure is averaged across all n runs. In this way, we obtain a performance measure that tends to be more reliable. Here is a graphical illustration of 5-fold cross-validation, where n = 5. We divide the data into five approximately equally sized portions, each called a fold. In each round, we take one fold as the validation set and the rest of the data as the training set. I would like to point out that it is common to take n = 5 or n = 10; that is, we typically perform five- or ten-fold cross-validation. Theoretical studies and empirical results show that these choices generally perform well.
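The n-fold procedure can be sketched as below, again assuming scikit-learn (where it is called `KFold`) and simulated stand-in data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Simulated stand-in for the housing data.
rng = np.random.default_rng(0)
X = rng.uniform(500, 4000, size=(313, 1))
y = 0.5 * X[:, 0] - 100 + rng.normal(0, 500, 313)

# 5-fold cross-validation: each row serves in the validation set exactly once.
kf = KFold(n_splits=5, shuffle=True, random_state=1)
rmses = []
for train_idx, valid_idx in kf.split(X):
    # One fold is held out for validation; the other four form the training set.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    resid = y[valid_idx] - model.predict(X[valid_idx])
    rmses.append(np.sqrt(np.mean(resid ** 2)))

# The overall performance measure is the average across the 5 runs.
print(f"mean RMSE over 5 folds: {np.mean(rmses):.1f}")
```

Averaging over the five folds smooths out the luck of any single split, which is exactly why the cross-validated RMSE is a more reliable estimate than one train/validation split.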