In this video we discuss how to assess the predictive accuracy of a linear regression model using cross-validation. Predictive accuracy is often viewed as the ultimate test of the adequacy of a particular model, so how we assess the predictive accuracy of a predictive model is crucial. In classical statistics we evaluate models with goodness of fit, which measures the prediction error on the original data set. You may already be familiar with some of these statistics, such as the p-value or r-squared, which we discussed in another video. When evaluating predictive accuracy, however, we are primarily interested in how the model will perform on data not seen by the model. After all, for the data we used to build the model, we already know the values of the target variable. A model is useful for prediction only if it can make predictions on data that were not used to build the model. It is reasonable to argue that predictive accuracy is a fairer assessment of models. Because we have to reserve some data for model evaluation and cannot use it when building the model, we need more data. This is not an issue when we have plenty of data, but it can be tricky when our data set is relatively small. Using predictive accuracy to evaluate models is not an entirely new idea. Professionals working on forecasting tasks have been using it for a long time, and it has become more prominent in recent data science development. As we shall see throughout the course, it is used extensively in predictive modeling. Before assessing prediction accuracy, we split the data into two parts. This process is called data partitioning. One portion of the data is designated as the training data, which is used to fit the models. The other portion, which is not used to fit the models, is used to evaluate the different models we build. This portion is called the validation set. How exactly do we measure prediction accuracy? The answer depends on the type of predictive modeling.
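The data partitioning step described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the video; the toy data set and the `partition` helper are made up for demonstration.

```python
import random

def partition(rows, train_frac=0.6, seed=42):
    """Randomly split rows into a training set and a validation set.

    train_frac is the fraction of rows designated as training data;
    the seed fixes the shuffle so the split is reproducible.
    """
    rows = list(rows)
    rng = random.Random(seed)
    rng.shuffle(rows)                       # randomize row order before splitting
    n_train = int(len(rows) * train_frac)
    return rows[:n_train], rows[n_train:]   # (training set, validation set)

data = list(range(10))                      # hypothetical stand-in for a data set
train, valid = partition(data)
print(len(train), len(valid))               # 6 4
```

The training rows are used to fit the model; the validation rows are held back for evaluation only.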
It is different for regression as opposed to classification. We will discuss classification in depth in another video. Here we briefly discuss how we measure prediction accuracy for regression. Recall that we can measure the error for a particular observation with the concept of a residual, which is the observed value minus the predicted value. We can extend that idea a little to construct a measure of prediction accuracy. We apply the concept to the validation data by calculating the residual for all rows in the validation data. To come up with a single measure, we square each residual and sum them all up. Note that the squaring operation is needed here because otherwise positive and negative errors cancel out. The measure we just described is called the sum of squared errors, which is a popular way to measure prediction accuracy. There are other ways to represent the same information. One alternative is the RMS error, or root mean squared error, which is the square root of the sum of squared errors divided by the total number of rows in the validation data.

Now let's look at the cross-validation results for the housing data. We first divide the data into training and validation sets, where we take 60% of the data, or 188 rows, as the training set, and the rest belongs to the validation set. We use the training set to fit the model, and we use the validation set to calculate the measure of prediction accuracy. In this case the sum of squared errors is 39,326,517 and the RMS error is 558. Both of these measures can be used to compare different models, where smaller values are preferred. It is interesting to point out that the fitted line is slightly different from the one we saw before. In particular, the intercept b0 is now -105.95, and the slope b1 is 0.4619. This is because here we only use the training set to fit the line. In general, when we use different subsets of the data to fit the model, we get slightly different coefficient estimates.
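The two error measures just defined are easy to compute directly. The sketch below uses small hypothetical numbers, not the housing data from the video, so the relationship between SSE and RMS error is easy to check by hand.

```python
import math

def sse(actual, predicted):
    """Sum of squared errors: squaring keeps positive and
    negative residuals from cancelling each other out."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

def rmse(actual, predicted):
    """Root mean squared error: square root of the SSE divided
    by the number of rows in the validation data."""
    return math.sqrt(sse(actual, predicted) / len(actual))

actual    = [300, 450, 500, 720]   # hypothetical observed values
predicted = [310, 430, 520, 700]   # hypothetical model predictions
print(sse(actual, predicted))      # 100 + 400 + 400 + 400 = 1300
print(rmse(actual, predicted))     # sqrt(1300 / 4) ≈ 18.03
```

Both functions would be applied to the validation rows only, and smaller values indicate a more accurate model.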
Therefore the cross-validation results likely depend on how we split the data into training and validation sets. One way to deal with this issue is to perform what we call n-fold cross-validation. In n-fold cross-validation, we divide the data set by row into n equally sized sets and run the estimation and validation process n times. For each of the n runs, we use one set of the data as the validation data and all other sets as the training data. The overall performance measure is averaged across all n runs. In this way, we obtain a performance measure that tends to be more reliable. Here is a graphical illustration of five-fold cross-validation, where n equals five. We divide the data into five approximately equal-sized portions, each called a fold. In each run, we take one fold as the validation set and the rest of the data as the training set. I would like to point out that it is common to take n to be five or ten; that is, we typically perform five-fold or ten-fold cross-validation. Theoretical studies and empirical results show that these choices usually perform well.