Hi guys. Welcome back for this module's recitation. This week we will add some prediction and forecasting tools to our current toolbox. Last module, we tried to understand the relationship between causes and consequences. This week we're going to focus on using the knowledge built last week, in particular our ability to build classification and regression models, in order to make predictions on out-of-sample data, that is, data that the model has never seen before. As usual, we'll review all the examples covered in class and explain how to use R to obtain the same outputs.

Let's get started with the credit scoring example we saw last module. As we always do, let's set our working directory to the folder where we have downloaded both the original credit scoring data set that we explored in the last module and the new credit scoring data set with the additional observations. For me this is this folder, and I set it as my working directory. We then run this line in order to clean the memory of our current session. We then load the data sets with the read.table function and call the original data set dataold, and the new data set datanew; the latter contains the new additional observations that the model we built last week has never seen before.

Now, if you do not remember the data set from last module, you should spend some time exploring it again with the str function and the summary function. Let's go over the datanew data set very quickly. If we run the str function, we see that there are 100 observations in this data set and 10 variables, which are the same as those contained in the old data set, namely the Income variable, the Rating variable, the Cards variable, the Age variable, the Education variable, the Gender variable, the Student variable, the Married variable, the Ethnicity variable, and the Balance variable. Let's run the summary function on our datanew. What can we learn about the variables in this data set? First, we see that the minimum income is 10.73, that is, $10,730.
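The setup steps described above can be sketched as follows. Note that the file names, the folder path, and the read.table options are assumptions (the transcript does not show them); substitute your own download location and file format.

```r
# Hypothetical file names and path -- adjust to your own download folder.
rm(list = ls())                       # clean the memory of the current session
setwd("~/Downloads/credit-scoring")   # set the working directory (assumed path)

# Load the original data set and the new one with the unseen observations.
# header/sep are assumptions about the file format.
dataold <- read.table("creditscoring.csv",    header = TRUE, sep = ",")
datanew <- read.table("creditscoringnew.csv", header = TRUE, sep = ",")

str(datanew)      # structure: 100 observations, 10 variables
summary(datanew)  # numeric summary of each variable
```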
The maximum is 182.73 and the average is 48.71, both in thousands of US dollars. The Rating variable has a minimum value of 112, a maximum value of 982, and an average value of 375.4. You should probably compare those numbers to the ones obtained in the original data set. In any case, this gives us a good idea of the order of magnitude of the Rating variable here.

Now, like we did last module, let's build a linear regression model with the Rating variable as our dependent variable and all the other variables as independent variables. Here we use a dot, and the data we use in order to build the model is our dataold data set, which is again the same data set that we used in the last module. Let's run the line. Now we have a model that we call linreg, and we could assess how it performs as we did in the previous module.

We're ready to test the model on out-of-sample data. What we want is for the model to give us predicted credit scores for the observations of our new data set, the one that we called datanew. Again, remember that this is data that the model has never seen before. In order to obtain predictions made by the model on new data, we use the predict function. The first argument is the model that you want to use; for us, it's linreg. Then the newdata argument is set equal to our new data, which in our case is the datanew data set. Finally, a last argument allows us to choose the type of prediction we wish to make. Here we set type equal to "response", in quotes, in order to obtain the predicted credit scores. We then store the output in a new variable called predcreditscore. Let's run the line. Now let's call the predcreditscore variable in order to see its output. To do that, you can highlight the name of the variable and then press Cmd+Enter on a Mac or Ctrl+Enter on a PC.
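Putting the modeling and prediction steps above together, a minimal sketch looks like this, assuming dataold and datanew have been loaded as before and that the dependent variable's column is named Rating (the exact capitalization is an assumption):

```r
# Linear regression of Rating on all other variables ("." = every other column),
# fit on the old, in-sample data.
linreg <- lm(Rating ~ ., data = dataold)

# Out-of-sample predictions: one predicted credit score per new observation.
predcreditscore <- predict(linreg,
                           newdata = datanew,
                           type    = "response")
predcreditscore   # print the predicted credit scores
```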
What the predict function has done is predict a credit score for every one of the 100 new observations, and it did so using the model built on the in-sample data. Now let's see how the model performs using the correlation between fitted and actual values on the dataold data set. As we did in the previous module, we use the cor function, with the fitted values first and the actual values second. We obtain a 98% correlation, more precisely 0.9867324. Now let's have a visual look at these results by using a plot which has the fitted values on the vertical axis and the actual values on the horizontal axis. Again, you see that the model does pretty well, although not as well on the lower values.

Now that we know how the model does on in-sample data, let's check how it does on out-of-sample data. Here, again, we use correlation. The first argument is our predictions, and then we have the actual values, obtained by typing datanew$Rating. We get an even stronger number of about 99%, more precisely 0.9880. Again, let's see the output visually by using a plot. We have fewer data points because we only have 100, but we see that the model does a pretty good job. And here again, we see that for lower values it doesn't do as good a job. What's interesting here is that if we had more observations, or in this case more loan applicants for which we had values for all the explanatory variables, we could give them a credit score and assess whether or not it makes sense to grant them a loan from a business point of view. So that's it for this example, and I'll see you in the next video.
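The in-sample and out-of-sample checks described above can be sketched as below, assuming linreg and predcreditscore were created as in the earlier steps and that the rating column is named Rating (an assumption about capitalization):

```r
# In-sample fit: correlation between fitted and actual ratings on dataold.
cor(linreg$fitted.values, dataold$Rating)   # the transcript reports ~0.9867
plot(dataold$Rating, linreg$fitted.values,  # actual on x-axis, fitted on y-axis
     xlab = "Actual rating", ylab = "Fitted rating")

# Out-of-sample fit: correlation between predictions and actual new ratings.
cor(predcreditscore, datanew$Rating)        # the transcript reports ~0.9880
plot(datanew$Rating, predcreditscore,       # actual on x-axis, predicted on y-axis
     xlab = "Actual rating", ylab = "Predicted rating")
```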