[MUSIC] We can use the techniques seen during the previous module to identify the observations that are most likely to demonstrate a certain behavior in the future. Thanks to classification and regression techniques, we have built a model relating causes and outcomes based on past data. We can therefore plug into this model new data that was not used to estimate its parameters. This is what is called out-of-sample data, and we can consequently predict the expected outcomes for those new observations thanks to the original model. First, let's take again the example of credit scoring. We can estimate the model again based on the same data as before, but then we can use the estimated model on a new dataset - out-of-sample data that hasn't been used to estimate the model - and make predictions. As you'll see during the recital, once we have estimated the model and collected additional data, we can use the function "predict" in R to apply this model to the new dataset. In this particular case, since we know what the credit score should be, we can compare the predictions with the actual values. We achieve a correlation of 98%. And when we look at the results visually with a plot, here with the fitted values as a function of the actual ones, we see that the predictions are indeed very close to what they should be. Now, what could we do with this in practice? Many things, actually. We can identify the applicants who do not have a credit score yet but who should be acquired as clients. Or the opposite: avoid acquiring applicants with a bad credit score. You see that once we have estimated the model, we can predict the outcome of interest for any additional observation, as long as we have the explanatory variables used in the model. Let's now take the HR analytics example again. It will be the last time we use it. We can apply exactly the same approach here.
We can use the predict function on the output of the glm function that was used to conduct a logistic regression. Here, all the employees will be the focus of our analysis. The prediction needs to be seen as the probability that each employee will leave the company in the future. We then have, for each employee in our sample, a probability like this. What can we do with it? We could, for instance, prioritize our actions, and if we do so, we need to keep in mind that the probability tells us which employees are the most likely to leave. But we also need to add to the analysis information about how much we want to retain those employees. As we've seen before, the dataset contains information about the performance of each employee thanks to the evaluation variable. Let's add it to the analysis and create a table with those two pieces of information: probability to leave and performance. We can then work with a 2x2 matrix. We first have the employees that are underperforming; we should improve their performance. Then we have those who are performing well, and whom we should retain. Among those we want to retain, some employees are not likely to leave soon, so we should manage them as usual. And then, in the short run, we should focus on those with a good performance and a high probability to leave. To assess the individual priorities for each employee in a quantitative way, we could, for instance, multiply the probability to leave by the performance. We then have a priority score, since the result of this product will be high for the employees that we want to act on quickly, and low for the other ones. So if we rank our employees by this priority score, we obtain something like this: the first column reporting the ID of the employee, the second, the probability to leave, the third, the performance, and finally, the product of the latter two, the priority score, which will allow us to prioritize our actions.
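The priority-score step above can be sketched as follows. This is an illustrative Python analogue of what the course does in R; the employee IDs, probabilities, and performance values are made up for the example, not taken from the HR dataset:

```python
# Illustrative sketch: combine the predicted probability to leave with the
# performance (evaluation) into a priority score, then rank employees.
# All numbers below are invented for illustration.
employees = [
    {"id": 928, "p_leave": 0.95, "performance": 0.90},
    {"id": 588, "p_leave": 0.85, "performance": 0.95},
    {"id": 112, "p_leave": 0.90, "performance": 0.30},  # likely to leave, low performer
    {"id": 407, "p_leave": 0.10, "performance": 0.92},  # worth retaining, not at risk
]

# Priority score: high when the employee is both likely to leave
# and worth retaining (high performance).
for e in employees:
    e["priority"] = e["p_leave"] * e["performance"]

# Rank by priority score, most urgent first.
ranked = sorted(employees, key=lambda e: e["priority"], reverse=True)
for e in ranked:
    print(e["id"], round(e["priority"], 3))
```

The product is deliberately simple: it pushes to the top of the list exactly the quadrant of the 2x2 matrix - good performance and high probability to leave - on which the lecture says to focus first.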
Now we have a ranking of the employees that we want to retain, the score being the combination of who is likely to leave and how much we want to retain them. Tomorrow, we can organize a face-to-face meeting with the first one - employee 928 - then the second one - employee 588 - and so on. This type of prioritization is really useful in practice. We could also have identified quick wins thanks to this approach: high-performing employees who were likely to leave very soon, but whom, now that we have identified them, we will retain. This type of analysis is also used a lot to prioritize marketing actions, such as retention, propensity-to-buy, or cross-sell and up-sell modeling. In such a context, the probability of interest would be the probability that a customer buys the product - which can be estimated with a model similar to what we've seen in the HR case. And the equivalent of the performance variable could be, for instance, the expected margin of this product. But using a classification technique as a prediction tool has a major flaw: we may know whether an event is likely to happen in the future, but we don't know exactly when. So let's now discuss techniques that allow us to predict the expected time of a certain event.