Let's now go back to the HR example we explored in the last module. If you remember, we wanted to see the relationship between a set of variables and an outcome, namely whether an employee left the company or not. To do so, we built a logistic regression model, which performed quite well on the in-sample data. Where we're going now is to use that model on the employees who are currently still with us, to estimate the probability that they will leave, so that we can take actions aimed at retaining them.

As always, let's set our current working directory to the folder where we have downloaded the original HR dataset that we explored in the last module, along with the new HR dataset with the additional observations. Once this is done, we run this line to clean up the memory of our current R session, and we are ready to load both datasets with the read.table function. You already know the arguments: header equal to TRUE, and the separator equal to a comma. So let's load the data_old dataset, which is the same one we used in the last module, and the data_new dataset, which contains only the employees who are currently still in the company.

Like in the previous video, if you do not remember the dataset from the last module, you should spend some time exploring it again with the str function and the summary function. Let's quickly go over the data_new dataset to get more familiar with it. There are a thousand observations in this dataset, and the six variables that you already had in the dataset we used last module, data_old: the satisfaction level, the last project evaluation, the number of projects worked on in the last 12 months, the average number of monthly hours, the time spent in the company, and whether or not the employee had a newborn within the last 12 months. If we now run the summary function on data_new, what do we find out?
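The setup steps described above can be sketched as follows. This is a minimal, self-contained illustration: the file name "HR_data_old.csv" and the column names other than LPE and left are assumptions (use whatever your course files actually contain), and we first write a tiny two-row comma-separated file so the read.table call has something to read.

```r
# setwd("path/to/your/folder")   # set the working directory to your data folder
rm(list = ls())                  # clean up the memory of the current R session

# Write a tiny stand-in file so this sketch is self-contained
# (column names are assumptions; LPE = last project evaluation)
writeLines(c("S,LPE,NP,ANH,TIC,Newborn,left",
             "0.38,0.53,2,157,3,0,1",
             "0.80,0.86,5,262,6,0,0"),
           "HR_data_old.csv")

# header = TRUE: first row holds variable names; sep = ",": comma-separated
data_old <- read.table("HR_data_old.csv", header = TRUE, sep = ",")

str(data_old)      # structure: variable names and types
summary(data_old)  # summary statistics for each variable
```

In practice you would point read.table at the real data_old and data_new files instead of the stand-in written here.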
First, we find out that about 20% of our employees had a newborn within the last 12 months, so that might be something we want to keep in mind when designing employee perks or working on employee well-being. Then we find out that on average our employees worked 100.2 hours per month, and that they have spent, on average, a little over three years in the company.

Let's now rebuild the logistic regression model that we built in the last module. You remember that in order to do so, we use the glm function. Our first argument is the dependent variable, which here is the left variable, then a tilde, and then a dot, which indicates that we want all the other variables to be used as independent variables. We then set the family argument equal to binomial with, in parentheses, logit. Finally, the data that we want to use to build the model and compute the coefficients is the original dataset, data_old.

We already assessed that this model performs well, so we can move on to using it to make predictions on out-of-sample data. Like in the previous example, we are going to use the predict function. Again, our first argument is the model, which we called logreg; the second is the new data that we want to apply the model to, which in our case is data_new; and the third is the type of predictions that we want to make, which we set equal to "response", in quotes.

Now that we have stored the output of the predict function in probaToLeave, let's call probaToLeave to see the output. What we see is that for each of our 1,000 observations, we get a probability. Let's now organize the output nicely in a data frame by using the data.frame function on probaToLeave, and store the result in predattrition. Let's look at the data frame by using the View function. We get the same information, but in a nice table which will be easier to work with. Let's go back to the script. At this point, we can use this information in a lot of different ways.
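Here is a self-contained sketch of the model-then-predict step. The real data comes from the course files; to keep this runnable we simulate small data_old and data_new frames with the same shape (LPE and left are named in the video; the other column names and the simulated relationship are assumptions).

```r
set.seed(1)
n <- 200
data_old <- data.frame(S   = runif(n),                       # satisfaction level
                       LPE = runif(n),                       # last project evaluation
                       NP  = sample(2:7, n, replace = TRUE), # number of projects
                       ANH = rnorm(n, 200, 30),              # avg monthly hours
                       TIC = sample(2:6, n, replace = TRUE), # time in company
                       Newborn = rbinom(n, 1, 0.2))
# simulate attrition: lower satisfaction makes leaving more likely
data_old$left <- rbinom(n, 1, plogis(2 - 5 * data_old$S))

# "current" employees: same variables, no left column
data_new <- data_old[1:50, names(data_old) != "left"]

# left ~ . : left is the dependent variable, all others are predictors
logreg <- glm(left ~ ., family = binomial(logit), data = data_old)

# type = "response" returns probabilities rather than log-odds
probaToLeave <- predict(logreg, newdata = data_new, type = "response")

predattrition <- data.frame(probaToLeave)
# View(predattrition)   # opens the spreadsheet-style viewer in RStudio
```

With the real data, the only change is that data_old and data_new come from read.table rather than from simulation.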
But let's focus, like we did in class, on using it to prioritize our retention efforts toward our best performers. The first thing we want to do is add a performance variable to our predattrition data frame. To do so, you can type predattrition dollar sign performance and set it equal to data_new dollar sign LPE, which is the last project evaluation and a fairly good measure of performance. Now let's look at the output with the View function. You can see that for the first observation, there is now both a probability and a performance.

Let's go back to our script. We can build a plot with the performance on the vertical axis and the attrition probability on the horizontal axis. This is what we get. If we want to make the output more visual, we can add vertical and horizontal lines with the abline function, like we did before, and maybe add some text labels using the text function. That is what I'm going to do with the code that I'm going to copy and paste here. You can explore the code on your own, but essentially, we can see very visually that these are the employees we want to retain: their likelihood to leave, that is, their probability to leave, is fairly high, and their performance level is also pretty high. So we definitely want to target our actions at these people.

Now, can you think of a way to prioritize among those employees more easily? As explained in class, we could establish a priority score by multiplying performance and the probability to leave for each employee. So we can add a priority column, like you see here, to our predattrition data frame, where the priority is the product, for each observation, of the employee's performance and their probability to leave. Let's run the line. Now, if we view the output with the View function, this is what we get.
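The plotting and scoring steps above can be sketched like this. It is self-contained: predattrition and data_new are tiny simulated stand-ins for the objects built earlier in the video, and the 0.5 cut-off lines are illustrative assumptions, not values from the course.

```r
set.seed(2)
data_new <- data.frame(LPE = runif(20))                # stand-in performance
predattrition <- data.frame(probaToLeave = runif(20))  # stand-in predict() output

# add the performance column from the new data
predattrition$performance <- data_new$LPE

# performance on the vertical axis, attrition probability on the horizontal
plot(predattrition$probaToLeave, predattrition$performance,
     xlab = "Probability to leave", ylab = "Performance")
abline(v = 0.5, col = "red")   # vertical reference line (assumed cut-off)
abline(h = 0.5, col = "red")   # horizontal reference line (assumed cut-off)
text(0.75, 0.75, "Retain these employees")

# priority score = performance x probability to leave
predattrition$priority <- predattrition$performance * predattrition$probaToLeave
```

The upper-right quadrant of the plot holds the high-performing employees who are also likely to leave, which is exactly the group the video singles out.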
For each observation we now have the probability to leave, the performance, and the priority score, which is the result of multiplying the probability to leave by the performance. Now let's go back to our script. To make our work a little easier, what we can do is sort the output in decreasing order. You can do that using the order function and setting the decreasing argument equal to TRUE. So what we're doing here is ordering from the employee with the highest priority score to the one with the lowest, because we're ordering by priority score. To do so, we create a new data frame called orderpredattrition and run the line we just explained. The orderpredattrition data frame contains all the data from predattrition, but organized in decreasing order of the priority scores, thanks to the call to the order function. Now let's check out the output with the View function called on the orderpredattrition data frame. We see that the employee with the highest priority is ranked first. Based on our resources, we can now decide which employees to target our efforts on, in this order. Maybe it will be the first two, maybe the first five, maybe the first twenty, depending on the company and the resources we have. But this gives us a really nice way to prioritize our efforts. So that's it for this example, and I will see you in the next video.
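The ordering step can be sketched as follows; predattrition here is a small three-row stand-in for the data frame built in the video.

```r
predattrition <- data.frame(probaToLeave = c(0.9, 0.2, 0.6),
                            performance  = c(0.8, 0.9, 0.5))
predattrition$priority <- predattrition$performance * predattrition$probaToLeave

# order() returns row indices; decreasing = TRUE puts the highest score first
orderpredattrition <- predattrition[order(predattrition$priority,
                                          decreasing = TRUE), ]
# View(orderpredattrition)
head(orderpredattrition)   # highest-priority employees on top
```

Note that order goes inside the row-index brackets of the data frame: it returns a permutation of row numbers, which is what actually reorders the rows.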