Welcome back for this third example. In this video, we're going to go over the predictive maintenance example that you explored in class. As always, we'll focus on learning how to perform a survival analysis in R in order to obtain the same results shown to you during the lectures and, most importantly, for you to be able to perform your own survival analysis on your own datasets. Bear in mind that our issue in this example is to estimate the remaining lifetime of our pieces, so we can better organize our maintenance efforts. All right, let's get started.
We set our working directory to the folder where we have downloaded the predictive maintenance dataset, and we then run this line in order to clean up the memory of our current R session. We're now ready to load our maintenance dataset and call it data. We load it with the read.table function, using the sep argument and the header argument.
Now, let's get familiar with our data by calling the str function. We find out that we have 1,000 observations in this dataset and seven variables: namely, the lifetime variable, which reports, in weeks, how long the piece has been in use until now; then the broken variable, which takes the value 1 if the piece is broken and 0 otherwise; then the pressure index, the moisture index, the temperature index, the team that is in charge of maintenance, and the provider that supplied the piece.
Let's now check out some summary stats on our dataset to understand the order of magnitude of our variables and get a better intuition for our data. The first thing that is of interest is the proportion of pieces that are broken. We see that we have 0.397, so about 40% of the pieces in this dataset are broken. Then, we may be interested to know that the mean pressure index is 98.6, the mean moisture index is 99.38, and the mean temperature index is 100.63. The last thing that may be of interest is to look at the breakdown of the pieces' maintenance allocation to teams.
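The loading and inspection steps described so far can be sketched roughly as follows. This is only an illustration: the course file is not reproduced here, so the code first writes a tiny stand-in file, and the separator (a semicolon), the column names, and all the values are assumptions rather than the real dataset.

```r
# Optional: clear the workspace before starting (uncomment to run).
# rm(list = ls())

# Build a tiny stand-in file so the example runs without the course dataset.
# Column names, separator, and values are assumptions, not the real data.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "lifetime;broken;pressureInd;moistureInd;temperatureInd;team;provider",
  "56;0;92.2;104.2;96.5;TeamA;Provider4",
  "81;1;72.1;103.1;87.3;TeamC;Provider4",
  "60;0;96.3;77.8;112.2;TeamA;Provider1",
  "86;1;94.4;108.5;72.0;TeamC;Provider2",
  "34;0;97.8;99.4;103.8;TeamB;Provider1"
), csv_path)

# sep gives the field separator; header = TRUE reads the first row as column names.
data <- read.table(csv_path, sep = ";", header = TRUE)

str(data)       # number of observations and variables, with their types
summary(data)   # order of magnitude of each variable

prop_broken <- mean(data$broken)   # share of pieces that are broken
team_counts <- table(data$team)    # number of pieces allocated to each team
prop_broken
team_counts
```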
336 pieces are allocated to team A, 356 to team B, and 308 to team C.
One thing that we've done before in order to predict the value of a numerical variable is to build a linear regression model. So let's try it out here. Let's call our model linregmodel, and we use the lm function. Our dependent variable is lifetime, then the dot indicates that we use all the other variables, and the minus broken indicates that we exclude the broken variable. We could instead add all those variables to the model after the tilde and separate each of them by a plus sign, and I want you to know that you can do that if you want to, but the dot is just more efficient. We then set data equal to data to use our data dataset. Let's run the line.
Now, let's check out the output of our model. It looks like the fact that team C was in charge of the maintenance of the piece is statistically significant, as well as the fact that the piece was provided by provider number three. Both of them seem to impact the lifetime negatively if we look at the sign of the coefficient. But as was explained in the lecture, using the output of a linear regression in this case is inaccurate. We cannot rely on the output of the linear regression model for a very simple reason. For 60% of the observations, those that are not broken, the lifetime is not the lifetime until the piece broke, but rather the lifetime until now. So the truth is, we don't really know what the final value of lifetime will be for pieces that are not broken yet. All we know is that they are still working as of now. As Professor Glady explained in the lecture, this is a right-censoring problem that we will tackle in this video by using a survival model.
In order to create such a model in R, we need to install the survival package, and in order to do so, we use the install.packages function, open parenthesis, and survival in quotes.
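To make the censoring issue concrete, here is a minimal sketch on simulated data, not the course dataset. The variable names, the simulated true lifetimes, and the ages at observation are all assumptions chosen for illustration. It shows the linear regression call described above, and why the recorded lifetime understates the true one for pieces that are still running.

```r
set.seed(1)
n <- 200

# Hypothetical covariates; names and distributions are assumptions.
sim <- data.frame(
  pressureInd = rnorm(n, 100, 20),
  team        = sample(c("TeamA", "TeamB", "TeamC"), n, replace = TRUE)
)

# True time to failure (unknown in practice) and the age at which we observe each piece.
true_life <- 80 + 0.1 * (sim$pressureInd - 100) -
  10 * (sim$team == "TeamC") + rnorm(n, 0, 10)
age_now <- runif(n, 40, 120)

sim$lifetime <- pmin(true_life, age_now)         # what the dataset records
sim$broken   <- as.integer(true_life <= age_now) # 1 = broken, 0 = still running (censored)

# "." means all other columns; "- broken" drops the broken indicator.
linregmodel <- lm(lifetime ~ . - broken, data = sim)
summary(linregmodel)

# For censored pieces the recorded lifetime is only a lower bound,
# so the average recorded lifetime is below the average true lifetime.
mean(sim$lifetime)
mean(true_life)

# One-time installation of the survival package (uncomment to run):
# install.packages("survival")
```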
We run the line, and it may take a few seconds. Now we need to load the survival package by using the library function, open parenthesis, and survival, without quotes this time. We run it, and the package is loaded. Please note that while you will not have to reinstall the survival package for your subsequent analyses, you will need to load the package every time you start a new session.
In any case, we're now ready to start our survival analysis. Our first step is to set our dependent variables, and we can do that by using the Surv function. In this case, we want to provide the lifetime and whether or not the piece is broken. We call the output dependentvars.
We're now ready to build a model that we will call survreg. To do so, we use the survreg function. We input our dependentvars, which is the output of the previous line, then a tilde, and then our independent variables, namely the pressure index, moisture index, temperature index, team, and provider. We then set the dist argument equal to gaussian, in quotes, because we assume that the lifetime follows a Gaussian distribution. As usual, we set data equal to data, which is our dataset. We then run the line.
At this point, we should assess our model, on out-of-sample data for example. We won't do it here because we checked it for you. But keep in mind that you should not use a model without being confident that it does well, or at least without knowing its limits.
Let's now check the output of the model. As we always do, let's check out the p-values to see which variables are statistically significant. We see that, first, the moisture index is statistically significant, and the temperature index as well. Whether the piece was maintained by team C, and whether it was provided by provider two, three, or four, are also statistically significant. Now, to assess whether the effect on the expected lifetime is positive or negative, we look at the sign of the coefficient. For the moisture index, it's positive. For the temperature index, it's negative.
For team C, it's negative. For provider two, it's positive; for provider three, it's negative; and for provider four, it's positive. When I say positive or negative, it is the effect on the expected lifetime that I'm talking about. So we see that if the piece was supplied by provider two or provider four, the expected lifetime is positively impacted. But in any case, it is information that is worth knowing and worth investigating.
Now that we have a working model, let's go back to our initial problem, which was to estimate the remaining lifetime of our pieces that are not currently broken. To do so, we're going to make predictions using the predict function, which will output the expected median lifetime of each individual piece. Our first argument is the model that we just built and that we're going to use to make our predictions. Our newdata argument is set equal to data, which is our dataset. We set the type argument equal to quantile, in quotes, and p, for percentile, equal to 0.5 in order to get the expected median lifetime. We store the output in Ebreak.
Now, to make an interesting report, we can organize the output of Ebreak in a data frame, with the data.frame function, and we call the output Forecast. We can then add a column with the lifetime for each observation by typing Forecast$lifetime = data$lifetime, and then a column indicating whether or not the piece is currently broken with Forecast$broken = data$broken. The last column, which we will call RemainingLT, for remaining lifetime, computes the remaining lifetime by subtracting the lifetime variable of the original dataset from Ebreak, which is the output of our predict function. Let's check out the output by typing View.
Now, the output is interesting, but we see that we still have the pieces that are broken, which we're not really interested in, since it's too late to do maintenance on those. The other thing is that our output is currently not prioritized.
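The model fit and the forecast table described above can be sketched as follows. This is a rough illustration on simulated data, not the course dataset: the column names, the simulated lifetimes, and the censoring ages are assumptions. The model object is named survregmodel here, rather than survreg as in the video, simply to avoid reusing the name of the function.

```r
library(survival)

set.seed(42)
n <- 300

# Synthetic stand-in for the maintenance data; column names are assumptions.
sim <- data.frame(
  pressureInd    = rnorm(n, 100, 20),
  moistureInd    = rnorm(n, 100, 10),
  temperatureInd = rnorm(n, 100, 20),
  team           = sample(c("TeamA", "TeamB", "TeamC"), n, replace = TRUE),
  provider       = sample(paste0("Provider", 1:4), n, replace = TRUE)
)
true_life <- 80 + 0.1 * (sim$moistureInd - 100) -
  5 * (sim$team == "TeamC") + rnorm(n, 0, 5)
age_now <- runif(n, 50, 110)
sim$lifetime <- pmin(true_life, age_now)          # recorded lifetime
sim$broken   <- as.integer(true_life <= age_now)  # 1 = broken, 0 = censored

# Dependent variables: the lifetime and the event indicator.
dependentvars <- Surv(sim$lifetime, sim$broken)

# Parametric survival regression assuming a Gaussian lifetime distribution.
survregmodel <- survreg(
  dependentvars ~ pressureInd + moistureInd + temperatureInd + team + provider,
  dist = "gaussian", data = sim
)
summary(survregmodel)  # p-values and signs of the coefficients

# Expected median lifetime of each piece (the 0.5 quantile).
Ebreak <- predict(survregmodel, newdata = sim, type = "quantile", p = 0.5)

# Report: predicted median, current lifetime, status, and remaining lifetime.
Forecast <- data.frame(Ebreak)
Forecast$lifetime    <- sim$lifetime
Forecast$broken      <- sim$broken
Forecast$RemainingLT <- Forecast$Ebreak - sim$lifetime
head(Forecast)
```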
We don't have the pieces that will break shortly, the ones we should focus on, at the top. What can we do to improve the output? First, we can reorder our Forecast data frame by increasing remaining lifetime, using the order function. Then, from Forecast, we can select only the pieces that are not broken yet, that is, those for which the broken variable is equal to zero. Now let's check out ActionsPriority, which is our final dataset. That's much better. We can now allocate our maintenance staff to the pieces most likely to break soon, and maybe avoid some breakdowns.
So that's it for the third example. I will see you in the next video.
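For reference, the final reordering and filtering step can be sketched like this. The small Forecast table below is made up by hand for illustration, so its values are assumptions and not model output.

```r
# Hypothetical forecast table; values are invented for illustration only.
Forecast <- data.frame(
  Ebreak   = c(90, 75, 60, 110, 82),  # predicted median lifetime
  lifetime = c(56, 81, 60, 86, 34),   # lifetime until now
  broken   = c(0, 1, 0, 1, 0)         # 1 = already broken
)
Forecast$RemainingLT <- Forecast$Ebreak - Forecast$lifetime

# Sort by increasing remaining lifetime, so the most urgent pieces come first.
Forecast <- Forecast[order(Forecast$RemainingLT), ]

# Keep only the pieces that are not broken yet.
ActionsPriority <- Forecast[Forecast$broken == 0, ]
ActionsPriority
```

The order of the two steps does not matter for the result; filtering first and sorting afterwards would give the same table.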