Now we are ready to run our LASSO regression analysis with the LAR algorithm using the LASSO LarsCV function from the sklearn linear model library. LASSO Regression has a couple of different model selection algorithms. In this example, I will use the LAR Algorithm, which stands for Least Angle Regression. This algorithm starts with no predictors in the model and adds a predictor at each step. It first adds a predictor that is most correlated with the response variable and moves it towards least score estimate until there is another predictor. That is equally correlated with the model residual. It adds this predictor to the model and starts the least square estimation process over again with both variables. The LAR algorithm continues with this process until it has tested all the predictors. Parameter estimates at any step are shrunk and predictors with coefficients that have shrunk to zero are removed from the model and the process starts all over again. So I create an object called model that will contain the results of the Lasso regression analysis, then I type the name of the function, LassoLarsCV. In parentheses, I add cv = 10, which asks Python to use k-fold cross-validation with ten random folds from the training data set to choose the final statistical model. The option precompute=False tells Python not to use a precomputed matrix. This option would be helpful with very large models, because the precomputed matrix can speed up calculations. However, the model I'm going to test is not very large. To fit the Lasso regression on the training set after the parenthesis, I add .fit, and in another set of parenthesis the names of the training predictor and response variable data sets. So what I am doing here is using k-fold cross-validation in which the first fold of the training data set is treated as a validation set, and the model is estimated using the remaining nine folds. At each step of the estimation process, when a new predictor is entered into the model, the mean-square error for the validation fold is calculated for each of the other nine folds and then averaged. The model that produces the lowest mean-square error is selected by Python as the best model to validate using the test dataset. Let's go ahead and run the code and take a look at the results. The first thing we can do is tell Python to print the variables, and the regression coefficients from the model that was retained by the model selection process as the best fitting model. The dict object creates a dictionary, and the zip object creates lists. So here I typed the dict object And in parenthesis, I add the zip object, and then in the next set of parenthesis I name the two lists that I want printed. In this case, I want to print the variable labels and the associated regression coefficients. I do this by first typing the name of the data frame that has the predictor variables, which I named predictors earlier, and then .com, which will provide the variable labels. Then I add a comma and the name of the attribute from the LASSO model results object that I named model.coef_, which is the name of the regression co-efficient attribute. We can run the code and take a look at the output. Predictors with regression coefficients that do not have a value of zero are included in the selected model. Predictors with regression coefficients equal to zero means that the coefficients for those variables had shrunk to zero after applying the LASSO regression penalty, and were subsequently removed from the model. So the results show that of the 23 variables, 18 were selected in the final model. If you remember, we standardized the values of our variables to be on the same scale. So we can also use the size of the regression coefficients to tell us which predictors are the strongest predictors of school connectedness. For example, self-esteem and depression had the largest regression coefficients, and were most strongly associated with school connectedness, followed by black ethnicity and GPA. Depression and black ethnicity were negatively associated with school connectiveness, and self-esteem and GPA were positively associated with school connectiveness. We can also create some plots so we can visualize some of the results. For example, we can plot the progression of the regression coefficients through the model selection process. In Python, we do this by plotting the change in the regression coefficient by values of the penalty parameter at each step of the selection process. It's important to note that the sklearn library refers to the penalty parameter is alpha although the more conventional term is lambda. We can use the following code to generate this plot. For creating the plot, I will apply a negative log10 transformation to the alpha values. Simply to make the values easier to read by creating an object m_log_alphas that is equal to the negative of the -np.log10 transformation function applied to the alphas_ attribute from the model results object. The alphas_ attribute contains the values of alpha through the model selection process. The first line of code for the plot, sets up the axes, the second line of code asks Python to use the plot function from the mat plot lib library which we imported as plt to plot the transform values of alpha on the horizontal access. And the change in the regression coefficients in the coef_path_ attribute, from the model results object and the y axis. The .T asks python to transpose the coef_path_ attribute matrix to match the first dimension of the array of alpha values. I will use the plt.axlvline function to put a dashed vertical line at the -np.log10 transformed alpha value for the selected model. The color equals='k' in quotes tells Python to make the line color black. Finally, I add titles for the two axes and the plot title and run the code. This plot shows the relative importance of the predictor selected at any step of the selection process, how the reggression coefficients changed with the addition of a new predictor at each step, as well as the steps at which each variable entered the model. As we already know from looking at the list of the regression coefficients self esteem, the dark blue line, had the largest regression coefficient. It was therefore entered into the model first, followed by depression, the black line, at step two. In black ethnicity, the light blue line, at step three and so on. Another important plot is one that shows the change in the mean square error for the change in the penalty parameter alpha at each step in the selection process. This code is similar to the code for the previous plot except this time we're plotting the alpha values through the model selection process for each cross validation fold on the horizontal axis, and the mean square error for each cross validation fold on the vertical axis. This is done in the first plt.plot function. Where m_log_alphascv, is a negative log10 transformation applied to the alphascv_ attribute for each validation fold, and a cv_mse_path_ is the model results attribute containing the mean square error for each cross validation fold. The colon in quotes here tells Python to plot the folds as dotted lines. In the next line of code, I'm asking Python to plot the average mean squared error across all cross-validation folds, and to plot it as a slightly thicker line with equals to black line. I'll use the plt.axvline function to plot a dashed black vertical line at the -np.log10 transformed alpha value for the selected model. Finally, I ask Python to print a legend and add titles for the two axes as well as the plot title and I run the code. We can see that there is variability across the individual cross-validation folds in the training data set, but the change in the mean square error as variables are added to the model follows the same pattern for each fold. Initially it decreases rapidly and then levels off to a point at which adding more predictors doesn't lead to much reduction in the mean square error. This is to be expected as model complexity increases. We can also print the average mean square error in the r square for the proportion of variance in school connectedness. That is explained by the selected model in the training data set and in the test set when the selected model's applied to the test data using the following code. Here we need to import the mean squared error function from the sklearn metrics library to compute the mean square error. We create an object called train_error which is equal to the mean_squared_error calculation function, then in parentheses, the training data set response variable, tar_train, then a comma, and then model.predict(pred_train). This code tells Python to use the results from the model object to predict the response variable for observations in the training data set. We then do the same thing for the test data by using the results from the training set model to calculate the test data mean square error. We use the model.score attribute, which includes the predicted response values for the training, and test data sets for calculating the r square for each set. We then use the print function to print them. As expected, the selected model was less accurate in predicting school connectedness in the test data, but the test mean square error was pretty close to the training mean square error. This suggests that prediction accuracy was pretty stable across the two data sets. The R-square values were 0.33 and 0.31, indicating that the selected model explained 33 and 31% of the variance in school connectedness for the training and test sets, respectively. If we go back to our graph from the bias variance trade-off video that shows what happens to prediction error as a model becomes more complex by adding more predictors. We can see that prediction error decreases as more variables are added to the model, and consequently bias is lower. However, we can see from the results of our example, that the reduction in mean squares error became negligible. If we'd had even more predictors in our example to predict school connectiveness. We would likely see something similar to the graph's test curve, showing an increase in both bias and variance. The model that is selected as the best model, is the one that falls somewhere in here. It is a point where bias and variance and the test prediction error is lowest. If a model with fewer predictors is chosen, then the model is at risk of being under-chosen. If a model of more predictors is chosen, then the model is at risk of being over-fitted.