Is major depression associated with smoking quantity among current young adult smokers? Or, in hypothesis-testing terms, are the mean numbers of cigarettes smoked per month equal or not equal for individuals with and without major depression? The explanatory variable here is categorical with two levels: the presence or absence of major depression. The response variable, smoking quantity, measured by the number of cigarettes smoked per month, ranges from 1 to 2,940.

Returning to the Python program we began to build in our data management and visualization course, let's quickly review the pieces of the program for the NESARC data. First, we include statements to import the needed Python libraries. We will again import pandas and numpy, but in order to calculate the ANOVA F-statistic and its corresponding p-value, we also need to import statsmodels.formula.api, which will allow us to fit statistical models. Because the name of this library is particularly long, I'm going to import it as smf, so that I can use this shorter name when referring to the library later in my program.

Next, I load my dataset and convert my variables of interest to numeric. I also create a new data frame representing the subset of observations I would like to include in my analysis: young adult smokers who have smoked in the past 12 months. After setting appropriate values to missing, that is, NaN, I conduct additional data management needed for a secondary variable by multiplying smoking frequency and smoking quantity. This creates a new quantitative variable that measures the number of cigarettes smoked in the past month. To calculate the analysis of variance F-statistic and associated p-value, we're going to use the Ordinary Least Squares, or OLS, function. This function is part of the statsmodels.formula.api package.
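The data-management steps just described can be sketched as follows. This is a minimal illustration using a small synthetic data frame in place of the NESARC file (the real program would read the NESARC data with pandas.read_csv); the column names AGE, CHECK321, SMOKEFREQ, and SMOKEQUANT are placeholders standing in for the survey's age, past-12-month smoking, frequency, and quantity items.

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the NESARC dataset
data = pd.DataFrame({
    'AGE':        [19, 22, 24, 30, 21, 25],
    'CHECK321':   [1, 1, 1, 1, 2, 1],        # smoked in past 12 months (1 = yes)
    'SMOKEFREQ':  [30, 30, 14, 30, 30, 99],  # days smoked per month (99 = unknown)
    'SMOKEQUANT': [10, 20, 5, 15, 10, 3],    # cigarettes smoked per day
})

# convert variables of interest to numeric
for col in ['AGE', 'CHECK321', 'SMOKEFREQ', 'SMOKEQUANT']:
    data[col] = pd.to_numeric(data[col], errors='coerce')

# subset: young adult smokers (ages 18-25) who smoked in the past 12 months
sub1 = data[(data['AGE'] >= 18) & (data['AGE'] <= 25) &
            (data['CHECK321'] == 1)].copy()

# set the 'unknown' code to missing (NaN)
sub1['SMOKEFREQ'] = sub1['SMOKEFREQ'].replace(99, np.nan)

# secondary variable: estimated number of cigarettes smoked in the past month
sub1['NUMCIGMO_EST'] = sub1['SMOKEFREQ'] * sub1['SMOKEQUANT']
print(sub1[['NUMCIGMO_EST']])
```

Rows outside the 18-25 age range or without past-12-month smoking drop out of sub1, and the row with the unknown frequency code propagates NaN into NUMCIGMO_EST, so it is automatically excluded from the analysis later.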
OLS is a powerful modeling approach that we will learn to use flexibly, and in a multivariate context, in our follow-up course, Regression Modeling in Practice. For now, though, we will use the modeling approach in the context of ANOVA. Specifically, here is the code we would use to test differences in the mean number of cigarettes smoked in the last month, NUMCIGMO_EST, as our response variable, for young adult smokers with and without major depression, major depression being our explanatory variable.

First, I name my model; here I am calling it model1. Next, I include the equals sign and the ols function from the statsmodels.formula.api library, which I have imported as smf. Within parentheses, I then write my formula, including the name of my quantitative response variable, NUMCIGMO_EST, followed by a tilde and then the name of my categorical explanatory variable, MAJORDEPLIFE. I also need to indicate to Python that this is a categorical variable by adding a capital C and putting the variable name within parentheses. Notice that the word formula is followed by an equals sign, and that both variables in my model, separated by a tilde, are included within quotation marks. Also within the parentheses, I include a comma and then the name of the data frame where the variables are located. Here I use the sub1 data frame, which includes 18-to-25-year-olds who have smoked in the past 12 months from the NESARC sample.

This statement is followed by a request for fit statistics for the model I have just defined, that is, model1, and here I am calling the object that holds the calculations for these model statistics results1. And, as always, I need to explicitly ask Python to print these results; here, I'm requesting results1 with the summary function. So now we're ready to run the program and take a look at the output. As you can see, our OLS output has generated a number of model estimates, including the F-statistic and associated p-value.
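The modeling steps just described look like this. The variable names NUMCIGMO_EST and MAJORDEPLIFE follow the transcript; the sub1 data frame here is a small synthetic stand-in for the NESARC subset, so the particular F-statistic it produces is illustrative only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf  # imported as smf for brevity

# synthetic stand-in for the sub1 analysis data frame
rng = np.random.default_rng(0)
sub1 = pd.DataFrame({
    'MAJORDEPLIFE': [0] * 100 + [1] * 100,
    'NUMCIGMO_EST': np.concatenate([rng.normal(300, 50, 100),
                                    rng.normal(340, 50, 100)]),
})

# response ~ C(categorical explanatory variable); C() marks MAJORDEPLIFE
# as categorical, and the whole formula sits inside quotation marks
model1 = smf.ols(formula='NUMCIGMO_EST ~ C(MAJORDEPLIFE)', data=sub1)

# request fit statistics for the model and print them
results1 = model1.fit()
print(results1.summary())  # includes the F-statistic and its p-value
```

The F-statistic and p-value reported in the summary table are also available programmatically as results1.fvalue and results1.f_pvalue.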
Our calculated F-statistic is 3.55, and the p, or probability, value for this F-statistic is .0597, just over our p = .05 cutoff. To interpret this finding fully, we need to examine the actual means for the number of cigarettes smoked in the past month for those young adult smokers with and without major depression. To do this, I will add syntax that creates a data frame with only the variables included in my model, MAJORDEPLIFE and NUMCIGMO_EST. I'll call this data frame sub2. I also include the dropna function, so that the data frame only includes observations with valid data for both my explanatory and response variables, because it was these observations that were included in the OLS analysis. Then I include the groupby function to request means and standard deviations for the variables in my new data frame, grouped by MAJORDEPLIFE, that is, those values indicating individuals with or without major depression.

So here I see that young adult smokers without major depression, as indicated by a value of zero, smoke an average of 312.8 cigarettes per month, and that those with major depression, indicated by a value of one, smoke on average 341.4 cigarettes per month. Because the p-value is greater than 0.05, actually near 0.06, we can choose to accept the null hypothesis and say that these means are statistically equal, and that there is no association between the presence or absence of major depression and the number of cigarettes smoked per month among young adult smokers. If I had chosen to reject the null hypothesis, I would be wrong nearly six out of a hundred times, and again, by normal scientific standards, this is not adequate certainty to reject the null hypothesis and say that there is an association. Instead, we're going to accept the null hypothesis and say that there is no association. Had the p-value been less than .05, I would be more confident that there was a significant association. To interpret a significant association, I would look at the means table.
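The follow-up syntax for examining group means can be sketched like this, again using a small synthetic stand-in for the sub1 analysis data frame (so the means printed here will not match the 312.8 and 341.4 from the NESARC output).

```python
import pandas as pd

# synthetic stand-in for the sub1 analysis data frame, including one
# observation with missing smoking-quantity data
sub1 = pd.DataFrame({
    'MAJORDEPLIFE': [0, 0, 1, 1, 0, 1],
    'NUMCIGMO_EST': [300, 320, 340, 350, None, 330],
})

# keep only the model variables, dropping rows with any missing values,
# so the summaries describe the same observations used in the OLS fit
sub2 = sub1[['MAJORDEPLIFE', 'NUMCIGMO_EST']].dropna()

print('means')
print(sub2.groupby('MAJORDEPLIFE').mean())
print('standard deviations')
print(sub2.groupby('MAJORDEPLIFE').std())
```

The groupby output is indexed by the values of MAJORDEPLIFE, so the row labeled 0 describes young adult smokers without major depression and the row labeled 1 describes those with it.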
Had p been less than .05, I could see from the means table that individuals with major depression smoke more than individuals without, and with a significant p-value I could have said that young adult smokers with major depression smoke significantly more cigarettes per month than young adult smokers without major depression.

Note that most often you will also see a warning message regarding standard errors below the OLS table, indicating that the standard errors assume that the covariance matrix of the errors is correctly specified. This tells us that the standard error estimates are valid as long as the underlying assumptions about the errors in the OLS regression are met. We'll cover the topic of assumptions associated with OLS models in the Regression Modeling in Practice course, but for now you can be confident that the F-statistic is a robust test that provides valid inferences under a wide range of conditions. So, we've shown you the ropes in terms of a categorical variable that has two levels, as it did here with major depression. For this interpretation, all we need to know is the p-value and the means for each of the two groups.