Welcome to Fundamentals of Quantitative Modeling, Regression Models. In this module we're going to talk about regression models. We'll define what a regression model is and discuss the questions it is able to answer for us. We're going to talk about correlation and linear association, because these regression models are examples of linear models. We're going to discuss the mechanics of fitting a line to data. We're going to spend some time interpreting the output from these regression models, and we're going to talk about prediction, in particular prediction intervals, from a regression model. We will briefly see the topic known as multiple regression, which allows us to create potentially more complicated and realistic models of a business process. We'll end by talking about logistic regression, which is an appropriate form of regression to use when the outcome variable is categorical, typically a 0/1 outcome, a Bernoulli random variable.
So what is a regression model? Well, a simple regression model uses a single predictor variable, which we often give the letter X, to estimate the mean, or the average, of an outcome variable Y as some function of X. The example that I have here plots the price against the weight of a set of diamonds. Each point in the plot represents a diamond. The X coordinate is the weight of the diamond, and the Y coordinate is the price of the diamond. There's a single predictor, so that's why we say simple regression. And what a regression model will do is say, at any given value of X, what do you expect Y, the price, to be? So that's the idea of a regression model. It's a model for the mean of Y as a function of X, and very frequently we will use a linear model to capture the relationship. So continuing with the diamonds example, the predictor variable is the diamond's weight and the outcome variable is the price of the diamond.
Now we can see just by looking at the plot that heavier stones, bigger diamonds, tend to cost more money. We would often term that positive association. But we can go beyond that simple statement by using a regression model, which will formalize the idea of the association and more precisely define what value we expect the price to be for any given weight of a diamond. So we're going to formalize how the expected price varies with weight. And as I just said, one of our most frequently used ways of capturing that relationship is with a straight line, and we'll then call it a linear regression.
The formula that you can see at the bottom of this slide is how I would write a regression model. It says the expected value, that's the average, of Y, and then the vertical line there means given. We articulate that as given: the expected value of Y given X. The expected price of a diamond given its weight is then equal to some function of X. And the most straightforward function that we might choose to use is a linear function, and we write the linear function in this instance as b0 + b1 times X. You will sometimes have seen the equation of a straight line written as Y = mx + b. This is still a straight line, but we typically have a slightly different notation in regression models, and there's a reason for that. The reason is that there's a form of regression called multiple regression, which has many Xs in it, and there we can use a notation that incorporates b0, b1, b2, b3, etc. So we subscript the coefficients. b0 is still the intercept and b1 is still the slope. So a regression model is relating the average of Y to a particular value of X, and it's not at all uncommon to assert that that association is at least approximately linear, and in that case we're doing a linear regression. On this slide I have overlaid the straight line model that is calculated from the underlying data.
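Written out, the notation described above looks like this. The second line is only a sketch of the general multiple regression pattern; the number of predictors k is a symbol introduced here for illustration and is not stated in the lecture.

```latex
\[
E(Y \mid X) = b_0 + b_1 X
\]
\[
E(Y \mid X_1, \dots, X_k) = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k
\]
```

Compared with Y = mx + b, the intercept b plays the role of b_0 and the slope m plays the role of b_1.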
I haven't told you how this line is calculated yet. I will in a few minutes. But there's the regression line. And the slope and intercept in this particular instance are presented in the formula below. The expected value of the price of a diamond given its weight is equal to -260, that's the intercept, plus 3721 times the weight, where weight is measured in carats. So that's what a linear regression is going to do for you. It's basically going to put a line through the data, and once you've got a line going through the data, there are a number of useful things that you're going to be able to do with it. So there's a quantitative model that has been derived from underlying data. We let the data talk to us, in the sense that the data chose the best fitting line.
Now there's a very commonly used number to describe the strength of what we term linear association. Essentially, how close are the points to a line? The way that we capture that is through a concept called correlation. Correlation is a measure of the strength of a linear association. And correlation is typically given a letter: we call r the sample correlation. It's a fact that the correlation will always lie between minus one and plus one. If you have a negative value of the correlation, then you have negative association. That would be a line from the top left to the bottom right. If you have positive correlation, you've got positive association, and that would be a line from the bottom left to the top right. But if you have zero correlation, what that means is that there is no linear association. It doesn't actually mean there is no association between the two variables. It just says there's no linear association between the two variables. Now how would you calculate the correlation in practice? The answer is with a computer program or a spreadsheet.
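As a small aside on the fitted diamonds line, here is a minimal Python sketch that evaluates it at a given weight. The intercept -260 and slope 3721 are the values quoted from the slide; the function name `expected_price` and the example weight of 0.25 carats are assumptions made up for illustration, not values taken from the lecture's data.

```python
# Coefficients of the fitted line as quoted in the lecture.
INTERCEPT = -260.0  # b0
SLOPE = 3721.0      # b1, change in expected price per carat


def expected_price(weight_carats: float) -> float:
    """Return E(price | weight) under the line b0 + b1 * weight."""
    return INTERCEPT + SLOPE * weight_carats


if __name__ == "__main__":
    weight = 0.25  # hypothetical weight in carats, for illustration only
    print(f"Expected price at {weight} carats: {expected_price(weight):.2f}")
```

The slope is the part to read carefully: under this line, each additional carat raises the expected price by 3721, so two diamonds that differ by 0.1 carats differ in expected price by about 372.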
So we won't worry about the details of the actual calculation. I have calculated the correlation for the diamonds dataset, and it turns out to be 0.989. That is an incredibly strong correlation as far as correlations go, and it's just asserting the fact that the points really do lie very close to a straight line. So a linear model is quite reasonable in this particular instance.
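The lecture leaves the calculation to software, but for readers who want to see roughly what that software does, here is a minimal Python sketch of the standard Pearson sample correlation. The function name `sample_correlation` and the three tiny datasets are made up for illustration; they are not the diamonds data and will not reproduce the 0.989 figure.

```python
import math


def sample_correlation(xs, ys):
    """Pearson sample correlation r between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    xs = [-2, -1, 0, 1, 2]              # hypothetical X values
    rising = [2 * x + 1 for x in xs]    # points exactly on an upward line
    falling = [-3 * x + 4 for x in xs]  # points exactly on a downward line
    curved = [x ** 2 for x in xs]       # Y fully determined by X, but not linearly

    print(f"rising:  r = {sample_correlation(xs, rising):.3f}")
    print(f"falling: r = {sample_correlation(xs, falling):.3f}")
    print(f"curved:  r = {sample_correlation(xs, curved):.3f}")
```

The curved case is the one worth noticing. Y is completely determined by X there, yet r comes out as zero, which is the earlier point that zero correlation means no linear association rather than no association at all.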