In this video, you'll learn about some potential problems with regression that you should look out for when analyzing data. If these problems occur, correlation and regression results might be heavily distorted. We'll discuss nonlinearity, problematic outliers, erroneously inferring causation from correlation, inappropriate extrapolation, the ecological fallacy, and restriction of range.

The first thing to always keep in mind is that correlation and simple linear regression capture linear association. To show what can go wrong if you apply a linear model to nonlinear data, Anscombe created different data sets that show linear and nonlinear patterns, but that all result in the same means, standard deviations, and correlation for x and y. In all cases, Pearson's r is 0.8 and r squared is 0.64. The first scatter plot shows the linear pattern you would hope to find. In the second plot there's an obvious nonlinear, curved pattern. Remember, these data produce the same Pearson's r and r squared as the first data set. Without looking at the plot, we might conclude the relation is strong and that the linear model fits well. But obviously a curve would describe these data much better. There are ways to model this type of pattern, even using multiple linear regression, but we won't go into these methods here. For now, you should realize that fitting a simple linear model to these data is not the optimal choice.

Two other data sets show how outliers can have an unwanted influence. In the third data set you can see that the outlier, the deviant data point, messes up a perfectly linear relation. It changes the line's location and slope and lowers our potential r squared. In the fourth data set, you can see the exact opposite. If we discount the outlier, there's no variation in x, so the correlation is 0. But by adding the outlier, Pearson's r and r squared suddenly become quite high. Obviously, this regression line strongly misrepresents the actual relation between x and y. Outliers can have an unwanted influence if they're extreme on one or both variables, and if they deviate strongly from the regression line, that is, if they are regression outliers. Outliers are more problematic in small samples, where the influence of each case is relatively strong.

Another problem is erroneously inferring causation from correlation. For example, just because regular intake of vitamin supplements is related to greater health, this doesn't mean we can infer that use of supplements improves health. Perhaps healthier people proactively pursue a healthy lifestyle and take supplements more often because they think it's healthy. Just because we can use regression and use vitamin intake to predict health doesn't make the relation causal. We could also use health to predict vitamin intake. The only way to tell if the relation is causal is to perform a truly randomized experiment.

Inappropriate extrapolation is another problem in regression. Take the example where we predicted the popularity of cat videos using the cat's age. As the age of the cat increases, video popularity goes down. But our sample only covers cat ages from 3 months (0.25 years) to 2.5 years. We can't extrapolate, that is, extend the regression line, endlessly beyond this range. For example, it wouldn't make sense to predict video popularity for a 100-year-old cat, simply because cats don't reach that age. It's also possible that between the ages of 5 and 25, for example, the decline in popularity levels off nonlinearly. So we should be careful when extrapolating beyond the sampled range.
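As an aside (not part of the video), here's a minimal Python sketch of the "same statistics, different patterns" point, using the classic published quartet from Anscombe (1973). The published data give r of about 0.82 rather than the 0.8 of the video's constructed data sets, but the lesson is identical: the four sets share nearly the same summary statistics and regression line despite looking completely different in a scatter plot.

```python
# Anscombe's (1973) quartet: four data sets with near-identical means,
# standard deviations, correlations, and regression lines.
import numpy as np

x_common = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
quartet = {
    "I   (linear)":            (x_common, np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])),
    "II  (nonlinear curve)":   (x_common, np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])),
    "III (regression outlier)":(x_common, np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])),
    "IV  (outlier creates r)": (np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float),
                                np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])),
}

for name, (x, y) in quartet.items():
    r = np.corrcoef(x, y)[0, 1]              # Pearson's r
    slope, intercept = np.polyfit(x, y, 1)   # least-squares regression line
    print(f"{name}  mean_x={x.mean():.2f}  sd_x={x.std(ddof=1):.2f}  "
          f"r={r:.3f}  r2={r*r:.3f}  line: y = {intercept:.2f} + {slope:.2f}x")
```

All four sets print essentially the same numbers, which is exactly why you can't skip looking at the scatter plot.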
The ecological fallacy refers to drawing inappropriate conclusions about individual cases when correlation or regression is based on aggregates of these cases. For example, if we have a lot of data on the relation between vitamin intake and health from different countries, we can aggregate over countries. Aggregation eliminates individual variability and tightens the data points around the best-fitting line. As you can see, aggregation can heavily affect the correlation, and the same goes for regression. As long as the relation is the same at the individual and country level, the location of the regression line will not change much. R squared will be higher, however, resulting in an overestimation of the fit of the model. Of course, aggregation is okay, as long as we don't use results based on the aggregated data to draw conclusions at the level of individuals.

The final potential problem I want to mention is restriction of range. Restriction of range means that our sample contains a limited range of predictor values. In our sample, cat ages vary between three months and two and a half years. The average life span of an indoor cat is about 15 years, so our range is restricted: we're missing values between 2.5 and 15 years. The correlation in our sample is 0.7. If we had collected data in the missing range, assuming the linear relation continues to hold, the scatter plot would look like a cloud of data points that is more ellipse-shaped than in the limited sample. Restriction of range can seriously lower Pearson's r and r squared. The location of the regression line is less affected, however.

As you can see from these examples, it's important to always look at the scatter plot of the data to check for nonlinearity and outliers, to consider how representative the range of the data is, and to consider what kind of inferences you're making.
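To make the last two pitfalls concrete, here is a small simulation sketch (my own illustration, not the video's data; the slopes, noise levels, and sample sizes are arbitrary assumptions, so the printed r values will differ from the video's 0.7).

```python
import numpy as np

rng = np.random.default_rng(0)

# Restriction of range: one linear age -> popularity relation over the
# full 0.25-15 year range, correlated once on all cases and once on only
# the restricted 0.25-2.5 year range sampled in the cat video example.
age = rng.uniform(0.25, 15, size=2000)
popularity = 100 - 4 * age + rng.normal(0, 12, size=age.size)
restricted = age <= 2.5
r_full = np.corrcoef(age, popularity)[0, 1]
r_restr = np.corrcoef(age[restricted], popularity[restricted])[0, 1]
print(f"full-range r = {r_full:.2f}, restricted-range r = {r_restr:.2f}")
# The restricted r is much closer to 0, even though the underlying
# relation (and the regression line) is the same in both ranges.

# Ecological fallacy: individuals nested in 25 'countries', with the
# same intake -> health slope everywhere. Averaging within countries
# removes individual noise and inflates r and r squared.
xs, ys = [], []
for _ in range(25):
    shift = rng.normal(0, 3)                     # country-level shift in intake
    xi = rng.normal(50 + shift, 10, size=200)    # individual vitamin intake
    yi = 0.5 * xi + rng.normal(0, 15, size=200)  # individual health score
    xs.append(xi)
    ys.append(yi)
x_ind, y_ind = np.concatenate(xs), np.concatenate(ys)
x_agg = np.array([xi.mean() for xi in xs])       # country means
y_agg = np.array([yi.mean() for yi in ys])
r_ind = np.corrcoef(x_ind, y_ind)[0, 1]
r_agg = np.corrcoef(x_agg, y_agg)[0, 1]
print(f"individual-level r = {r_ind:.2f}, country-level r = {r_agg:.2f}")
# The country-level r is far higher; concluding that the relation is
# that strong for individuals would be the ecological fallacy.
```

In both demonstrations the regression slope barely moves; it's the correlation and r squared that are distorted, which matches the points made in the video.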