Hi. In this module, I'm going to introduce you to correlation and regression. These are basic statistical techniques; you can take semester-long or even year-long courses introducing the basics as well as various extensions. All I can do here is give you a taste of what they are and some sense of how to interpret the results.

So correlation and regression are typically used to study relationships among continuous variables. When we looked at tabulation, we used it to study relationships among what we referred to as categorical variables. For example, we looked at different categories of year and different categories of educational attainment, and then at how the population was sorted into the categories of education according to year. Tabulation works well for a wide range of variables that are inherently categorical, for example race and ethnicity. These are things for which there is no quantitative measure; you have to divide people into groups and then tabulate.

There are advanced regression-related methods as well, but we don't have time to get into them now. Continuous variables are variables that we can measure as a quantity: things like income, height, and weight. Correlation and regression are used to study relationships among these sorts of variables, things that we want to preserve in their original form as continuous variables, even though, as we saw in the previous example, we can sometimes turn them into categories.

Â 2:08

Basically, correlation and regression measure the strength of the linear relationship between two variables. Linear refers to the idea that a one-unit change in one variable is associated with a fixed change in the other variable. So, for example, if we regress weight on height, a linear regression will tell us the average increase in weight in a population associated with a one-inch increase in height.

By themselves, and this is extremely important, correlation and regression have nothing to say about cause and effect. Saying anything about cause and effect, as we learned in previous lectures, really requires a proper experimental design: one that includes exogenous variation in the x variable that we are sure is not related to other variables that are also influencing our outcome, or y, variable.

We often want to assess whether a unit change in one variable, whether height or education measured in years (five years, six years, seven years, eight years of education), is systematically associated with changes in some outcome, or y, variable. So here we have a hypothetical pair of variables: a y variable that we're trying to explain, and an x variable that we think might be driving or influencing y. What a regression does is measure, on average, the change in y when you increase x by one, as shown in the blue figure here. That's all regression is: an attempt to estimate the linear association between changes in x and changes in y.

So this is the same figure, blown up to make it a little clearer, so you can see again that we're trying to estimate what sort of impact, or association, a one-unit change in x has on y. That's what we're trying to measure.

The average change, in this case 1.3 (every unit increase in x is associated with a 1.3-unit increase in y), is what we refer to as the regression coefficient. That is what comes out of our estimation of a regression model.
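To make this concrete, here is a minimal sketch in Python of estimating a regression coefficient, using the textbook least-squares formula (covariance of x and y divided by the variance of x). The data are simulated for illustration, with a true slope of 1.3 to echo the example above.

```python
import numpy as np

# Simulated data: the true slope is 1.3, echoing the example above.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 1.3 * x + rng.normal(0, 1.0, size=200)

# Least-squares regression coefficient: cov(x, y) / var(x).
slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()

print(round(slope, 2))  # close to the true value of 1.3
```

The estimated slope won't be exactly 1.3 because of the noise we added, but with 200 observations it lands very close.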

I want to talk about correlation, which is actually just a special case of linear regression. Correlation measures the number of standard deviations of change in some outcome variable y associated with a one standard deviation change in x. In the case of correlation, what we choose as y and what we choose as x is arbitrary: we can reverse them and still get the same correlation. So let's think about two variables; we'll just call them x and y, but we could flip them and end up with the same answer. Here x has a mean of 4.4 and y has a mean of 10.77; the standard deviation of x is 3.16 and the standard deviation of y is 4.75. As you may remember, standard deviation is a measure of the amount of spread, or variation, within a distribution. I can't get into too much detail right now; this is basic statistics, and you may want to go back and check your textbooks if you need a refresher.
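As a quick refresher, here is a sketch of computing a mean and a sample standard deviation with NumPy. The values are invented, although the mean happens to come out at 4.4 like the x variable above.

```python
import numpy as np

# Invented data; the mean happens to come out at 4.4.
x = np.array([1.0, 2.0, 4.0, 5.0, 10.0])

mean_x = x.mean()
sd_x = x.std(ddof=1)  # sample standard deviation (divide by n - 1)

print(mean_x)          # 4.4
print(round(sd_x, 2))  # 3.51
```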

Now, if we think about the change in y associated with a change in x, correlation is the change in y that occurs when we move x by an entire standard deviation. So here, if we move x by 3.16, the bar that just turned solid, running between the horizontal solid blue bar and the red regression line, shows the amount of change in y associated with a one standard deviation change in x.

So what we really want for the correlation coefficient is that change in y, the change that occurs as the result of a one standard deviation change in x, expressed as a fraction of the standard deviation of y. Whatever that number is, it's about four: we move from about 10 to about 14 when we move x up by 3.16. The correlation coefficient is just that shift, that four, divided by 4.75, the standard deviation of y. So we can think of the correlation of x and y as the height of the solid bar circled here divided by the height of the dashed bar, also circled: the change in y associated with a one standard deviation change in x, divided by the standard deviation of y. In other words, the change in y as a fraction of the standard deviation of y.
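That ratio can be written out directly. A sketch using the lecture's hypothetical figures, and assuming the regression slope of 1.3 from the earlier example applies here:

```python
slope = 1.3   # regression coefficient: change in y per one-unit change in x
sd_x = 3.16   # standard deviation of x
sd_y = 4.75   # standard deviation of y

change_in_y = slope * sd_x   # change in y from a one-SD move in x: about 4.1
r = change_in_y / sd_y       # correlation coefficient

print(round(change_in_y, 2))  # 4.11
print(round(r, 2))            # 0.86
```

The shift of about four, divided by 4.75, gives a correlation of roughly 0.86.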

When we talk about standardization, we mean taking a set of numbers, perhaps our x variable, and for every observation in x subtracting the mean of x and then dividing by the standard deviation of x. This produces a standardized score that reflects the number of standard deviations a particular observation lies from the mean of that variable. So a standardized score of two would mean that the value of x was two standard deviations above its mean.

An easy example to think about is intelligence tests. Intelligence tests are constructed by design to have a mean of 100 and a standard deviation of 15. So for somebody with an IQ of 130: subtracting 100 from 130 gives us 30, and dividing by 15, the standard deviation, gives us 2. A z score of 2 implies being two standard deviations above the mean, in this case an IQ of 130. Standardized values are also called z scores; you may have heard of them in that context somewhere else.
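Here is a minimal sketch of standardization: subtract the mean, divide by the standard deviation. The data are invented so the arithmetic comes out in round numbers.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # mean 5, SD 2

z = (x - x.mean()) / x.std()  # z-scores (standardized values)

print(z.tolist())  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
print(z.mean(), z.std())  # mean 0 and SD 1 by construction
```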

So the correlation coefficient, usually denoted with a small italic r, is the average change in the z score of y, the standardized score of y, associated with a one-unit change in the standardized score of x, that is, a one standard deviation change in x.
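That description can be checked numerically: regress the z score of y on the z score of x, and the slope you get is the correlation coefficient r. A sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 0.6 * x + rng.normal(size=500)  # correlated by construction

# Standardize both variables.
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

# Regression slope of zy on zx...
slope_z = np.cov(zx, zy)[0, 1] / np.var(zx, ddof=1)
# ...equals the ordinary correlation coefficient.
r = np.corrcoef(x, y)[0, 1]

print(round(slope_z, 4) == round(r, 4))  # True
```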

Now, it's important to keep in mind one of the special properties of correlation coefficients: because they reflect associations among standardized scores, they are unit-less. The standardized scores divide out the original units in which we measured whatever it is we're looking at. Suppose we're looking at the correlation between weight and height. If we ran a regression, we might get a coefficient reflecting the average number of pounds of increase associated with a one-inch change in height. If we recalculated the data in centimeters and kilograms, and looked at the number of kilograms of change associated with a one-centimeter change in height, we'd get a different regression coefficient. But the correlation coefficient for both sets of data would be the same, because in both cases, when we take our inches and pounds data and convert them to standardized scores, subtracting the means and dividing by the standard deviations, we sweep away the original units and are left with unit-less standardized scores. So correlation coefficients are unrelated to the underlying units in which values are measured.
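This unit-invariance is easy to demonstrate. A sketch with invented height and weight data, converted from inches and pounds to centimeters and kilograms: the regression coefficient changes, the correlation does not.

```python
import numpy as np

rng = np.random.default_rng(2)
height_in = rng.normal(68, 3, size=300)                           # inches
weight_lb = -100 + 4.0 * height_in + rng.normal(0, 15, size=300)  # pounds

height_cm = height_in * 2.54    # inches -> centimeters (exact)
weight_kg = weight_lb * 0.4536  # pounds -> kilograms (approximate factor)

def ols_slope(x, y):
    """Least-squares regression coefficient of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

slope_imperial = ols_slope(height_in, weight_lb)  # pounds per inch
slope_metric = ols_slope(height_cm, weight_kg)    # kilograms per centimeter

r_imperial = np.corrcoef(height_in, weight_lb)[0, 1]
r_metric = np.corrcoef(height_cm, weight_kg)[0, 1]

print(round(slope_imperial, 2), round(slope_metric, 2))  # different numbers
print(round(r_imperial, 6) == round(r_metric, 6))        # True
```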

Correlation coefficients are also always between minus one and one: a one standard deviation change in one variable can never be associated with more than a one standard deviation change in the other variable. That's a mathematical fact whose proof we don't have time for here.