0:00

Hi, and welcome to the class on multivariable regression,

as part of our course here in the Data Science Specialization.

In this lecture,

we're gonna talk about instances where we have lots of potential predictors for

an outcome, rather than just a single predictor in linear regression.

In any instance, where you're using a predictor X to predict a response Y,

and you find a nice relationship, if that predictor

hasn't been randomized to the subjects or units, whatever you're looking at,

0:40

Imagine if you had a friend that downloaded some data, where they had all

sorts of health information from people and also their dietary information.

And this person said, hey, I found something interesting.

Breath mint usage has a significant regression relationship with forced

expiratory volume, a measure of lung function, pulmonary function.

1:03

If you've ever gotten an asthma test, they'll measure your FEV.

Well, you would be skeptical.

I mean, there's very little basis for a biological relationship there.

Breath mints are just sugar.

It doesn't seem reasonable that they would impact lung function.

But what you'd really be thinking is,

well, what other variables might be explaining this relationship?

And you might come up with two hypotheses.

One is, this person dug through lots and lots and lots of variables and

just found the one that was significant, and it's just a chance association.

And that's the problem of multiplicity.

Okay, so we'll talk about that in other aspects of this course in the inference

course, but let's just assume that this person didn't do that, they looked only

at a couple of variables, and the multiplicity concerns weren't so bad.

Then what would you think?

Well, likely, you would think,

well, probably the real problem is smokers tend to use more breath mints, and

smoking has this known relationship with lung function, so

it's well-established that chronic exposure to smoke,

even second-hand smoke, has negative impacts on lung function.

So it's probably that, it probably has nothing to do with the breath mints,

it's an indirect effect of breath mints through smoking,

not a direct effect of breath mints on lung function.

That would be your likely hypothesis.

2:22

Well, how would you establish that there's a breath mint effect beyond smoking?

Well, likely, what you would ask your friend to do is,

well, consider smokers by themselves, and see whether their lung function

differs by their breath mint usage, and consider non-smokers by themselves, and

see whether their lung function differs by breath mint usage,

where you've conditioned on smoking status so that you're comparing like with like.
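
To make that "like with like" comparison concrete, here is a minimal sketch. The eight records are invented purely for illustration (they are not from any real study), and the helper name `mint_gap` is hypothetical. It compares mint users with non-users overall, then within smokers and within non-smokers.

```python
# Toy records, invented purely for illustration: (smoker, uses_mints, fev_litres)
records = [
    (1, 1, 2.9), (1, 1, 3.1), (1, 1, 3.0), (1, 0, 3.0),
    (0, 1, 4.0), (0, 0, 3.9), (0, 0, 4.1), (0, 0, 4.0),
]

def mean(values):
    return sum(values) / len(values)

def mint_gap(rows):
    """Mean FEV of mint users minus mean FEV of non-users (hypothetical helper)."""
    users = [fev for _, mints, fev in rows if mints == 1]
    non_users = [fev for _, mints, fev in rows if mints == 0]
    return mean(users) - mean(non_users)

# Crude comparison that ignores smoking status.
overall_gap = mint_gap(records)

# Stratified comparison: condition on smoking status first.
smoker_gap = mint_gap([r for r in records if r[0] == 1])
nonsmoker_gap = mint_gap([r for r in records if r[0] == 0])

print(f"all subjects:     {overall_gap:.2f}")
print(f"smokers only:     {smoker_gap:.2f}")
print(f"non-smokers only: {nonsmoker_gap:.2f}")
```

In this made-up data, mint users look about half a litre worse overall, but within each smoking group the gap is essentially zero, because most of the mint users happen to be smokers.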

2:50

Well, multivariable regression is sort of an automated way to

do that in a linear fashion.

It makes assumptions, which is fair enough, but it does that in a sort of automated way,

and we'll explain in this lecture in what way is it trying to sort of

hold smoking status constant while looking at breath mint usage,

and how it adjusts, and we'll also talk a little bit about its limitations.

But that's the fundamental idea of what multivariable regression is trying to do.

It's trying to look at the relationship of a predictor and

a response, while having, at some level, accounted for other variables.
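
As a rough sketch of what "accounting for" smoking looks like in a fit, the code below regresses lung function on mint usage alone, and then on mint usage plus smoking status. The data are toy values invented for illustration, and the helpers `solve` and `least_squares` are hypothetical names for a plain normal-equations solver, not the method of any particular package.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def least_squares(rows, y):
    """Least squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

# Toy records, invented purely for illustration: (smoker, uses_mints, fev_litres)
records = [
    (1, 1, 2.9), (1, 1, 3.1), (1, 1, 3.0), (1, 0, 3.0),
    (0, 1, 4.0), (0, 0, 3.9), (0, 0, 4.1), (0, 0, 4.0),
]
fev = [r[2] for r in records]

# Model 1: intercept + mint usage only.
unadjusted = least_squares([[1.0, m] for _, m, _ in records], fev)
# Model 2: intercept + mint usage + smoking status.
adjusted = least_squares([[1.0, m, s] for s, m, _ in records], fev)

mint_unadjusted = unadjusted[1]
mint_adjusted = adjusted[1]
smoking_coef = adjusted[2]
print(mint_unadjusted, mint_adjusted, smoking_coef)
```

With these invented numbers, the mint coefficient is about -0.5 when smoking is left out and about 0 once smoking is in the model, while the smoking coefficient comes out near -1.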

3:29

I wanted to also talk about another use of multivariable regression for prediction.

It's actually a very good prediction model.

So, as an example, several of us around here engaged in this so-called

Kaggle competition, to predict the number of days that a person

would be in the hospital in subsequent years given their claims history and

number of days they were in the hospital in previous years.

And this is of major importance to hospitals and

insurance companies and healthcare providers for a variety of reasons.

It's one of the main questions in that field.

So, in this competition, they gave you historic claims data,

a lot of it, actually, which had lots and lots of predictors, and

the number of days that several of the insured people from

this company were in the hospital for three consecutive years or so.

4:57

And the other is to avoid overfitting.

We'll learn that, as you put enough variables into a multivariable

regression model, you'll get zero residuals just by virtue of having

included even random vectors into your regression model.
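
Here is a small sketch of that zero-residual claim. It assumes a pure-noise outcome and as many random predictor vectors as there are observations (six of each, an arbitrary choice). In that square case the least squares fit is just the exact solution of the linear system, so the hypothetical helper `solve` below does plain Gaussian elimination.

```python
import random

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

random.seed(0)  # fixed seed so the sketch is repeatable
n = 6

# Outcome is pure noise; the predictors are unrelated random vectors.
outcome = [random.gauss(0, 1) for _ in range(n)]
predictors = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]  # row i = observation i

# With n predictors and n observations, the fit reproduces the outcome exactly.
coefs = solve(predictors, outcome)
fitted = [sum(x * b for x, b in zip(row, coefs)) for row in predictors]
residuals = [y - f for y, f in zip(outcome, fitted)]
max_abs_residual = max(abs(r) for r in residuals)
print(max_abs_residual)
```

The largest residual is zero up to floating-point rounding, even though none of the predictors has anything to do with the outcome.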

So certainly there's consequences to throwing lots of garbagey

predictors into a model.

And certainly there must be consequences to omitting

important predictors in a model.

5:23

And in the practical machine learning class,

which is also part of the specialization, you will learn a lot about model

selection strategies as they relate to the idea of prediction.

In this class, we're gonna focus more on the problems from the previous slide,

where we want to generate parsimonious models, where we're

deeply interested in interpreting the coefficients from the linear model.

And so the prediction problems, like this one,

are a little bit more geared toward our practical machine learning class.

But I just wanted to mention that multivariable regression is

a pretty good starting point in any prediction,

anytime where you're developing a prediction algorithm.

What we found in that competition is that multivariable regression got you very

close to the winning entry, and

lots of machine learning, and random forest, and boosting, and

all these other things, those only got you minor bumps on top of that.

And so, to get huge drops in prediction error,

well-thought out linear models sufficed, and

then to get really minor increments beyond that, you had to throw a lot of computing at it.

And to be fair, those did improve your chances, and we moved up

a little bit in the leaderboard by adding some of these more complicated things.

But it was remarkable how far you could get with just well done linear models.

6:47

So, our linear model looks an awful lot like our simple linear model.

It's just that we have more variables.

So our outcome, Y, it might be insurance claims or

forced expiratory volume, is equal to a bunch of coefficients.

These are the beta terms, they're like the slope terms in simple linear regression.

Just, now there's more predictor terms, more X values.

So, for example, X1 might be breath mint usage, a binary

variable, and X2 might be number of pack-years, or how much a person smoked.

And in the insurance case, X1 might be the number of

insurance claims in the previous year, and X2 might be whether or

not this person had a particular cardiac problem, something like that,

that might lead toward information about hospitalization in the successive year.

So here, we just write out the linear model.

And it just looks like the outcome is equal to a bunch of coefficients

times predictors.

7:52

Now, it's linear because it's linear in the coefficient.

So I'll reiterate that point in a minute.

I would also add that the first variable is typically just a constant one,

so there's an intercept that's included, a term that's just beta by itself.

Beta zero usually or Beta one.
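
Written out, the model just described might look like the following. The notation is an assumption, since the slide itself is not reproduced here: i indexes the n observations, k indexes the p predictors, and the epsilon term is the error.

```latex
Y_i = \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_p X_{pi} + \epsilon_i
    = \sum_{k=1}^{p} X_{ki} \beta_k + \epsilon_i,
\qquad X_{1i} = 1 \text{ for every } i
```

Setting the first predictor to a constant one is what makes its coefficient the intercept. If you prefer to call the intercept beta zero, the sum simply runs from k = 0 instead.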

8:12

So our least squares,

I think everyone could probably guess what the least squares is going to do.

It's just gonna look at the differences between the outcome and the prediction

from the linear model, the summation of the predictors times their coefficients.

Because that's not necessarily positive,

we're gonna square it and we're going to add it up so this difference for

every observation equally weights into this loss function that we created.

So least squares simply wants to minimize this equation.

And it's a direct extension of the equation we wanted to minimize

when we had simple linear regression.
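
In symbols, the quantity being minimized can be sketched as below. The notation is assumed rather than taken from the slide: Y_i is the outcome for observation i, X_{ki} is predictor k for observation i, and there are n observations and p coefficients.

```latex
\min_{\beta_1, \ldots, \beta_p} \;
\sum_{i=1}^{n} \left( Y_i - \sum_{k=1}^{p} X_{ki} \beta_k \right)^2
```

With p = 2 and the first predictor set to a constant one, this reduces to the sum of squared differences between Y_i and an intercept plus a slope times a single X, which is the simple linear regression criterion.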

8:47

And I would note that the important linearity is linearity in the coefficients.

So for example, if I take one of my X's and just square it, meaning

if I have a vector in R and I've just squared every element of that vector and

put that in as part of my model, then that will still be linear in the coefficient.

The coefficient won't be squared, it will just be the X term.

And so, the important version of linearity is linearity in the coefficient.

That's what defines a linear model.
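
A minimal sketch of that point: square a predictor, add it as an extra column, and fit by ordinary least squares. The data are constructed, noise-free values chosen so the answer is known in advance, and `solve` and `least_squares` are hypothetical helper names for a plain normal-equations solver.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def least_squares(rows, y):
    """Least squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

# Constructed data: y = 1 + 2x + 3x^2 exactly, so the true coefficients are known.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [1 + 2 * xi + 3 * xi ** 2 for xi in x]

# The squared predictor is just another column; each coefficient still enters linearly.
design = [[1.0, xi, xi ** 2] for xi in x]
coefs = least_squares(design, y)
print(coefs)
```

The fit recovers coefficients close to 1, 2 and 3. The curve is quadratic in x, but the model is still a linear model because each coefficient multiplies its column and nothing more.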