On the other hand, if you can give a person one kind of aspirin and

then later on give them a different kind of aspirin when they have another

headache, that would compare each person to themselves, right?

You'd be blocking on the person, so to speak.

So that's a design strategy.

Now there's some nuance with this design strategy as well.

What happens if there's some residual effect of the first aspirin when you

give the second one?

So maybe you could handle that with some sort of wash out period,

long wash out period or something like that.

But anyway, the point of that design is to make it so that you're comparing people

with themselves, to control for everything that's intrinsic to the person.

Comparisons across time periods control for that by giving both aspirins to each person.

Maybe you would randomize the order in which they received them; that's

called a crossover design.

At any rate, the broader point that I'm trying to make is, it's often the case

that good thoughtful experimental design can really eliminate the need for

some of the main considerations that you would have to go through in model

building if you were to just collect data in an observational fashion.

The last thing I would say is there's one automated model search

technique that I like quite a bit and I find it very useful, and

it's the idea of looking at nested models.

So, I'm often interested in a particular variable and I'm very

interested in how the other variables that I've collected will impact it.

So, I'm interested in a treatment or something like that.

Some important variable, but I'm worried that my treatment groups are

imbalanced with respect to some of these other variables.

So what I'd like to look at is the model that just includes the treatment by itself,

and the model that includes the treatment and, let's say, age,

if the ages weren't really balanced between the two treatment groups; and

then one that includes age and gender, if maybe the genders between the two

groups weren't really balanced; and so on.

And this idea of creating models that are nested,

where every successive model contains all the terms of the previous model,

leads to a very easy way of testing each successive model.

And these nested model examples are very easy to do, so I'm just

going to show you some code right here on how you do nested model testing in R.

So I fit three linear models to the swiss dataset;

the first one just includes Agriculture.

Let's pretend that that's the variable that we're interested in, and

then the next one includes Agriculture plus Examination and Education.

I put both of those in,

because I'm thinking they're kind of measuring the same thing.

But now after this lecture, I'm concerned about the possibility that they're

measuring too much of the same thing, but let's put that aside for the time being.

And then the third model includes Agriculture + Examination + Education + Catholic +

Infant.Mortality.

So, all the terms.
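The lecture's R session isn't reproduced here, so as a rough stand-in, here's a sketch in Python of what "nested" means mechanically: three least-squares fits where each model's regressors contain the previous model's. Everything here is synthetic made-up data; the five columns just stand in for the swiss variables named in the lecture.

```python
# A sketch of nested model fitting, in Python rather than the lecture's R.
# All data is synthetic; the columns stand in for Agriculture, Examination,
# Education, Catholic and Infant.Mortality.
import numpy as np

rng = np.random.default_rng(0)
n = 47  # the swiss data happens to have 47 rows
X = rng.normal(size=(n, 5))
y = X @ np.array([0.5, -0.3, -0.8, 0.1, 1.0]) + rng.normal(size=n)

def rss_and_df(X_sub, y):
    """Least-squares fit with an intercept; return residual SS and residual df."""
    Xd = np.column_stack([np.ones(len(y)), X_sub])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return float(resid @ resid), len(y) - Xd.shape[1]

rss1, df1 = rss_and_df(X[:, :1], y)  # "fit1": first regressor only
rss3, df3 = rss_and_df(X[:, :3], y)  # "fit3": adds two more regressors
rss5, df5 = rss_and_df(X, y)         # "fit5": all five regressors

# Each step adds two parameters, so the residual df drops by two,
# and the RSS can only go down as regressors are added.
print(df1 - df3, df3 - df5)  # 2 2
```

In R the analogous fits are one-liners with lm, and anova does this residual sum of squares and degrees of freedom bookkeeping automatically.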

So now, I have three nested models and I'm interested in seeing what happens to

my effect as I go through those three models.

The point being, in this case, you can test whether or not the inclusion of

the additional set of extra terms is necessary with the ANOVA function.

So I do anova(fit1, fit3, fit5).

That's what I named them: one, three, five.

And then you see down here, what you get is a listing of the models.

Model 1, model 2, model 3 and then it gives you the degrees of freedom.

That's the number of data points minus the number of parameters that it had to fit.

The residual sums of squares, and

then Df, which is the excess degrees of freedom going from

model 1 to model 2, and then from model 2 to model 3.

So we added two parameters going from model 1 to model 2, that's why that Df is

2 and then we added two additional parameters going from model 2 to model 3.

So the two parameters we added going from model 1 to model 2 are

Examination and Education; they're two regression coefficients.

Going from model 2 to model 3, we added Catholic and

Infant.Mortality; that's two more regression coefficients.

With the residual sums of squares and the degrees of freedom,

you can calculate the so-called F statistic,

and thus get a P value.
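To make that calculation concrete, here's a small sketch of the standard nested-model F statistic that anova computes from the residual sums of squares and degrees of freedom. The numbers below are made up for illustration, not the swiss output.

```python
# The nested-model F statistic from the RSS and df that anova reports.
# Numbers are made up for illustration, not the swiss output.

def nested_f(rss_small, df_small, rss_big, df_big):
    """F = (drop in RSS per extra parameter) / (residual variance of the bigger model)."""
    extra_df = df_small - df_big  # parameters added by the bigger model
    return ((rss_small - rss_big) / extra_df) / (rss_big / df_big)

# Say adding 2 parameters drops the RSS from 60.0 to 40.0,
# leaving 40 residual df in the bigger model:
f_stat = nested_f(60.0, 42, 40.0, 40)
print(f_stat)  # (20.0 / 2) / (40.0 / 40) = 10.0

# The P value then comes from comparing f_stat against an F distribution
# with (extra_df, df_big) degrees of freedom.
```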

This gives you the F statistic and the P value associated with each comparison.

Then here it shows that, yes, the inclusion of Examination and Education

appears to be necessary beyond just looking at Agriculture by itself.

Then I look at the next one and it says, yes, the inclusion of Catholic and

Infant.Mortality appears to be necessary beyond just

including Examination, Education and Agriculture.

So if the way in which you're interested in looking at your data naturally

falls into a nested model search, as it often does

when you're interested in one variable in specific,

as in this case,

then some kind of nested model search is a reasonable thing to do;

I think it's a pretty natural way of thinking about the series of analyses.

It doesn't work if the models that you're looking at aren't nested.

For example, if model 2 had Examination but

not Education, and the third model had Education but not Examination.

This wouldn't apply, you'd have to do something else.

And there, I think you get into the harder world of automated model selection with

things like information criteria.

So I would put all that stuff off to our prediction class and

just leave you with this one technique that's useful in the one specific instance

where you've decided to look along a series of models,

each getting increasingly more complicated, but including the previous one.

So, I hope in this lecture that you've gotten a couple of model selection

techniques that you can use.

I hope you've also learned that there are some basic consequences that occur, if you

include variables that you shouldn't have or exclude variables that you should have.

This has consequences for the coefficients that you're interested in,

and it has consequences for your residual variance estimate.

We didn't even touch on some other aspects of [INAUDIBLE] model that could occur,

such as absence of linearity and other things like that, non-normality and so on.

So again, it's generally necessary to take your model with a grain of salt,

because more than likely some aspect of your model is wrong.

And I'll leave you then with this famous quote by George Box, who very

famously said, all models are wrong, some models are useful.

And I think that's a very good credo to go along with: yes, for

sure your model is wrong, but it might be useful in the sense of being

a lens that teaches you something useful and true about your data set.