An alternative rule would not train quite so tightly to the training set, but would still use the basic principle that if a message has a high average run of capital letters, then it's a spam message. So this rule might look something like this.

If the capital-letter average is above 2.40, it's a spam message; if it's less than or equal to 2.40, it's a nonspam message.
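A minimal sketch of this simplified rule in Python (the capital-letter averages below are made up for illustration; only the 2.40 cutoff comes from the lecture):

```python
# Simplified rule: a single threshold on the average run length
# of capital letters in a message.
CUTOFF = 2.40

def predict(capital_avg):
    """Predict 'spam' above the cutoff, 'nonspam' at or below it."""
    return "spam" if capital_avg > CUTOFF else "nonspam"

# Hypothetical capital-letter averages for a few messages.
for value in [5.1, 0.8, 2.5, 2.2]:
    print(value, "->", predict(value))
```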

So this rule on the training set would then miss that one value. In other words, you could get a prediction of nonspam for that one spam message whose capital average was just a little bit lower in our training set.

So overall, it looks like on that training set the accuracy is a little bit lower for this rule, and it's a little bit more simplistic.

So then we can apply it to all the spam data.

In other words, apply it to all the values, not just the values that we had in the small training set, and these are the results that you would get.

So this is a table with our predictions on the rows here, and the actual values in the columns.

And so you can see that the errors we make are the ones on the off-diagonal elements of this little matrix that we created. So those are the number of errors that we made.
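One way to build such a table of predictions versus actual values is to cross-tabulate the two label vectors. A sketch with made-up labels, not the real spam data:

```python
from collections import Counter

def confusion_table(predicted, actual, labels=("nonspam", "spam")):
    """Cross-tabulation: rows are predicted labels, columns are actual labels."""
    counts = Counter(zip(predicted, actual))
    return {p: {a: counts[(p, a)] for a in labels} for p in labels}

# Hypothetical labels for five messages.
predicted = ["spam", "nonspam", "spam", "nonspam", "nonspam"]
actual    = ["spam", "nonspam", "nonspam", "nonspam", "spam"]
table = confusion_table(predicted, actual)

# The errors are the off-diagonal cells of this matrix.
errors = table["spam"]["nonspam"] + table["nonspam"]["spam"]
print(table)
print("errors:", errors)
```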

And so what we can do is actually look at the number of times that we're right using our more complicated rule.

So this is just the sum of the times that our

prediction is equal to the actual value in the spam data set.

And so that happens 3,366 times in this data set.

And then we could also look at the

more simplified rule, the rule where we just used

a threshold, and also look at the number of

times that that's equal to the real spam type.

And you can see that about 30 more times, we actually get the right answer when we use this more simplified rule.
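The comparison between the two rules boils down to counting, for each rule, how often its prediction matches the actual label over the whole data set. A sketch with a hypothetical "complicated" rule and made-up data (it does not reproduce the 3,366 figure from the real spam data):

```python
def accuracy_count(rule, capital_avgs, actual):
    """Number of messages where the rule's prediction equals the actual label."""
    return sum(rule(x) == y for x, y in zip(capital_avgs, actual))

# Simplified rule: one threshold.
simple = lambda x: "spam" if x > 2.40 else "nonspam"

# Hypothetical "complicated" rule, tuned to a quirk of some training set.
complicated = lambda x: "spam" if x > 2.80 or 2.40 < x <= 2.45 else "nonspam"

# Made-up full data set of capital averages and true labels.
xs = [5.1, 0.8, 2.5, 1.9, 2.2, 3.0]
ys = ["spam", "nonspam", "spam", "nonspam", "nonspam", "spam"]
print(accuracy_count(simple, xs, ys))       # matches for the simple rule
print(accuracy_count(complicated, xs, ys))  # matches for the complicated rule
```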

So, what's the reason that the simplified rule actually does better than the more complicated rule? And the reason why is overfitting.

So, in every data set we have two parts: we have the signal component, which is the part we're trying to use to predict.

And then we have noise, so that's just random variation in

the dataset that we get, because the data are measured noisily.

And so the goal of a predictor is to find the signal and ignore the noise.

And in any small dataset, you can always build a

perfect in-sample predictor just like we did with that spam dataset.

You can always carve up the prediction space in this small data set to capture every single quirk of that data set.

But when you do that, you capture both the signal and the noise.

So for example, in that training set there was one spam value that had a slightly lower capital average than some of the nonspam values.

But that was just because we randomly picked a data

set where that was true, where that value was low.

So that predictor won't necessarily perform as well on new samples,

because we've tuned it too tightly to the observed training set.

So, this lecture has two purposes.

One is to introduce you to the idea of in-sample and out-of-sample errors. In-sample errors are the errors on the training set that we actually built the predictor with, and out-of-sample errors are the errors on the data set that wasn't used to build the predictor.
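These two kinds of error can be sketched on synthetic data: a rule that memorizes the training set gets a perfect in-sample error but falls apart out of sample, while a simple threshold holds up. Everything below (the data generator, the 2.75 cutoff) is invented for illustration and is not the lecture's actual analysis:

```python
import random

random.seed(0)

def make_data(n):
    """Synthetic messages: spam tends to have a higher capital-letter
    average, plus noise (a stand-in for real, noisily measured data)."""
    data = []
    for _ in range(n):
        label = random.choice(["spam", "nonspam"])
        base = 4.0 if label == "spam" else 1.5
        data.append((base + random.gauss(0, 1.0), label))
    return data

train, test = make_data(20), make_data(1000)

# Overfit rule: memorize every training value exactly; anything
# it has never seen defaults to "nonspam".
memory = dict(train)
overfit = lambda x: memory.get(x, "nonspam")

# Simple rule: one threshold between the two groups.
simple = lambda x: "spam" if x > 2.75 else "nonspam"

def error_rate(rule, data):
    """Fraction of misclassified messages."""
    return sum(rule(x) != y for x, y in data) / len(data)

print("in-sample errors:    ", error_rate(overfit, train), error_rate(simple, train))
print("out-of-sample errors:", error_rate(overfit, test), error_rate(simple, test))
```

The memorizing rule captures both the signal and the noise in the 20 training points, which is exactly why its out-of-sample error is so much worse.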

And also we introduced to you this idea of overfitting: we want to build models that are simple and robust enough that they don't actually capture the noise, while they do capture all of the signal.