0:00

This lecture's about preprocessing predictor variables.

And so, as we saw in previous lectures, you need to plot the variables

up front so you can see if there's any sort of weird behavior in those variables.

And sometimes predictors will look very strange

or the distribution will be very strange, and

you might need to transform them in order

to make them more useful for prediction algorithms.

This is particularly true when you're using model-based algorithms,

like linear discriminant analysis, naive Bayes,

linear regression, and things like that.

We'll talk about all those methods later in the

class, but just keep in mind that pre-processing can

often be more useful when you're using model-based

approaches than when you're using more nonparametric approaches.

0:37

So, why preprocess?

Here again I'm loading the caret package, and I'm

loading the kernlab package, and then I'm attaching the spam data.

Again just like I talked about previously, when you're deciding how to preprocess

data, or how to explore data we only look at the training set.

So, we split the data right away into training and testing

sets, and we set the testing data aside for later.

Now if I look at one of the variables, so again, this is spam data,

so we're trying to predict whether the data is spam or if it's good emails, ham.

And, so the variables are things like how many capitals do we see in a row?

What's the run length for the number of capitals in a row in an email?

If you make a histogram of those values, you see, for example,

that almost all of the run lengths are very small,

but there are a few that are much, much larger.

This is an example of a variable that is very skewed, and so it's

very hard to deal with in model-based

predictors, and so you might want to preprocess it.

So, if you take the mean of this variable, it's about 4.7.

But the standard deviation is huge; it's much, much larger.

So, this variable is highly variable.

And so, what you might want to do is do some sort of preprocessing, so

the machine learning algorithms don't get tricked by

the fact that it's skewed and highly variable.
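The loading, splitting, and plotting steps described above might look like this in R. This is a sketch, assuming the caret and kernlab packages are installed; the 75/25 split proportion is an illustrative choice, not stated in the lecture.

```r
# Sketch of the setup: load packages, attach the spam data, split it
library(caret)
library(kernlab)
data(spam)

# Split the data right away; the testing set is set aside for later
inTrain  <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing  <- spam[-inTrain, ]

# capitalAve (average capital run length) is heavily skewed:
# almost all values are small, a few are much larger
hist(training$capitalAve, xlab = "average capital run length")
mean(training$capitalAve)  # roughly 4.7
sd(training$capitalAve)    # much larger than the mean
```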

1:51

So, one way that you can do this is by standardizing variables, and the

usual way to standardize variables is to

take the variable values and subtract their mean,

and then divide that whole quantity by the standard deviation.

When you do that the mean of the variables that you have will be zero.

And the standard deviation will be one, so that will

reduce a lot of that variability that we saw previously.

And it will standardize the variable somewhat.
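The standardization just described can be sketched as follows, assuming the training set from the train/test split earlier in the lecture:

```r
# Standardize on the training set: subtract the mean, divide by the sd
trainCapAve  <- training$capitalAve
trainCapAveS <- (trainCapAve - mean(trainCapAve)) / sd(trainCapAve)

mean(trainCapAveS)  # 0 (up to floating point error)
sd(trainCapAveS)    # 1
```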

2:21

One thing to keep in mind is, again, when

we apply a prediction algorithm to the test set,

we have to be aware that we can only

use parameters that we estimated in the training set.

In other words, when we apply this same standardization

to the test set, we have to use the

mean from the training set, and the standard deviation

from the training set, to standardize the testing set values.

What does this mean?

It means that when you do the standardization, the

mean will not be exactly zero in the test set.

And the standard deviation will not be exactly one, because

we've standardized by parameters estimated in the training set, but

hopefully they'll be close to those values, even though we're

not using the exact values computed from the test set.
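A sketch of this test-set standardization, using the training set's mean and standard deviation as the lecture prescribes:

```r
# Standardize the TEST set using the TRAINING set mean and sd
testCapAve  <- testing$capitalAve
testCapAveS <- (testCapAve - mean(training$capitalAve)) /
                 sd(training$capitalAve)

mean(testCapAveS)  # close to, but not exactly, 0
sd(testCapAveS)    # close to, but not exactly, 1
```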

You can also use the preProcess function to do a lot of standardization for you.

So, the preProcess function is a function that is built into the caret package.

And here I'm passing it all of

the training variables except for one,

the 58th in the data set, which is the actual outcome that we care about.

And I'm telling it to center every variable and scale every variable.

That will do that same transformation that we talked about previously to

the data, where you subtract the mean and divide by the standard deviation.

And you can see that by looking at the

mean of the variable capitalAve, just like we did before.

And you can see that after using the preProcess function

the mean is zero, and the standard deviation is one.

So, preProcess can be used to perform a lot of the preprocessing

techniques that you used to have to do by hand.
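The preProcess call described here might look like this, assuming the training set from the earlier split (column 58 of the spam data is the outcome, type):

```r
# preProcess from caret does the same centering and scaling for you;
# column 58 (the outcome) is excluded
preObj <- preProcess(training[, -58], method = c("center", "scale"))

trainCapAveS <- predict(preObj, training[, -58])$capitalAve
mean(trainCapAveS)  # 0 (up to floating point error)
sd(trainCapAveS)    # 1
```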

The other thing that you can do is you can use the object that's created

using the preprocessing technique to apply that same preprocessing to the test set.

So, here this preObj was the object used on the previous slide.

That was the object that we created by preprocessing the training set.

4:09

So, looking at that value, we can see now that

if we pass the testing set to the predict function,

then what it will do is take

the values calculated on the preprocessing object and apply

them to the test set object. And so again, in the preprocessed test set data,

the mean won't be exactly equal to zero

for any variable, and the standard deviation won't be exactly

equal to one, but they'll be close,

because we used the training set values to normalize.
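This step can be sketched as follows, assuming the preObj created on the training set above:

```r
# Apply the object estimated on the training set to the test set
testCapAveS <- predict(preObj, testing[, -58])$capitalAve

mean(testCapAveS)  # close to 0, but not exactly
sd(testCapAveS)    # close to 1, but not exactly
```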

4:36

You can also pass the preProcess commands directly

to the train function in caret, as an argument.

So, for example, here we can send to the preProcess

argument of the train function the parameters center

and scale, and that will center and scale all of

the predictors before using those predictors in the prediction model.
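A sketch of passing preprocessing straight to train; the glm method and the seed value are illustrative choices, not mandated by the lecture:

```r
# Preprocessing handled inside train itself
set.seed(32343)  # arbitrary seed for reproducibility
modelFit <- train(type ~ ., data = training,
                  preProcess = c("center", "scale"),
                  method = "glm")
```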

The other thing that you can do is do other kinds of transformation.

So, centering and scaling is one approach, and that will take

care of some of the problems that we see in these data.

It can help with very strongly biased predictors,

or predictors that have super high variability.

The other thing that you can do is use

other kinds of transformations; one example is the Box-Cox transform.

Box-Cox transforms are a set of transformations

that take continuous data and try to make it

look like normal data, and they do that by

estimating a specific set of parameters using maximum likelihood.

So, if I use the preProcess function and I tell it to

perform Box-Cox transformations on each of the variables, and then I predict

5:37

each of the different variables using that preProcess object on the training set,

I can look at the capital average values,

and I can make a histogram of those values.

And now, remember in the original plot on the histogram, they looked

like a huge pile at zero and a few values that were large.

And now you see something that looks a little bit more like

a normal distribution, a little bit more like a bell curve here.

You will notice that it doesn't take care of all of the problems.

So, for example, there's still a stacked set of values here at zero. And in

the Q-Q plot, which is

showing the theoretical quantiles of the normal distribution

versus the sample quantiles that we calculated for our

preprocessed data, you can see that they don't perfectly

line up. In particular, there's this chunk

down here at the bottom that doesn't lie on

a perfect 45-degree line, and that's because if

you have a bunch of values that are exactly

equal to zero, this is a continuous transform and

it doesn't take care of values that are repeated.

So, it doesn't take care of all of the problems that

would occur with a variable that's highly skewed.
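The Box-Cox step and the two plots just discussed can be sketched like this, assuming the training set from earlier:

```r
# Box-Cox: estimate a power transformation by maximum likelihood so
# each continuous variable looks more normal
preObj <- preProcess(training[, -58], method = "BoxCox")
trainCapAveS <- predict(preObj, training[, -58])$capitalAve

# The histogram looks more bell-shaped, but the Q-Q plot still shows a
# chunk off the 45-degree line: the pile of identical values at zero
# can't be spread out by a continuous transform
par(mfrow = c(1, 2))
hist(trainCapAveS)
qqnorm(trainCapAveS)
```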

6:38

So, the thing that we can do is also impute data for these data sets.

So, it's very common to have missing data.

And when you have missing data in

your data sets, prediction algorithms often fail.

Prediction algorithms are built not to handle missing data in most cases.

So, if you have some missing data,

you can impute it using something called k-nearest neighbors imputation.

So here we set the seed again, because this is

a randomized algorithm, and we want to get reproducible results.

And I take just these capital average values

and I create a new variable called capAve, with some of its values set to be missing.

7:29

So, now I want to know: how would I handle those missing values?

I did this so that we could see how to handle missing values in a dataset.

So, one thing that you can do is

use the preProcess function and tell it to do k-nearest neighbors imputation.

K-nearest neighbors imputation finds the k nearest observations.

So if k is equal to ten, it finds the 10

data vectors that look most like the data vector

with the missing value, averages the values of

the variable that's missing, and imputes them at that position.

So, if we do that, then we can predict, on our training set, all of

the new values, including the ones that

have been imputed with the k-nearest neighbors imputation algorithm.
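The imputation walkthrough might look like this; knnImpute in caret needs the RANN package, and the 5% missingness rate is an illustrative choice:

```r
# Seed set because the imputation algorithm is randomized
set.seed(13343)

# Make a copy of capitalAve and knock out ~5% of values at random,
# so we have known missing values to impute
training$capAve <- training$capitalAve
selectNA <- rbinom(dim(training)[1], size = 1, prob = 0.05) == 1
training$capAve[selectNA] <- NA

# Impute each NA from the nearest complete observations
# (note: knnImpute also centers and scales the variables)
preObj <- preProcess(training[, -58], method = "knnImpute")
capAve <- predict(preObj, training[, -58])$capAve
```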

8:17

One thing you can do is look at the comparison between the

imputed values and the actual values. In this case, since we set some of the values

to NA in advance, we can compare the values that were imputed with

the values that were truly there before we removed them and made them NAs.

And we can see how close those two values are to each other.

And so you can see, for example, that

those values are relatively close to each other.

Most of the differences are very close to zero,

so the imputation worked relatively well.

You can also look at just the values that were imputed.

So again, here I'm looking at the

quantiles of the same difference between the imputed values

and the true values, but only for the ones that were missing.

And here you can see again, most of the values are close to zero, but here we're

only looking at the ones we're missing, so

clearly some of them are more variable than previously.

And then you can look at the ones

that we did not select to be NA.

And you can see that they're even closer to each other, and

so the ones that got imputed are a little bit further apart,

but aren't that much further apart.
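The three comparisons just described can be sketched as follows; because knnImpute standardizes, the true values are standardized the same way before differencing (this continues the imputation variables capAve and selectNA from the lecture's example):

```r
# Standardize the true values the same way knnImpute standardized capAve
capAveTruth <- training$capitalAve
capAveTruth <- (capAveTruth - mean(capAveTruth)) / sd(capAveTruth)

quantile(capAve - capAveTruth)               # all values: mostly near 0
quantile((capAve - capAveTruth)[selectNA])   # imputed only: more spread
quantile((capAve - capAveTruth)[!selectNA])  # untouched: very close to 0
```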

9:26

So, there is a lot more to learn

about training and testing sets in terms of transformations.

But the things to keep in mind are that

training and test sets must be processed in the same way.

The caret package handles a lot of this under the hood, in

the sense that if you train a model using the preProcess option

9:44

built into the train function in caret,

it applies that same preprocessing to the test set.

It will handle all of the preprocessing in the correct way for you.

In general, you need to pay attention to the

fact that anything you do to the training set

will create a set of parameters.

You must use only those parameters when you apply it to the test set.

You can't estimate new transformations on the test set alone.

And that means that the test set transformations will likely be imperfect.

Especially if the test and training sets are

collected at different times or in different ways.

Some of the transformations won't necessarily work as well.

10:21

All of the transformations I'm talking about so

far, other than imputation, are based on continuous variables.

When dealing with factor variables, it's a little

bit more difficult to know what the right transformation is.

Most machine learning algorithms are built to deal with either

binary predictors, in which case the binary predictors are not preprocessed,

or continuous predictors, in which case it's sometimes expected

that the data are preprocessed to look more normal.

You can go to the link here to look into

more information about how to preprocess

data for prediction using the caret package.