Hello. Welcome to Lesson Three.
This lesson is going to build on the previous lesson that introduced
cross validation by showing how you use cross validation to
select the optimal hyperparameters for
a machine learning algorithm while still minimizing the effect of overfitting.
By the end of this lesson, you should be able to
explain what grid search is and how to use it.
You should be able to articulate
how cross validation can be used to select model hyperparameters.
And you should be able to apply
grid search cross validation by using the scikit-learn library.
Now, this particular lesson has a reading and a notebook.
So for the reading,
I wanted to use this article because it shows how,
even if you're not overfitting,
you can still have issues if you don't understand your data.
So the premise of this article is that if your training data is biased in some way,
it can impact the predictions your machine learning model is going to make.
So, this article shows how a machine learning algorithm was trained on
images, and these images had
certain characteristics that biased the results that the model learned.
And this is an important thing to keep in mind,
sometimes the data truly are reflective of reality.
But other times, there are biases that might be implicit in the data for whatever reason.
This included two image data sets that were prominently
used in machine learning competitions and that would
bias the algorithm to think that
someone shopping, washing, or working in the kitchen would be a woman,
while somebody coaching, shooting, or doing things like that would be a man.
And the important point here isn't that
there aren't situations where these sorts of divisions may really occur;
it's that the process of training
a machine learning algorithm doesn't just reflect or mirror these biases,
it tends to amplify them.
So, if there are issues in a data set,
it's not that your model will come out and predict that exact ratio.
It's that it's going to learn there's a difference, and
that difference is going to get amplified in
the model construction process, so that the model ends up with
a higher relative bias, and that can be very problematic.
This article talks about one example,
but there are many others out there that have been problematic for companies,
not just in terms of public relations issues;
they can actually cause problems and cause you to lose business.
And so it's a very important thing to be wary of.
You want to always make sure that you haven't overfit
but you also want to make sure that your data are truly
reflective of reality and that your model is not doing things that could be problematic.
So, that leads us straight into the topic of Model Selection.
We've talked about cross validation and the splitting of your data into a training set,
a validation set, and a test set.
And the idea with model selection is that we're going to use the training and
validation data to tune the hyperparameters and
find the optimal combination.
So, to do this,
we're going to introduce the concept of model selection.
We're going to talk about grid search, and in particular,
multi-dimensional grids and the parameter grid.
We're also going to introduce the concept of randomized grid search.
And lastly, I'm going to introduce a technique called nested cross-validation.
So, getting started, we, of course,
start with our standard import code.
There's a new line here, we're going to import time.
And the reason we do this is some of these cells will take a while to run.
And so, to give you a feel for that,
I put the timing in here so that when you see the notebook and then
go to run it on your own server,
you'll know that it should take a while.
So, don't panic when it doesn't run immediately.
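Just to make that concrete, a setup cell along these lines is a reasonable sketch; the exact imports in the course notebook may differ a bit.

import time                       # used to time the slower cells

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Timing a cell the way the notebook does: record a start time, do the work,
# then report the elapsed seconds.
start = time.time()
# ... the cell's actual work would go here ...
print(f'Elapsed time: {time.time() - start:.2f} seconds')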
Now, what about model selection?
Well, typically, we're going to want to pick
a hyperparameter or a set of hyperparameters and we're going to vary them.
Now, we're going to start with a single parameter.
We're going to use SVC on the handwritten digit data set.
Now, first, we have to scale it,
SVC likes the data to be properly scaled.
And then what we're going to do is pick a list of C values.
We're going to do a uniform sampling in log space so that we have 0.0001,
0.001, 0.01, and so on, all the way up to 1,000.
We then are going to split our data into training and testing sets.
And then, we're going to do a StratifiedKFold on our training data.
So, this is stratified meaning we will maintain relative class ratios.
We're going to have a five fold split.
And then, we can apply this to do our cross-validation.
So, every time we iterate,
we're going to get a cross-validation score.
We're going to build up the array of these scores.
We're going to take the mean and the standard deviation
of these scores at each iteration.
And then, we're going to display the compute time and then some statistics at the end.
So, we had eight values for C vals.
Eight different values for that parameter.
We're going to iterate through each of those.
We're going to train our model.
We're going to do our cross-validation and then keep track of these scores.
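To make that concrete, here is a minimal sketch of that manual loop on the scikit-learn digit data; the variable names, random seed, and exact ranges are illustrative rather than the notebook's own.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

digits = load_digits()

# Eight C values, sampled uniformly in log space: 0.0001 up to 1000
c_vals = np.logspace(-4, 3, 8)

# Hold out a test set; the cross validation happens on the training data only
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, stratify=digits.target, random_state=23)

# Stratified five-fold split maintains the relative class ratios
skf = StratifiedKFold(n_splits=5)

mean_scores, std_scores = [], []
for c in c_vals:
    pipe = make_pipeline(StandardScaler(), SVC(C=c))      # scale, then fit the SVC
    scores = cross_val_score(pipe, X_train, y_train, cv=skf)
    mean_scores.append(scores.mean())
    std_scores.append(scores.std())

mean_scores, std_scores = np.array(mean_scores), np.array(std_scores)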
So there you go. It took five seconds for that one cell to run.
And when we compute the statistics on this,
we can first ask what the highest value is,
the maximum value; argmax is a function that
finds that value and returns the index into the array for that value.
This allows us to get both the highest mean value as well as its standard deviation.
So, it's a very clever trick: when you have arrays that are lined up,
you can find the proper element in one array,
and then you know the right element to go to in the other arrays.
So this tells us what our best C parameter was. It's 10.
It also gives us the best score we had and the standard deviation for that score.
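Continuing the sketch above, that argmax lookup might look like this:

best_idx = np.argmax(mean_scores)                 # index of the highest mean score
print(f'Best C value = {c_vals[best_idx]}')
print(f'Best score   = {mean_scores[best_idx]:.3f} +/- {std_scores[best_idx]:.3f}')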
Remember, that all comes from this cross validation, because we effectively
run each different value of that hyperparameter on the training data,
and we do a cross-validation at that particular value.
So, this is why this could be so complex.
We had eight different values for this and we're doing a cross validation,
which in this case, was five folds.
So, we're doing five model iterations on the training data for that particular model.
And there were eight models. So, if you think about it, that's 40 iterations.
That's why the time can start to creep up quickly.
The next thing we can do, we can actually do a classification report on this data.
You see that it does a really good job.
The recall is extremely high.
You can also see the supports are fairly balanced,
that's because, we did stratified cross validation.
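As a rough sketch, and reusing the names from the earlier example, the report step could look like this; refitting at the best C before scoring the test set is my assumption about how it's done.

from sklearn.metrics import classification_report

# Refit at the best C on the full training set, then report on the held-out test set
best_pipe = make_pipeline(StandardScaler(), SVC(C=c_vals[best_idx]))
best_pipe.fit(X_train, y_train)
print(classification_report(y_test, best_pipe.predict(X_test)))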
We also can make a plot here of the parameters we had.
We built up the arrays of those cross validation scores,
so at every value of C,
we have the mean and standard deviation.
So you can see for very low values of the C parameter,
we had a very low score.
Then it quickly jumps up,
until we get up here and it seems relatively flat,
but this is our best value.
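A sketch of that plot, again reusing the arrays built up above:

import matplotlib.pyplot as plt

# Mean cross-validation score (with error bars) as a function of C, on a log axis
plt.errorbar(c_vals, mean_scores, yerr=std_scores, marker='o')
plt.xscale('log')
plt.xlabel('C')
plt.ylabel('Mean CV score')
plt.show()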
Now one thing to keep in mind,
we did a very coarse grid here, right?
Effectively, at every factor of 10,
we did another sampling.
So all we really had was here, here,
and here, to figure out that best value.
If we really want to model this,
we'd probably start out here and sample
throughout this region at a much finer resolution,
to pin down what that value is.
But again remember, every time you do it,
it's five model evaluations for that stratified five-fold split,
times the number of parameters we're sampling here.
So if we did 100, that would be 500 model evaluations.
You could see why it can become computationally intensive to find that best model.
Now, this is the idea that we're talking about,
we just did grid search, right?
We basically laid out a grid in one dimension.
But in reality, we usually have multiple hyper parameters.
So if we had 10 values of C and we're using the SVC algorithm,
we also might be looking at the gamma parameter.
So if we do 10 values for C and 10 values for gamma,
that's 10 times 10, or a hundred different
model combinations that we have to run our algorithm at.
And when we're doing grid search cross validation with five-fold validation,
that means five model evaluations for each parameter grid value.
So that would be 500 model evaluations.
And so you can see, it can take a lot of time to
run these different algorithms and figure out those model parameters.
And that was just two combinations.
If we had more, it would be even more complex.
So first, we start again. We're going to do a grid search.
It's going to automate this process of doing what we just did manually,
by figuring out the best model parameter and the best score.
For grid search, we pass in an estimator,
in this case our support vector classification pipeline.
We pass in a dictionary of parameter grids,
and then we pass in our cross validator.
We fit it to the data.
And when it's done, we get our compute time.
Notice it took 8.25 seconds, and we got our best model out.
Again, the best C value is 10,
and our best CV score was 0.981,
very similar to what we had before.
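Here is a hedged sketch of that grid search cell, reusing the pipeline pieces and cross validator from the earlier sketch; the parameter name svc__C assumes the SVC step in the pipeline is named svc, as make_pipeline would name it.

from sklearn.model_selection import GridSearchCV

pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {'svc__C': c_vals}        # 'svc' is the pipeline step name from make_pipeline

gsc = GridSearchCV(pipe, param_grid, cv=skf)
gsc.fit(X_train, y_train)

print('Best C value  =', gsc.best_params_['svc__C'])
print('Best CV score =', gsc.best_score_)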
We could also ask, though: what is the score for our test data?
Which is a little different, right?
Our test data scores very high, which is good to see.
We could also extract other things.
We can ask how long the mean fit took for each cross validation,
or the total fit time for one cross validation.
We could also get our training scores,
test score, et cetera.
There's a lot of information in there.
Again, we can use the argmax to get out what that cross validation mean test score was,
and what the best parameter was.
But more interesting than that, we could take this
GSC object's results, which come back as a dictionary,
and turn them straight into a data frame.
And I take the transpose,
because that turns the columns into rows,
and it's easier to see the different cross validations here.
We had eight different model parameters,
we can see the fit time,
the score time, et cetera.
We can see the parameters that we're varying.
In this case, it's just C,
and you can see the values for C. You can also
see the test scores with their different ranks,
so what was the top value?
You can see that here, we had 10,
100, a thousand, they were all ranked the same,
because the value was the same for that test score, et cetera.
You can see all the different statistics that were calculated and returned.
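As a sketch, that conversion is a one-liner:

import pandas as pd

# cv_results_ is a dictionary, so it drops straight into a DataFrame;
# transposing turns the columns into rows, one column per parameter setting
results = pd.DataFrame(gsc.cv_results_).T
results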
We could also do this in a multidimensional case,
we merely need to define our grid for the first parameter and the second parameter.
Here, I'm actually narrowing the range down,
so that we have fewer test points.
So rather than eight,
I'm using six for C,
and I'm only using eight for gamma. So this is actually 48 grid points.
And we're going to be doing the five fold cross validation again,
which means we're going to have five times 48,
or 240 model evaluations, which is a lot.
So this will take a while.
Notice that when I run this,
it takes almost 70 seconds.
That's quite a long time.
When we do this, we get the Best C value and the Best gamma value out.
And you can see those, as well as the best cross validation score.
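A sketch of that two-dimensional grid might look like the following; the specific ranges here are illustrative, not necessarily the notebook's.

# Six C values and eight gamma values: 6 x 8 = 48 grid points
c_vals2 = np.logspace(-1, 4, 6)
g_vals = np.logspace(-6, 1, 8)

param_grid_2d = {'svc__C': c_vals2, 'svc__gamma': g_vals}

gsc2 = GridSearchCV(pipe, param_grid_2d, cv=skf)
gsc2.fit(X_train, y_train)

print('Best parameters =', gsc2.best_params_)
print('Best CV score   =', gsc2.best_score_)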
Again, we could turn it into a data frame.
But this time, we can actually compute statistics on this.
So you could say, what's the mean fit time?
What were the different scores?
Et cetera. And that's useful to see,
that sort of statistical analysis of all that data.
We could also make a heat map of the parameters.
So here's gamma, here's C, and you could see,
that this particular part of parameter space is fairly flat.
And so, if we really want to dig into this,
we're probably going to have to do a finer grid in this particular area.
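As a sketch, reusing the two-dimensional search above, the heat map could be built like this; note the reshape assumes the grid is enumerated with gamma varying fastest, which matches how scikit-learn orders the combinations.

import matplotlib.pyplot as plt

# Mean test scores reshaped into a (C, gamma) grid
scores_2d = gsc2.cv_results_['mean_test_score'].reshape(len(c_vals2), len(g_vals))

plt.imshow(scores_2d, origin='lower', cmap='viridis')
plt.xticks(range(len(g_vals)), [f'{g:g}' for g in g_vals], rotation=45)
plt.yticks(range(len(c_vals2)), [f'{c:g}' for c in c_vals2])
plt.xlabel('gamma')
plt.ylabel('C')
plt.colorbar(label='Mean CV score')
plt.show()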
Now, we can do other things.
This is demonstrating how,
depending on the kernel you're using for SVC,
you may have a different set of parameters.
So we can actually have a list of dictionaries,
and the grid search will go through these.
So here, we have the parameters for an RBF,
which is both C and gamma.
A linear SVC only has the C hyperparameter, though.
So we can apply the grid search to this,
and it will now evaluate all of these models
with the same C value grid and gamma grid that we've defined before.
When we apply it, it takes a little longer than the other one,
pulls out the best model,
that should say model, not gamma.
The Best C value, Best gamma,
Best cross validation score.
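A sketch of that list-of-dictionaries grid, reusing the C and gamma values from above:

# The RBF kernel gets both C and gamma; the linear kernel only gets C
param_grid_list = [
    {'svc__kernel': ['rbf'], 'svc__C': c_vals2, 'svc__gamma': g_vals},
    {'svc__kernel': ['linear'], 'svc__C': c_vals2},
]

gsc3 = GridSearchCV(pipe, param_grid_list, cv=skf)
gsc3.fit(X_train, y_train)

print('Best parameters =', gsc3.best_params_)
print('Best CV score   =', gsc3.best_score_)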
And then, we can actually apply a parameter grid,
and this effectively takes what we've done before and turns it into an actual grid object,
which allows us to enumerate what those parameter combinations are.
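As a sketch, ParameterGrid can enumerate those same combinations explicitly:

from sklearn.model_selection import ParameterGrid

pg = ParameterGrid(param_grid_list)
print(len(pg))           # total number of parameter combinations in the grid
print(list(pg)[:3])      # the first few combinations, as plain dictionaries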
Now, if you have a lot of hyper parameters and you're sampling them at a fine resolution,
the number of model evaluations blows up very quickly.
So, typical grid search where you evaluate
every model uniformly on all the points can take forever.
So, one way to try to speed up the process
is randomized search cross validation.
The idea is that you don't test every point.
You start sampling randomly and try to figure out what
the shape of this hyperparameter space looks like.
If you remember from the gamma C space,
we had a large part that was fairly flat,
and if we can find that quickly,
then we don't have to explore all the other parts of parameter space.
So we can do this by basically just swapping out
the grid search we were using before for a randomized search.
We still use the same support vector classification pipeline.
We're going to use a new parameter specification,
an explicit dictionary,
saying that the C parameter should be uniformly sampled,
while gamma should be drawn from these G values.
And then we're going to iterate 20 times.
When this goes through, you can see it did
20 parameter combinations, that's what we said to do.
It took 12 seconds,
and we can see that our values come out.
Pretty good cross validation score.
There's our C value, there's our gamma.
So that's a way to speed up that process.
You don't effectively do all the evaluations,
you do a lot of them and you sort of get
a reasonable feel for what that optimal value might be.
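Here is a hedged sketch of that randomized search, reusing the pipeline and cross validator from before; drawing C from scipy's uniform distribution is my assumption about the continuous sampling.

from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV

# C is drawn from a continuous uniform distribution; gamma from the fixed list
param_dist = {'svc__C': uniform(loc=0.1, scale=1000), 'svc__gamma': g_vals}

rsc = RandomizedSearchCV(pipe, param_dist, n_iter=20, cv=skf, random_state=23)
rsc.fit(X_train, y_train)

print('Best parameters =', rsc.best_params_)
print('Best CV score   =', rsc.best_score_)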
Now, one thing you might want to think about is,
when you don't have a lot of training data,
how do you use cross validation effectively and still maintain some sort of testing?
One way to do that is to actually do nested cross validation,
where you do a cross validation,
and then you do another cross validation inside that.
So you effectively do a trained test split.
And then on the train, you do another cross validation.
And this effectively gets you folds of folds, if you will.
So, that's what we're going to do here.
We're first going to iterate through a cross validator.
We're going to split our data into a training set and a testing set.
We're then going to apply a grid search to this new data,
where you see it right here.
And then, we're going to pull out what is our best estimator and we're going
to accumulate or print out the results.
So here we're going to have the outer loop,
which is a five fold cross validation.
We then come in and do a grid search on that particular fold,
and display the results.
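A sketch of that nested loop, reusing the pipeline and parameter grids from the earlier sketches:

import time

# Outer stratified five-fold split for testing; an inner grid search (with its
# own cross-validation) runs on each outer training fold
outer_cv = StratifiedKFold(n_splits=5)

for train_idx, test_idx in outer_cv.split(digits.data, digits.target):
    X_tr, X_te = digits.data[train_idx], digits.data[test_idx]
    y_tr, y_te = digits.target[train_idx], digits.target[test_idx]

    start = time.time()
    gs = GridSearchCV(pipe, param_grid_list, cv=5)   # inner cross-validation
    gs.fit(X_tr, y_tr)

    print(f'Best CV score = {gs.best_score_:.3f}, '
          f'best params = {gs.best_params_}, '
          f'time = {time.time() - start:.1f} s')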
When we do this, we can see that there is a Best CV score,
pretty high, what's the kernel?
What's the C value? How long did that take?
We then do the next one, et cetera, et cetera.
And you notice that each of these takes on the order of 80 seconds.
So that's five of them.
That's a lot of time to run that particular cell.
The benefit of this is that
we are maximizing the use of the data to select the best model hyperparameters.
So we've gone through a lot in this notebook and in this lesson.
We've introduced the idea of grids and
using them to search for the optimal hyperparameter combinations.
We've talked about how this can be used in pipelines,
different algorithms, and classification.
If you have any questions about this,
please let us know. And of course, good luck.