Hello, and welcome to this lesson, which will introduce the Support Vector Machine algorithm.
SVM is, historically,
a very popular machine learning algorithm because it's
both fairly simple to understand and very powerful.
It can be applied both to classification problems,
where it is often called SVC,
and to regression problems.
This particular lesson has two components that you need to work through.
The first is a Support Vector Machine notebook that's online.
It's part of a book, the Python Data Science Handbook, by Jake VanderPlas.
I'm going to show you this very quickly and then indicate where you can stop.
There's an example towards the end,
which is optional, and you don't have to go through.
Then there's our notebook,
which is the introduction to Support Vector Machine.
So, first, this notebook is part
of this book that Jake VanderPlas has written on Python Data Science.
You can, of course, buy the book and support the author,
as well as have your own copy of the material,
but he makes the material available as Python Notebooks online.
So, here, he's talking about motivating SVM,
maximizing the margin,
how to use SVC, et cetera.
These are all things that we'll talk about as well,
until you get down here towards the bottom, where he works through a specific example.
And I don't ask you to go through this example,
the face recognition one, because it's a little more complex.
So, go through everything up to about here.
If you want to, you can go farther,
but that's a little bit more complex
because what it's talking about is actually working with images.
Our notebook, on the other hand,
is going to focus more on the Iris dataset,
as well as the Adult and Auto MPG datasets that we've seen before.
The content is going to talk about
the concept of hyperplanes and how to use those to split data,
and also introduce the idea of a non-linear kernel.
Then we're going to get into classification with Support Vector Machines.
We'll look at the Iris dataset.
We'll look at the decision surface with the SVC.
We'll also look at the impact of hyperparameters on the decision surface.
We'll then get into classification on the Adult Data and talk
about SVC on unbalanced classes and how you can try to handle that effect.
I'm also going to introduce the receiver operating characteristic curve, or ROC curve, and the area under the curve, or AUC.
And we're going to look at the gain and lift charts as well.
These are concepts that are important for classification in general, not just for SVC,
but we're going to introduce them here with the results from SVC.
Lastly, we're going to look at Support Vector Machine: Regression,
and use that to make predictions on the Auto MPG Data.
So, we have our standard setup code,
then we're going to talk about SVC and the importance of hyperplanes.
A Support Vector Machine, basically,
works by dividing the data with hyperplanes into the resulting classes for classification,
whereas the hyperplanes themselves form the predictive model for regression.
So, it's important to understand what these hyperplanes are.
I like to look at this visually.
And so, we can make a plot of the Iris dataset,
and that's what these code cells are doing.
They're making a plot of the Iris dataset and
showing planes that we can split the data with.
We've actually calculated the optimal planes to split these datasets,
these three classes, apart from each other.
The way you look at this is,
the first line, up here,
we're splitting these particular classes,
these two classes, the red and the green.
So, the green line is the actual split,
and then the red and the blue are your plus and minus, if you will.
And then it's the same thing down here:
we have the plus and minus.
I've also circled, in light yellow,
what are known as the support vectors.
These are the data points that determine these boundaries.
So, we're going to split the data such that
these support vectors are equally far from that splitting plane.
Here, it's a much tighter split,
and we actually have some data that are
misclassified because they're on the wrong side. That's okay.
That's one of the features of SVC:
it allows for some misclassifications
if, in general, the split works very well.
Now, technically, there should be three sets of
hyperplanes here because there are three classes.
I've removed one of them,
only showing two, just to make it easier to visualize.
And that's in the code. You can actually take that out and show all three if you want.
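As a rough sketch of what those plotting cells are doing (illustrative code, not the notebook's exact cells), you could fit a linear SVC on two Iris features and draw the split, the plus and minus margins, and the circled support vectors yourself:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Two features and two classes so the hyperplane is a visible line
iris = load_iris()
mask = iris.target != 2                      # keep classes 0 and 1 only
X, y = iris.data[mask][:, :2], iris.target[mask]

model = SVC(kernel='linear', C=1e6).fit(X, y)

# Draw the separating line (offset 0) and the +/- margins from the fitted coefficients
w, b = model.coef_[0], model.intercept_[0]
xs = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
for offset, color in [(0, 'green'), (1, 'blue'), (-1, 'red')]:
    plt.plot(xs, -(w[0] * xs + b - offset) / w[1], color=color)

plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm')
plt.scatter(*model.support_vectors_.T, s=200,
            facecolors='none', edgecolors='y')   # circle the support vectors
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.show()
```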
I also want to introduce non-linear kernels.
The idea here is that SVC works very well for linear splits.
But if your data is curved or has non-linearities,
SVC is going to have trouble.
The way you handle that data is to apply what's known as the kernel trick,
where we transform the data into a space where the data is linearly separable.
Now, this may be confusing,
so I think an example is going to be helpful.
What we do is first start with some data.
And if you look at this data,
it's actually radially distributed:
there's effectively a circle around this data,
so this is class one inside and this is class two out here.
So, what we can do is perform a transformation, a polar transformation,
so that now we plot the radius versus the angle,
and you can see that there's a very nice simple linear split.
This is an example of the kernel trick,
transform the data into a space where there is a simple linear split.
We're not asking you in this course to figure out
all the best kernel tricks that you might apply to the data.
Scikit-learn has some capability of doing that on its own.
But I want you to understand what that actually means.
Here, we go from a non-linear split,
where we would actually have to use a circle,
which is definitely non-linear,
and transform the data into a different coordinate system where there is a linear split,
and it's then easy for SVC to make this classification.
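Here is a small illustrative sketch of that idea (not the notebook's code): generate radially separated data, and show that a plain polar transform, radius versus angle, makes a linear SVC work, just as the RBF kernel does implicitly:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(23)

# Class 0 lives inside a circle of radius ~1, class 1 outside it
angle = rng.uniform(0, 2 * np.pi, 200)
radius = np.concatenate([rng.uniform(0, 0.8, 100), rng.uniform(1.2, 2.0, 100)])
X = np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])
y = np.repeat([0, 1], 100)

# A linear SVC struggles in the original (x, y) space...
print(cross_val_score(SVC(kernel='linear'), X, y).mean())

# ...but in polar coordinates (radius, angle) the split is a simple line
X_polar = np.column_stack([np.hypot(X[:, 0], X[:, 1]),
                           np.arctan2(X[:, 1], X[:, 0])])
print(cross_val_score(SVC(kernel='linear'), X_polar, y).mean())

# The RBF kernel performs a similar transformation for us implicitly
print(cross_val_score(SVC(kernel='rbf'), X, y).mean())
```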
So, that brings us to support vector classification.
We're going to use many of the same techniques here as we
did with logistic regression or decision tree classification.
We have some hyperparameters.
One of the most important is C.
This is a penalty term for regularization,
and we have not discussed regularization yet,
so we're going to set this high to minimize its effect.
And then there's different kernels that we can apply.
This is to implement the kernel trick.
There is linear, which is no kernel trick.
There's RBF, or radial basis function,
which is similar to what we just showed.
There are polynomial, sigmoid, and even user-defined functions.
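In scikit-learn, those two hyperparameters look roughly like this (the values shown are illustrative, not the notebook's exact settings):

```python
from sklearn.svm import SVC

# C is the regularization penalty; a large C means very little regularization
linear_model = SVC(kernel='linear', C=1e6)

# Other kernels implement the kernel trick for us
rbf_model = SVC(kernel='rbf', C=1e6)                # radial basis function
poly_model = SVC(kernel='poly', degree=3, C=1e6)    # polynomial of degree 3
sigmoid_model = SVC(kernel='sigmoid', C=1e6)
```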
So, we're going to first use the Iris dataset to demonstrate SVC.
If you look right away here,
you see that it is performing very accurately,
better than k-nearest neighbors or a decision tree,
and the classification report and confusion matrix back that up.
There's only one misclassification.
Now, if you change the random seed,
these results will change.
That's because there is a bit of randomness here, in the train/test split,
as well as, in some cases,
in the classification algorithms themselves.
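A minimal sketch of that Iris workflow, assuming a linear kernel and an illustrative random_state (not necessarily the notebook's choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

iris = load_iris()
# Changing random_state changes which points land in the test set,
# which is why the reported scores shift a bit from run to run
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=23)

model = SVC(kernel='linear', C=1e6).fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f'Accuracy: {model.score(X_test, y_test):.3f}')
print(classification_report(y_test, y_pred, target_names=iris.target_names))
print(confusion_matrix(y_test, y_pred))
```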
We can also look at the decision surface for SVC.
Here, you notice the nice hyperplanes.
These are lines coming through here,
showing you the SVM split.
We can then vary the hyperparameters and see what happens to our decision surface.
Here, I'm changing the different kernels,
and you can see how the accuracy changes.
In this case, linear works really well; RBF,
the radial basis function,
and sigmoid work pretty well as well.
So, here's the nice linear split.
And then, when we change to a radial basis or a polynomial,
notice you start getting curvature.
This is because we've transformed the data into a different space.
And when we come back to the original features, that's now curved.
And this shows you how you can handle non-linear decision surfaces
or boundaries with these kernel tricks.
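A sketch of how such a decision-surface comparison might be coded, using two Iris features so the surface can be drawn in the plane (illustrative, not the notebook's plotting code):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Two features so the decision surface can be drawn in the plane
iris = load_iris()
X, y = iris.data[:, :2], iris.target

xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, kernel in zip(axes, ['linear', 'rbf', 'poly', 'sigmoid']):
    model = SVC(kernel=kernel, C=1.0).fit(X, y)
    # Predict every point on the grid to color the decision regions
    zz = model.predict(np.column_stack([xx.ravel(), yy.ravel()])).reshape(xx.shape)
    ax.contourf(xx, yy, zz, alpha=0.3, cmap='viridis')
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', edgecolors='k')
    ax.set_title(f'{kernel}: accuracy = {model.score(X, y):.2f}')
plt.show()
```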
We can also look at how
more complex data can be classified with this particular technique.
This notebook goes through this much faster because we've already introduced the data.
We can actually just go straight to creating our features and
our labels, and we have our data.
We can then actually do a test/train split.
We can then apply an SVM.
In this case, we don't use the standard SVC.
The reason is that this is a big dataset,
and standard SVC can use a lot of memory and be slow.
If we use LinearSVC instead,
it's going to apply SVC but with a linear kernel,
so, no kernel trick.
It'll be faster because it uses a different underlying implementation.
When we run it, we can see that we get a pretty reasonable accuracy.
Remember, our zero model was around 75 percent.
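A minimal sketch of that step, where x_data and y_data are placeholders for the Adult features and labels the notebook builds:

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# x_data / y_data stand in for the Adult features and labels built earlier
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.4, random_state=23)

# LinearSVC uses a different underlying implementation (liblinear) than SVC,
# so it scales much better to a dataset of this size, but it only supports a
# linear kernel, i.e. no kernel trick
model = LinearSVC().fit(x_train, y_train)
print(f'Accuracy: {model.score(x_test, y_test):.3f}')
```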
Then we can look at the general classification metrics.
Again, overall, pretty good.
Not very good right here.
So, that's something that we would want to look at improving,
maybe dropping some features or changing some features.
In this case, we can look at it and say, "Well, you know what?
Our classes were unbalanced.
Maybe we need to do something about that."
So, this is showing again our zero model performance,
and we can then change our hyperparameters so that the class weight is balanced.
And then, when we run it, we notice that our SVC has gotten much better.
In particular, the recall
here for the high-income class has gotten much better.
Remember, it was about 0.2 before, so it's almost doubled.
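The only change needed for that is the class_weight hyperparameter; again, x_train and the other names here are placeholders for the notebook's variables:

```python
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# class_weight='balanced' reweights the penalty so the minority
# (high-income) class counts more during training
model = LinearSVC(class_weight='balanced').fit(x_train, y_train)
print(classification_report(y_test, model.predict(x_test)))
```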
Now, when we look at standard performance metrics,
such as precision, recall,
et cetera, those are single numbers.
A lot of these classification algorithms
now produce probabilistic classifications which we can
use to figure out what's
the optimal threshold in that probability space to get the best classification.
And to do this, we use something called the receiver
operating characteristic curve or ROC curve.
The ROC curve is easier to explain if you look at an example.
And so, here,
we compute a logistic regression, a decision tree,
and an SVC model on our data, and plot the ROC curves for all three.
So, here, this line is random.
If you just guess,
then you're going to get this random curve.
Obviously, you want to be to the left of this; that's better.
The perfect curve is this yellow line here.
You have the true positive rate versus the false positive rate on the ROC curve,
and so, you want to be as close as possible to this yellow curve.
And you can see here that both the logistic regression and the SVC work very well.
The decision tree, not quite as well.
The other thing you can do is compute the area under this curve or the AUC,
and that's what we're displaying here.
And so, you can see these two models were very similar. You'll see the ROC curve a lot.
It's a great way to compare different models, or
different tunings of the same model with different hyperparameter combinations,
to try to figure out which one works best.
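A sketch of how one ROC curve and its AUC are computed in scikit-learn, assuming the labels are encoded as 0 and 1 and reusing the placeholder model from above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Each model needs a continuous score: decision_function works for LinearSVC,
# predict_proba works for logistic regression or decision trees
scores = model.decision_function(x_test)
fpr, tpr, thresholds = roc_curve(y_test, scores)   # assumes y_test is 0/1

plt.plot(fpr, tpr, label=f'SVC (AUC = {auc(fpr, tpr):.2f})')
plt.plot([0, 1], [0, 1], linestyle='--', label='Random guess')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()
```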
Other things that you might want to look at,
however, are the gain or the lift chart.
And the reason we look at these is because sometimes you want
to do more than just make a classification.
Sometimes you want to optimize the results of
your classification for a particular result.
So, here, we actually compute this.
This function is a little complex,
admittedly so,
so you don't have to understand it.
Simply understand that it's going to compute the gain for a classification.
The gain is used to make the gain chart,
which is then also used to create the lift chart.
This plot makes that, and here's our gain chart.
What this shows you is your baseline prediction,
very similar to your ROC curve,
versus how much gain we are getting
as we use more and more of the test data.
And the idea is, at some point,
you may stop getting benefits by adding more data into your analysis.
In other words, are there certain customers that you want
to hit first when you're targeting your advertising budget?
Say, you only have $1,000,
and it costs $10 per customer to target them.
You obviously don't want to just randomly grab people.
You want to try to find those customers that are going to be
the best return on value and hit those with your budget first.
That's the idea behind the gain and lift chart.
That was the gain chart. The lift chart is simply
the lift above the random baseline that you saw in the gain chart.
So, in other words, you want to be above one.
And as we add more and more test data, we're going to approach one.
And the idea is, this allows you to figure
out which customers you would get the best performance from.
And so, this is just decision tree,
and logistic regression, and SVM.
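The notebook's gain function is more involved, but the core idea can be sketched like this, assuming 0/1 labels and the placeholder scores from above: sort the test points by predicted score, and track what fraction of all positives you have captured as you work down the list.

```python
import numpy as np

def cumulative_gain(y_true, y_score):
    """Fraction of all positives captured after contacting the top-k scored cases."""
    order = np.argsort(y_score)[::-1]          # highest predicted score first
    hits = np.cumsum(y_true[order])            # running count of true positives found
    return hits / y_true.sum()                 # gain: share of positives found so far

# The gain divided by the random baseline (the diagonal) is the lift,
# so values above one mean the model is doing better than random targeting
gain = cumulative_gain(np.asarray(y_test), scores)
baseline = np.arange(1, len(gain) + 1) / len(gain)
lift = gain / baseline
```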
We can also apply SVM to regression tasks.
The difference is that now we're making a continuous prediction.
We use our Auto MPG dataset again.
We grab the data if we don't already have it,
set up our analysis with a train/test split,
and then we perform SVR.
Our score is pretty low.
And if we look at these error metrics,
you see that they're higher than they were before.
We might want to play around with changing which features we use.
The mean absolute error is pretty high.
We're only predicting to within five to six miles per gallon.
So, probably, we would want to do better with that if we could.
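A minimal sketch of that regression workflow, where x_mpg and y_mpg are placeholders for the Auto MPG features and the mpg column:

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error

# x_mpg / y_mpg stand in for the Auto MPG features and the mpg target column
x_train, x_test, y_train, y_test = train_test_split(
    x_mpg, y_mpg, test_size=0.4, random_state=23)

# SVR is the regression form of the Support Vector Machine
model = SVR(kernel='rbf').fit(x_train, y_train)
y_pred = model.predict(x_test)

print(f'R^2 score: {model.score(x_test, y_test):.3f}')
print(f'Mean absolute error: {mean_absolute_error(y_test, y_pred):.2f} mpg')
```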
In the student exercise,
I actually give you some ideas for how you might improve that.
So, I'm going to stop here.
The Support Vector Machine is a powerful algorithm,
something you definitely want to be familiar with.
Along with the other algorithms we talk about in this course,
it helps build your toolkit for
applying machine learning to datasets that you may encounter.
If you have any questions, let us know. And good luck.