Now, let's come back to the mathematical solution to the SVM training that we saw before.

Please note that this solution depends only on

dot products of the input vectors X, but not on the vectors X themselves.

We see this both in the function f(X) and in

the objective function that determines the coefficients alpha i and alpha i star.

But this means that instead of working with raw inputs given by vectors X,

we could first produce some features f(X).

The whole math we just outlined will stay the same, and the only difference in

the results will be that the Xs in the functions will be replaced by feature values

f(X), as shown in this formula.
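
The slide with the formula is not part of this transcript. As a sketch only, assuming the standard support vector regression solution from the Smola and Schölkopf tutorial, the replacement looks as follows. The symbols N (number of training points), X_i (training inputs) and b (intercept) are assumptions, and the features are written as \phi(X) here purely to avoid a clash with the model function f(X):

```latex
f(X) = \sum_{i=1}^{N} \left( \alpha_i - \alpha_i^{*} \right) \langle X_i, X \rangle + b
\quad \longrightarrow \quad
f(X) = \sum_{i=1}^{N} \left( \alpha_i - \alpha_i^{*} \right) \langle \phi(X_i), \phi(X) \rangle + b
```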

Now, when would you want to use features instead of raw inputs for machine learning?

Short answer would be,

well, more or less everywhere.

Many machine learning projects proceed by preprocessing raw data,

making many features by all sorts of transformations of the raw data.

In finance, people often produce hundreds of features for various predictive models,

only to retain a few like four or five most predictive features in the final model.

These are called feature engineering and feature selection methods.

While we will not cover these topics in much depth,

you can read about these methods in the references for additional reading for this week.

Here, instead, I want to give you

a simple example of when you may want to use hand-engineered features.

Let's recall one of the previous examples from your homework,

where we compared linear regression and a neural network on non-linear synthetic data.

So because linear regression is called linear regression,

it can't handle non-linear data, correct?

Actually, it's not correct,

linear regression can easily work with non-linear data.

For example, if the data is quadratic in the inputs X, you simply

have to add X squared to the set of predictors.

That's all you need to do if you know that your data is quadratic in X.

All you need to do in such a regression is to find

the coefficients, and this procedure remains the

same no matter what sort of non-linear functions of the original inputs we use.
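
As a minimal sketch of this point, the snippet below fits ordinary least squares to data that is quadratic in x by adding x squared as an extra predictor. The data, the helper names `solve_linear_system` and `fit_least_squares`, and the use of the normal equations are illustrative assumptions, not something taken from the lecture or the homework.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy [a | b] so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_least_squares(features, targets):
    """Ordinary least squares via the normal equations (F^T F) w = F^T y."""
    k = len(features[0])
    ftf = [[sum(row[i] * row[j] for row in features) for j in range(k)]
           for i in range(k)]
    fty = [sum(row[i] * y for row, y in zip(features, targets))
           for i in range(k)]
    return solve_linear_system(ftf, fty)


# Hypothetical noiseless data that is quadratic in x: y = 1 + 2x + 3x^2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1.0 + 2.0 * x + 3.0 * x * x for x in xs]

# Predictors: a constant, x itself, and the added feature x^2.
design = [[1.0, x, x * x] for x in xs]
coefficients = fit_least_squares(design, ys)
print(coefficients)
```

The model is still linear in the coefficients, which is why the same fitting routine works unchanged once the extra column is added.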

So, let's continue in the spirit of

the previous example and consider how we can work with quadratic features.

Let's say we have points in the space R two, that is, on a plane.

Then each point has coordinates X1 and X2.

If we want to consider quadratic features, a proper set will be a set of

three features: X1 squared, X2 squared, and the product of X1 and X2,

which for convenience we rescale by the square root of 2.

Now, because there are three such quadratic features,

we can view this list of features as a vector in a three-dimensional space, R three.

If we now take two such vectors corresponding to

two points X and X prime in the original space and compute their dot product,

it turns out to be the square of the dot product of X and X prime,

as you can easily check yourself.
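
The check can be sketched numerically as follows. The two sample points and the function names `quadratic_features` and `dot` are arbitrary choices made for illustration.

```python
import math


def quadratic_features(x):
    """Map a point (x1, x2) in the plane to (x1^2, x2^2, sqrt(2) * x1 * x2)."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2.0) * x1 * x2)


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


# Two arbitrary points in the original two-dimensional space.
x = (1.5, -2.0)
x_prime = (0.5, 3.0)

# Dot product in the three-dimensional feature space ...
lhs = dot(quadratic_features(x), quadratic_features(x_prime))
# ... versus the squared dot product in the original space.
rhs = dot(x, x_prime) ** 2
print(lhs, rhs)
```

The square root of 2 is what makes the cross term 2 * x1 * x2 * x1' * x2' come out with the right coefficient, so the two numbers agree up to floating-point rounding.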

Now, what does this mean for our SVM regression?

Well, because everything in SVM depends only on the dot product of features,

it means that by using quadratic features we

capture quadratic effects in the original data.

By using other more complex features we can capture higher order effects such as cubic,

quartic, and so on.

So, the obvious question here would be: fine,

but how can we construct such features for SVM?

Another related question

would be whether we can produce many features.

Going all the way with this question: can we produce an infinite number of features?

Will it help to improve the performance of an SVM?

The SVM actually provides its own, very interesting answer to these questions.

The SVM answer to these questions amounts to the so called kernel trick.

To understand it, let's come back to

the model expressed via a dot product of feature vectors.

This time, let's rewrite it by introducing the kernel K of X and X prime,

which we define as the dot product of two feature vectors,

f(X) and f(X prime).

So a kernel is a function of two inputs X and X prime rather than just one data point.
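
To illustrate what working with the kernel directly means, here is a small sketch that evaluates the same model in two ways: once through a kernel of two inputs, and once through explicit feature vectors. The quadratic kernel, the training points, the coefficients (standing in for the differences alpha i minus alpha i star) and the intercept are all hypothetical values chosen for the illustration.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def quadratic_features(x):
    """Explicit feature map for a point (x1, x2) in the plane."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2.0) * x1 * x2)


def quadratic_kernel(x, x_prime):
    """Kernel K(x, x') = (x . x')^2, computed without building features."""
    return dot(x, x_prime) ** 2


def predict_with_kernel(x, points, coefs, intercept, kernel):
    """f(x) = sum_i coefs[i] * K(points[i], x) + intercept."""
    return sum(c * kernel(p, x) for c, p in zip(coefs, points)) + intercept


def predict_with_features(x, points, coefs, intercept, feature_map):
    """Same model via an explicit weight vector w = sum_i coefs[i] * phi(points[i])."""
    mapped = [feature_map(p) for p in points]
    w = [sum(c * m[j] for c, m in zip(coefs, mapped))
         for j in range(len(mapped[0]))]
    return dot(w, feature_map(x)) + intercept


# Hypothetical training points and coefficients (not fitted to any data).
points = [(1.0, 0.0), (0.0, 2.0), (-1.0, 1.0)]
coefs = [0.5, -0.25, 1.0]
intercept = 0.1
x_new = (0.5, -1.5)

via_kernel = predict_with_kernel(x_new, points, coefs, intercept, quadratic_kernel)
via_features = predict_with_features(x_new, points, coefs, intercept, quadratic_features)
print(via_kernel, via_features)
```

Both routes give the same prediction up to rounding, but the kernel route never constructs the feature vectors, which is what makes very high-dimensional feature spaces affordable.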

The main idea of SVM is to model kernels directly rather than feature vectors.

Now, why can this idea be useful?

The main reason for this is that there exist a number

of conditions that the function K of X and X prime should satisfy

in order to be a valid kernel.

These conditions follow from the requirement that

a valid kernel should be representable as

a dot product of some high-dimensional

and possibly even infinite-dimensional feature vectors.

You can find some details on these conditions in the review by

Smola and Schölkopf, and also in the books by Bishop or Géron.

The main point is that the machine learning community has developed a number of

good kernels with theoretical justifications

and with good empirical performance on practical problems.

Let me quickly go over some of the most popular kernels.

The first possible kernel is just the original dot product of vectors X and X prime.

It's called a linear kernel.

A linear kernel does not capture non-linearities but on the other hand,

it's easier to work with and SVMs with

linear kernels scale up better than with non-linear kernels.

The other type of kernel used in practice is

the so-called polynomial kernel, which is

a polynomial of degree d of the dot product of two vectors.

This kernel belongs to the more general class of dot-product kernels, which are

always functions of a dot product in the original input space.

One more popular choice is the sigmoid kernel, given by

the hyperbolic tangent function of a linear function of the original dot product.

And finally, probably the most popular non-linear kernel

is the Gaussian Radial Basis Function, or RBF, kernel.

It's given by the exponential of

the negative squared Euclidean distance between

points X and X prime, scaled by a parameter gamma.
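
The four kernels just listed can be sketched as plain functions. The parameter names `degree`, `gamma` and `coef0`, their default values, and the exact parameterization of the polynomial and sigmoid kernels follow the convention used by scikit-learn; that choice is an assumption, since the lecture slide is not reproduced here.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(x, x_prime):
    """K(x, x') = x . x'"""
    return dot(x, x_prime)


def polynomial_kernel(x, x_prime, degree=3, gamma=1.0, coef0=1.0):
    """K(x, x') = (gamma * x . x' + coef0) ** degree"""
    return (gamma * dot(x, x_prime) + coef0) ** degree


def sigmoid_kernel(x, x_prime, gamma=1.0, coef0=0.0):
    """K(x, x') = tanh(gamma * x . x' + coef0)"""
    return math.tanh(gamma * dot(x, x_prime) + coef0)


def rbf_kernel(x, x_prime, gamma=1.0):
    """K(x, x') = exp(-gamma * ||x - x'||^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-gamma * squared_distance)


# Two arbitrary example points.
x = (1.0, 2.0)
x_prime = (3.0, 4.0)
for kernel in (linear_kernel, polynomial_kernel, sigmoid_kernel, rbf_kernel):
    print(kernel.__name__, kernel(x, x_prime))
```

Note that the first three depend on the inputs only through their dot product, while the RBF kernel depends only on the distance between them.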

So, if we use a Gaussian RBF kernel,

parameter gamma will serve as a hyperparameter additional to

the original hyperparameters C and epsilon of the SVM.

All these hyperparameters can be

optimized using methods that we discussed in the first course,

namely using either a separate validation set or relying on cross-validation.
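
One possible way to do such a search, assuming NumPy and scikit-learn are installed, is a cross-validated grid search over C, epsilon and gamma. The synthetic data, the grid values, the five folds and the scoring choice below are illustrative assumptions rather than settings from the course.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Hypothetical noisy one-dimensional regression data.
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(80, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)

# Candidate values for the three hyperparameters of an RBF-kernel SVR.
param_grid = {
    "C": [0.1, 1.0, 10.0],
    "epsilon": [0.01, 0.1],
    "gamma": [0.1, 1.0, 10.0],
}

# 5-fold cross-validation; scikit-learn maximizes the score,
# so mean squared error enters with a negative sign.
search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)
```

With a separate validation set instead of cross-validation, the same grid would simply be scored on the held-out points.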

If these standard kernels do not work for you for whatever reason,

or if you want to try something else, you can design your own kernels.

Finally, there are also algorithmic approaches

in machine learning to finding an optimal kernel.

This is called kernel learning.

As another extension of the basic SVM formulation that we outlined,

there is also a nice version of support vector regression called nu-SVR.

In this method, parameter epsilon is not set by hand but rather

optimally picked from the data itself, in proportion to the noise in this data.

This method introduces its own hyperparameter,

nu, but I believe all this stuff is for you to read on your own.

And instead, in the next video I will tell you about

some real-life projects with SVMs that I used to work on.