So, now we've looked at the Newton-Raphson method,
which uses the gradient to iteratively solve a function of a single variable x, say.
Now, we'll generalise that to figure out how to do
something similar with a multi-dimensional function,
a function of multiple variables, and how to use the gradient
to find the maxima and minima of such a function.
Later on, this will let us optimise and find
the best fit for the parameters of a function that we're trying to fit.
Say I've got a function like f equals x squared y here.
This function looks like this.
And what I've got here,
is a function that gets big and positive when x is
big and y is positive, and it gets negative when y is negative.
So, if I look down the y axis,
I get a projection that's a straight line.
And if I spin to look down the x axis,
I get an upward parabola for y positive and a downward one for y negative.
And the function is equal to zero
along both of the axes.
Now, at the bottom here in MATLAB,
I've used the surfc function to plot contours where the function has a constant value,
just like the contours of constant height on a topographical map.
So, the question we want to answer in this video is,
how do I find the fastest or steepest way to get down this graph?
Now, previously you found out what the gradient
of a function is with respect to each of its axes,
so you can find df dx just by differentiating the function f with respect to x,
treating all the other variables as constants.
So in this case, we've got a function f of x, y equal to x squared y.
And so, df dx is what we get by differentiating
that treating y as a constant, so that's just 2x times y.
And df dy is, well, the x squared we treat as a constant
and the y differentiates to just one.
So, it's just x squared.
Now, we can write down these two gradients in a vector which we call the Grad of f.
So, this guy we call Grad, and it's just the vector
whose components are those two derivatives.
So, we say that Grad f is that:
the vector found by writing down df by dx
and df by dy in the x and y locations of the vector.
And Grad is like the most awesome vector ever.
Grad is the little vector that combines linear algebra and calculus together.
So, Grad's a lot of fun.
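If you want to check those two partial derivatives for yourself, here's a minimal sketch in Python (rather than the MATLAB used for the plot), using SymPy to differentiate f = x squared y and stack the results into the Grad vector; the names here are just illustrative choices, not anything from the lecture.

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 * y

# Partial derivatives: differentiate treating the other variable as a constant.
df_dx = sp.diff(f, x)   # 2*x*y
df_dy = sp.diff(f, y)   # x**2

# Grad f is just those two derivatives written as the components of a vector.
grad_f = sp.Matrix([df_dx, df_dy])
print(grad_f)           # Matrix([[2*x*y], [x**2]])
```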
So, if we went along a little vector r equal to (dx, 0),
then if we dotted that with Grad f,
we'd get df dx times dx plus zero.
So that's the change in f when we move an amount dx along the x axis.
So, this is doing the right sort of thing.
A dot product of Grad f with a vector is going to take
some amount of df dx along the x bit of that vector
and some amount of df dy along the y bit of that vector.
So, it's going to tell us how much we've gone up or down,
and that's sort of how Grad works.
And the important thing to remember is that
this is evaluated at a location, some point (a, b) in the space,
where f and its derivatives take particular values.
So then, if we want to know how much the function will change
when we move along some unit vector in an arbitrary direction,
we'll call that unit vector r hat and give it components c and d, so that
c squared plus d squared is equal to one.
So, df is going to be
df dx times c plus df dy times d. So,
what we're doing here is we're doing Grad of f dotted with r hat.
And we call this thing the directional gradient.
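Here's a small numerical sketch of that directional gradient for f = x squared y; the evaluation point (1, 2) and the direction (3, 4), normalised to a unit vector, are arbitrary choices for illustration, not values from the lecture.

```python
import numpy as np

def grad_f(x, y):
    """Gradient of f(x, y) = x**2 * y."""
    return np.array([2 * x * y, x**2])

# Evaluate at an example point (a, b).
a, b = 1.0, 2.0

# A unit vector r_hat = (c, d) with c**2 + d**2 = 1.
r_hat = np.array([3.0, 4.0])
r_hat = r_hat / np.linalg.norm(r_hat)      # (0.6, 0.8)

# The directional gradient is Grad f dotted with r_hat.
df = np.dot(grad_f(a, b), r_hat)
print(df)                                  # 4 * 0.6 + 1 * 0.8 = 3.2
```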
So, another question we can ask is,
what's the maximum value this directional gradient can take?
So, we've got Grad f as a vector, we're dotting it with r hat, and our question is,
what's the maximum value that dot product can take,
given that r hat is a unit vector?
Well, a dot product is the product of the two sizes times cos theta,
and r hat has size one, so the only thing left to play with is the cos theta term.
And that's going to be maximised when cos theta is one,
which means theta is zero, which means r hat is parallel to Grad f. So,
in order to find the maximum value of the directional gradient,
we want an r hat that's actually the normalised version of Grad f.
So, we can write that down. We can do that calculation.
We'll have Grad f dotted with the normalised version of itself,
Grad f divided by its modulus.
But Grad f dotted with itself is just equal to the size of Grad f squared,
and then we divide by the size.
So, the maximum value the directional gradient can take is
just the size of Grad f, and that's therefore
the steepest gradient we can possibly have.
The size of the steepest gradient we could possibly have is just the size of Grad f,
the square root of the sum of the squares of the components of Grad f.
So, that's really fun.
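As a quick sanity check on that claim, this sketch sweeps r hat right around the unit circle and compares the biggest directional gradient it finds with the size of Grad f; the example point (1, 2) is again an arbitrary illustrative choice.

```python
import numpy as np

def grad_f(x, y):
    """Gradient of f(x, y) = x**2 * y."""
    return np.array([2 * x * y, x**2])

g = grad_f(1.0, 2.0)                       # Grad f at the example point (1, 2)

# Try unit vectors r_hat = (cos t, sin t) all the way around the circle.
t = np.linspace(0.0, 2 * np.pi, 10001)
directional = g[0] * np.cos(t) + g[1] * np.sin(t)

print(directional.max())                   # largest directional gradient found
print(np.linalg.norm(g))                   # size of Grad f: sqrt(17), about 4.123
```

The two printed numbers agree (up to the resolution of the sweep), and the maximising direction is the one parallel to Grad f.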
The other question to ask is,
which way does Grad point?
This is less easy to see, but Grad points up the direction of steepest ascent,
perpendicular to the contour lines.
So think, if you're on a mountain and it's foggy,
and you can only see locally around you, and you want to go down the hill,
you don't walk along the contour,
you go down the hill.
So, if the contour is like this and I'm standing on the hill,
then down the hill is at 90 degrees to the contour line.
And to see whether that direction is up or down,
notice that df dx is positive if the function is going up as x increases.
So, actually Grad f points up the hill in
the steepest way, and minus Grad f points down the hill in the steepest way.
So, Grad points up the direction of steepest ascent.
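To see that perpendicular-to-the-contour picture numerically, here's a rough sketch: stepping a tiny amount along Grad f increases f the most, stepping against it decreases f the most, and stepping along a direction perpendicular to Grad f, that is along the contour, barely changes f at all. The point and the step size are illustrative choices of mine.

```python
import numpy as np

def f(p):
    return p[0]**2 * p[1]

def grad_f(p):
    return np.array([2 * p[0] * p[1], p[0]**2])

p = np.array([1.0, 2.0])                        # an example point on the surface
g_hat = grad_f(p) / np.linalg.norm(grad_f(p))   # unit vector up the hill
t_hat = np.array([-g_hat[1], g_hat[0]])         # perpendicular: along the contour

h = 1e-4                                 # a small step
print(f(p + h * g_hat) - f(p))           # biggest increase: straight up the hill
print(f(p + h * t_hat) - f(p))           # roughly zero: walking along the contour
print(f(p - h * g_hat) - f(p))           # biggest decrease: straight down the hill
```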
Now, if we have some data science problem where we want to
minimise the difference between our data values and our model fit,
then what we want to do is find the lowest point of a function.
The function is a kind of badness;
we want the best fit, so we want to find the point where the badness is minimised.
And like Newton-Raphson, we can use the gradient to
go from some trial point down towards the solution.
But where Newton-Raphson is trying to find a zero point,
here we don't know what the minimum value of the function is.
So, we don't know how far down we need to go.
It's as if we're somewhere on a mountain but we don't know the altitude of the valley.
So, what we do in what's called the gradient descent method is,
we take a series of little steps down the hill.
That is, if we started at some position Sn,
then our next position Sn plus one is given by Sn
plus some little step down the hill, and that little step is given by
minus some amount times Grad f,
evaluated at the previous position Sn.
On the graph that's going to look like,
taking a little step,
the little blue one, down the hill and then,
we re-evaluate Grad to make another Sn plus one and make
another step, and so we take a series of steps down the hill.
If we overshoot, that's okay,
because Grad will just take us back to the minimum.
And notice that as the gradient gets shallower when we come towards the turning point,
the steps automatically get smaller, because Grad gets smaller.
So, this is quite a nice method.
There are lots of ways to enhance it,
but that's the main idea.
It's very powerful and simple.
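As a concrete sketch of that update rule, here's a minimal gradient descent in Python. Because f = x squared y has no minimum to find, this uses a simple bowl-shaped example function instead; the step size gamma, the starting point and the stopping tolerance are all illustrative choices of mine, not values from the lecture.

```python
import numpy as np

def f(p):
    """Illustrative bowl-shaped function with its minimum at (1, -2)."""
    return (p[0] - 1)**2 + 2 * (p[1] + 2)**2

def grad_f(p):
    return np.array([2 * (p[0] - 1), 4 * (p[1] + 2)])

gamma = 0.1                        # "some amount": the step size
s = np.array([4.0, 3.0])           # an arbitrary starting position S_0

for n in range(1000):
    step = -gamma * grad_f(s)      # S_{n+1} = S_n - gamma * Grad f(S_n)
    s = s + step
    if np.linalg.norm(step) < 1e-8:   # steps shrink as the gradient flattens out
        break

print(s)                           # ends up close to the true minimum at (1, -2)
```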
The other thing to notice is that if there are multiple local minima in the landscape,
then we might get stuck in one of them.
And of course, which one we find depends on our starting point.
We won't find the others, we'll only find one at a time.
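To illustrate that, here's a small sketch on a one-dimensional double-well function, which is my own illustrative choice rather than anything from the lecture; the same gradient descent settles into a different local minimum depending on where it starts.

```python
def grad_w(x):
    """Derivative of the double-well w(x) = x**4 - 4*x**2 + x."""
    return 4 * x**3 - 8 * x + 1

def descend(x, gamma=0.01, steps=2000):
    for _ in range(steps):
        x = x - gamma * grad_w(x)   # the same little-steps-down-the-hill update
    return x

print(descend(-3.0))   # settles in the left-hand minimum, near x = -1.47
print(descend(+3.0))   # settles in the right-hand minimum, near x = +1.35
```

Starting on the left we never see the right-hand minimum, and vice versa.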
So in the general case,
there are some problems but nevertheless this is
quite a neat method just following little steps down the hill.
And this is probably the main way in which most of
the numerical optimisers that people use in the real world actually work.
So, what we've done in this video is we've
introduced ourselves to Grad, the gradient vector,
and merged calculus and vectors together to take
our first steps in vector calculus.
That's allowed us to find a method that just uses the gradient to step our way towards
solving a problem in the multi-variable case, and that method is called Gradient Descent.