In this video, we'll talk about gradient descent,

a generic method that can optimize any differentiable loss function.

So, we already know loss functions for regression,

like mean squared error,

or for classification, like cross-entropy.

Of course, there are many other loss functions,

and it would be good to have some generic method that

can take any differentiable loss function and find its minimum.

And this method is gradient descent and its extensions.

So suppose that we have some loss function L(w) and we want to minimize it.

How can we do it?

Let's take some initialization w_zero.

It's just some point on our surface.

And of course, the surface of our function can be very complex.

It can have multiple minima,

maxima, like for example on this graph.

And we want to find some local minimum of our function.

Okay. So how can we improve w_zero?

We should somehow find a direction in which the function decreases at this point,

and take a step in that direction.

To do this, we can recall some calculus.

There is a gradient vector, which is essentially a vector of

partial derivatives with respect to all parameters of our function, all the w's.

The gradient points in the direction of steepest ascent of

our function, and minus the gradient points

in the direction of steepest descent of our function.

So, if we want to minimize our function,

to minimize the value of the loss at w_zero,

we should just calculate the gradient at the point w_zero and step in

the direction of the anti-gradient, the minus gradient.

So, we take w_zero,

calculate the gradient, which is denoted by

nabla L of w_zero, multiply it by some learning rate,

some coefficient eta t, here eta one,

and we subtract this gradient multiplied by eta one from the approximation w_zero.

And this is how we get the next approximation w_1.
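This single update can be sketched in code (a minimal illustration; the quadratic loss and the learning-rate value here are assumptions, not from the lecture):

```python
import numpy as np

# A toy differentiable loss (an assumed example): L(w) = w^2,
# whose gradient is grad L(w) = 2w, with the minimum at w = 0.
def grad_L(w):
    return 2.0 * w

w0 = np.array([3.0])       # initialization w_0
eta = 0.1                  # learning rate eta_1

# One gradient-descent step: w_1 = w_0 - eta_1 * nabla L(w_0)
w1 = w0 - eta * grad_L(w0)
```

After this step, w_1 = 2.4 is closer to the minimizer at 0 than w_0 = 3 was, and repeating the update brings us closer still.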

Then we can continue this process.

So gradient descent looks like that.

We initialize our parameters with w_zero, and on each step,

we calculate the gradient and take a step in the anti-gradient at this point,

and then we check some stopping criteria.

So, for example, if w_t,

the parameter vector at this step, is very close to w_t minus one,

that is, the previous parameter vector,

then we stop, because it looks like we've reached a minimum.
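The whole loop, with the stopping criterion we just described, can be sketched as follows (the quadratic test loss at the end is an assumption used only for illustration):

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, tol=1e-6, max_iter=10_000):
    """Minimize a differentiable function given its gradient `grad`.

    Stops when consecutive iterates w_t and w_{t-1} are closer than `tol`.
    """
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        w_next = w - eta * grad(w)            # step in the anti-gradient
        if np.linalg.norm(w_next - w) < tol:  # stopping criterion
            return w_next
        w = w_next
    return w

# Toy loss (assumed): L(w) = ||w||^2, with gradient 2w; minimum at the origin.
w_min = gradient_descent(lambda w: 2.0 * w, w0=[3.0, -2.0])
```

With this convex toy loss the iterates contract toward the origin, so the stopping criterion eventually fires and the returned point is very close to the true minimum.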

For example, if the level lines of our function look like this one,

we take some initialization w_zero. We calculate gradient.

We step in the direction of the anti-gradient and

get to the point w_one. Then we again calculate the gradient at this point,

take a step etc.

And at some point,

we converge to the local minimum of this function.

Of course, there are many questions about gradient descent.

For example, how to initialize w_zero?

Or how to select a step size Eta t?

Should it be constant, or should it change at every iteration? That is a question.

Or when to stop?

How to choose a stopping criterion?

We've discussed one such criterion,

checking whether our new parameter vector is close to the previous one,

but there are many others.

And of course, to calculate the gradient of our function,

we have to compute the gradient for every example in our training set, because

the loss function is essentially a sum of losses over the training examples.

And this can be expensive. So in practice,

we usually approximate our gradient vector

by some methods that we'll discuss in following videos.
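As a quick numerical check of this decomposition, the full gradient of a squared-error loss on a linear model equals the sum of the per-example gradients (the random data here are an assumption, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # hypothetical training set
y = rng.normal(size=100)
w = rng.normal(size=3)

# Per-example loss l_i(w) = (x_i . w - y_i)^2 has gradient
# 2 * (x_i . w - y_i) * x_i; summing over all examples:
grad_as_sum = sum(2.0 * (x @ w - t) * x for x, t in zip(X, y))

# The same gradient computed in one shot: 2 * X^T (Xw - y).
grad_full = 2.0 * X.T @ (X @ w - y)
```

Both expressions agree exactly, which is why approximating the sum with a subset of examples (as stochastic methods do) is a natural idea.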

Now, when we have gradient descent,

we can take any loss function,

for example, mean squared error.

We can calculate the gradient, which looks like X transposed multiplied by

Xw minus y, and take steps in the direction of the anti-gradient.
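A sketch of this procedure on synthetic data (the data, learning rate, and iteration count are assumptions; constant factors in the gradient are absorbed into the learning rate):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
true_w = np.array([1.5, -0.5])
y = X @ true_w                  # noiseless targets, for clarity

w = np.zeros(2)                 # initialization w_0
eta = 1e-3                      # learning rate
for _ in range(5000):
    grad = X.T @ (X @ w - y)    # gradient of the squared-error loss
    w = w - eta * grad          # step in the anti-gradient

# w converges toward the least-squares solution, here true_w.
```

On this well-conditioned problem the iterates converge to the same weights the analytical least-squares solution would give.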

As you remember, there is an analytical solution

for mean squared error and linear regression,

but it is inferior to gradient descent because, first of all,

gradient descent is easier to implement.

You don't need to solve systems of

linear equations to find the minimum of the function.

Gradient descent is a very general framework

that allows us to minimize any differentiable loss function,

not only mean squared error but also cross-entropy and all other losses.

And also, if we use stochastic versions of gradient descent, which approximate the gradient,

then the method is much more efficient both in terms of computation and memory.

In this video, we discussed gradient descent,

a method that can optimize any differentiable loss function,

and discussed the many questions it raises,

like how to choose learning rate,

or how to initialize w,

or some other questions.

And we'll discuss them in following videos and in following weeks of our course.