So theta j times 0.99 has the effect of

shrinking theta j a little bit towards zero.

So this makes theta j a bit smaller.

And more formally, this makes the squared norm of theta j a little bit smaller.

And then after that, the second term here, that's actually

exactly the same as the original gradient descent update that we had,

before we added all this regularization stuff.

So hopefully this update makes sense.

When we're using regularized linear regression,

what we're doing on every iteration is multiplying theta j by a number

that's a little bit less than one, so it's shrinking the parameter a little bit, and

then we're performing a similar update as before.

Of course that's just the intuition behind what this particular update is doing.

Mathematically what it's doing is it's exactly gradient descent

on the cost function J of theta that we defined on the previous slide that uses

the regularization term.
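The update described above can be sketched in code. This is a minimal illustration, not the lecture's own code; the variable names (alpha for the learning rate, lam for the regularization parameter) and the convention that theta_0 is not regularized are assumptions consistent with the usual formulation.

```python
import numpy as np

def gradient_descent_step(theta, X, y, alpha, lam):
    """One regularized gradient descent update for linear regression.

    Each regularized parameter is first multiplied by a factor slightly
    less than one (the shrinking effect), then gets the ordinary
    gradient step.  theta_0 (the intercept) is left unregularized.
    """
    m = len(y)                                # number of training examples
    grad = (X.T @ (X @ theta - y)) / m        # gradient of the unregularized cost
    shrink = 1 - alpha * lam / m              # a number a little less than one

    new_theta = theta.copy()
    new_theta[0] = theta[0] - alpha * grad[0]               # no shrinking on theta_0
    new_theta[1:] = shrink * theta[1:] - alpha * grad[1:]   # shrink, then step
    return new_theta
```

Note that the shrinking factor and the gradient step together are exactly gradient descent on the regularized cost function, just rearranged to make the shrinking explicit.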

Gradient descent was just one of our two algorithms for

fitting a linear regression model.

The second algorithm was the one based on the normal equation,

where what we did was we created the design matrix X where

each row corresponded to a separate training example.

And we created a vector y, an m dimensional vector

that contains the labels from the training set.

So whereas X is an m by (n+1) dimensional matrix, y is an m dimensional vector.

And in order to minimize the cost function J, we found that one

way to do so is to set theta to be equal to this.

Right, you set theta to (X transpose X) inverse, times X transpose y.


And what this value for theta does is it

minimizes the cost function J of theta, when we're not using regularization.
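The normal equation above can be sketched in a few lines of NumPy. This is an illustrative implementation of the unregularized case the lecture describes; solving the linear system directly is used in place of forming the explicit inverse, which is the standard numerically preferable way to evaluate the same formula.

```python
import numpy as np

def normal_equation(X, y):
    """Solve for theta = (X^T X)^{-1} X^T y without explicitly inverting.

    X is the m x (n+1) design matrix (each row a training example,
    first column all ones for the intercept); y is the m-vector of labels.
    """
    # Solving (X^T X) theta = X^T y is equivalent to applying the inverse,
    # but more stable and efficient than np.linalg.inv.
    return np.linalg.solve(X.T @ X, X.T @ y)
```

For example, fitting the points (1, 2), (2, 3), (3, 4) recovers the line y = 1 + x exactly, i.e. theta = [1, 1].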