0:00

In a previous video, you saw how to compute derivatives and implement gradient descent with respect to just one training example for logistic regression. Now we want to do it for m training examples. To get started, let's remind ourselves of the definition of the cost function J(w, b), which is this average: 1/m times the sum from i = 1 to m of the loss when your algorithm outputs a^(i) on the example with label y^(i), where a^(i) is the prediction on the i-th training example, which is sigma(z^(i)), which in turn is equal to sigma(w^T x^(i) + b).
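As a quick sketch of these definitions, the forward pass and per-example loss might look like this in NumPy (a minimal sketch; the function names are my own, not from the video):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def example_loss(w, b, x_i, y_i):
    """Loss on a single training example (x_i, y_i):
    a = sigma(w^T x + b);  L(a, y) = -(y*log(a) + (1-y)*log(1-a))."""
    z_i = np.dot(w, x_i) + b      # z^(i) = w^T x^(i) + b
    a_i = sigmoid(z_i)            # a^(i) = sigma(z^(i))
    return -(y_i * np.log(a_i) + (1.0 - y_i) * np.log(1.0 - a_i))
```

The overall cost J(w, b) is then just the average of `example_loss` over all m examples.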

OK, so what we showed on the previous slide is, for any single training example, how to compute the derivatives when you have just one training example: dw1, dw2, and db, now with a superscript (i) to denote the corresponding values you get if you do what we did on the previous slide, but using just the one training example (x^(i), y^(i)). Now, notice that the overall cost function is a sum, really an average, because of the 1/m term, of the individual losses. So it turns out that the derivative with respect to, say, w1 of the overall cost function is also going to be the average of the derivatives with respect to w1 of the individual loss terms. But we have already shown how to compute this term, dw1^(i), on a single training example on the previous slide. So what you need to do is really compute these derivatives as we showed on the previous training example, average them, and this will give you the overall gradient that you can use to implement gradient descent.
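Written out as an equation, the averaging relation just described is:

```latex
\frac{\partial J(w,b)}{\partial w_1}
  = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial w_1}
      \mathcal{L}\big(a^{(i)}, y^{(i)}\big)
  = \frac{1}{m} \sum_{i=1}^{m} \mathrm{d}w_1^{(i)}
```

and similarly for w2 and b.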

I know that was a lot of detail, but let's take all of it and wrap it up into a concrete algorithm, showing what you should implement to get logistic regression with gradient descent working. Here's what you can do. Initialize J = 0, dw1 = 0, dw2 = 0, db = 0. What we're going to do is use a for loop over the training set, compute the derivative with respect to each training example, and then add them up. So here's how we do it: for i = 1 through m, where m is the number of training examples, we compute z^(i) = w^T x^(i) + b and the prediction a^(i) = sigma(z^(i)). Then let's add up the cost: J += y^(i) log a^(i) + (1 - y^(i)) log(1 - a^(i)), with a negative sign in front of the whole thing. Then, as we saw earlier, we have dz^(i) = a^(i) - y^(i), and dw1 += x1^(i) dz^(i), and dw2 += x2^(i) dz^(i). I'm doing this calculation assuming that you have just two features, so n is equal to 2; otherwise, you would do this for dw1, dw2, dw3, and so on. Then db += dz^(i), and I guess that's the end of the for loop. Finally, having done this for all m training examples, you will still need to divide by m, because we're computing averages: dw1 /= m, dw2 /= m, db /= m, and likewise J /= m, to compute the averages. And so with all of these calculations, you've just computed the derivatives of the cost function J with respect to each of the three parameters: w1, w2, and b.

Just to comment on the details of what we're doing: we're using dw1, dw2, and db as accumulators, so that after this computation, dw1 is equal to the derivative of the overall cost function with respect to w1, and similarly for dw2 and db. Notice that dw1 and dw2 do not have a superscript (i), because we're using them in this code as accumulators to sum over the entire training set, whereas in contrast, dz^(i) here was dz with respect to just one single training example; that is why it has a superscript (i), to refer to the one training example it's computed on.

So, having finished all these calculations, to implement one step of gradient descent, you implement: w1 gets updated as w1 minus the learning rate times dw1, w2 gets updated as w2 minus the learning rate times dw2, and b gets updated as b minus the learning rate times db, where dw1, dw2, and db are as computed above. And finally, J here will also be a correct value for your cost function. So everything on this slide implements just one single step of gradient descent, and you would have to repeat everything on this slide multiple times in order to take multiple steps of gradient descent. In case these details seem too complicated, again, don't worry too much about it for now; hopefully it will all become clearer when you go and implement this in the programming assignment.
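Putting the whole slide together in code, one step of gradient descent could be sketched like this (a minimal sketch; the function and variable names are my own, and I use a single NumPy array `dw` in place of the separate dw1, dw2 accumulators):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_step(w, b, X, y, alpha):
    """One step of gradient descent for logistic regression, written with
    the two explicit for loops described in the video.
    X has shape (m, n); y has shape (m,); alpha is the learning rate.
    Returns the updated (w, b) and the average cost J at the old (w, b)."""
    m, n = X.shape
    J = 0.0
    dw = np.zeros(n)              # accumulators dw1, dw2, ..., dwn
    db = 0.0
    for i in range(m):            # first for loop: over the m training examples
        z_i = np.dot(w, X[i]) + b
        a_i = sigmoid(z_i)
        J += -(y[i] * np.log(a_i) + (1.0 - y[i]) * np.log(1.0 - a_i))
        dz_i = a_i - y[i]         # dz^(i) = a^(i) - y^(i)
        for j in range(n):        # second for loop: over the n features
            dw[j] += X[i, j] * dz_i
        db += dz_i
    J /= m                        # divide by m to complete the averages
    dw /= m
    db /= m
    w = w - alpha * dw            # one gradient descent update
    b = b - alpha * db
    return w, b, J
```

Repeating this function call many times takes multiple steps of gradient descent, as described above.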

There are two weaknesses, though, with the calculation as we've implemented it here, which is that to implement logistic regression this way, you need to write two for loops: the first for loop is a loop over the m training examples, and the second for loop is a for loop over all the features. In this example, we just had two features, so n was equal to 2, but if you have more features, you end up writing dw1 and dw2, and you have similar computations for dw3 and so on, down to dwn. So it seems like you need to have a for loop over the features, over all n features.

When you're implementing deep learning algorithms, you'll find that having explicit for loops in your code makes your algorithm run less efficiently, and in the deep learning era we've moved to bigger and bigger datasets, so being able to implement your algorithms without using explicit for loops is really important and will help you scale to much bigger datasets. It turns out that there is a set of techniques, called vectorization techniques, that allows you to get rid of these explicit for loops in your code. I think in the pre-deep-learning era, that's before the rise of deep learning, vectorization was a nice-to-have: you could sometimes do it to speed up your code, and sometimes not. But in the deep learning era, vectorization, that is, getting rid of for loops like these two, has become really important, because we're more and more often training on very large datasets, and so you really need your code to be very efficient. So in the next few videos, we'll talk about vectorization and how to implement all of this without using even a single for loop. With that, I hope you have a sense of how to implement logistic regression, or rather gradient descent for logistic regression. Things will become clearer when you do the programming exercise, but before actually doing the programming exercise, let's first talk about vectorization, so that you can implement this whole thing, a single iteration of gradient descent, without using any for loops.
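As a preview of where the next videos are headed, the same quantities can be computed with no explicit for loops. Here's a hedged sketch of that vectorized computation in NumPy (again, the names are my own; the course develops this properly later):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def vectorized_gradients(w, b, X, y):
    """Compute J, dw, and db over all m examples with no explicit for loops.
    X has shape (m, n); y has shape (m,)."""
    m = X.shape[0]
    z = X @ w + b                 # all m values z^(i) at once
    a = sigmoid(z)                # all m predictions a^(i)
    J = -np.mean(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))
    dz = a - y                    # all m values dz^(i)
    dw = (X.T @ dz) / m           # replaces both for loops
    db = np.mean(dz)
    return J, dw, db
```

This returns exactly the averaged gradients that the loop-based version accumulates, just computed with matrix and vector operations.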
