
When you build a neural network, one of the choices you get to make is which activation functions to use in the hidden layers, as well as at the output unit of your neural network. So far we've just been using the sigmoid activation function, but sometimes other choices can work much better. Let's take a look at some of the options. In the forward propagation steps for the neural network, we have these two steps where we use the sigmoid function here, so that sigmoid is called an activation function, and g is the familiar sigmoid function, a = 1 / (1 + e^(-z)).

Â more general case we can have a

Â different function G of Z visually right

Â here where G could be a nonlinear

Â function that may not be the sigmoid

Â function so for example the sigmoid

Â function goes between 0 & 1 an

Â activation function that almost always

Â works better than the sigmoid function

Â is the 10h function or the hyperbolic

Â tangent function so this is Z this is a

Â this is a equals 10 H of Z and this goes

Â between plus 1 and minus 1 the formula

Â for the 10h function is e to the Z minus

Â e to negative V over there some and it's

Â actually mathematically a shifted

Â version of the sigmoid function so as a

Â you know sigmoid function just like that

Â but shift it so that it now crosses a

Â zero zero point and rescale so it goes

Â to G minus one and plus one and it turns
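As a quick sketch (this code isn't from the lecture; the function names are my own), the two formulas can be written out directly, and we can check numerically that tanh really is a shifted, rescaled sigmoid, tanh(z) = 2 * sigmoid(2z) - 1:

```python
import math

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)); output lies between 0 and 1
    return 1.0 / (1.0 + math.exp(-z))

def tanh_activation(z):
    # tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z)); output lies between -1 and 1
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

# Shifted-and-rescaled-sigmoid identity: tanh(z) = 2 * sigmoid(2z) - 1
for z in (-2.0, -0.5, 0.0, 0.5, 2.0):
    assert abs(tanh_activation(z) - (2.0 * sigmoid(2.0 * z) - 1.0)) < 1e-12
```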

It turns out that for hidden units, if you let the function g(z) be equal to tanh(z), this almost always works better than the sigmoid function, because with values between plus 1 and minus 1, the activations that come out of your hidden layer are closer to having a zero mean. Just as sometimes, when you train a learning algorithm, you might center the data so that it has zero mean, using a tanh instead of a sigmoid function kind of has the effect of centering your data, so that the mean of the data is close to zero rather than maybe 0.5, and this actually makes learning for the next layer a little bit easier. We'll say more about this in the second course, when we talk about optimization algorithms as well. But one takeaway is that I pretty much never use the sigmoid activation function anymore; the tanh function is almost always strictly superior.
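To make the centering point concrete, here's a small numerical sketch (my own illustration, not from the lecture, and it assumes roughly zero-mean Gaussian pre-activations): the mean of sigmoid activations sits near 0.5, while the mean of tanh activations sits near 0.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
# Pre-activation values drawn around zero, a rough stand-in for a hidden layer
zs = [random.gauss(0.0, 1.0) for _ in range(10000)]

sigmoid_mean = sum(sigmoid(z) for z in zs) / len(zs)
tanh_mean = sum(math.tanh(z) for z in zs) / len(zs)
# sigmoid activations cluster around 0.5; tanh activations cluster around 0
```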

The one exception is for the output layer, because if y is either 0 or 1, then it makes sense for y hat to be a number that you output between 0 and 1, rather than between minus 1 and 1. So the one exception where I would use the sigmoid activation function is when you're doing binary classification, in which case you might use the sigmoid activation function for the output layer, so g(z^[2]) here is equal to sigma(z^[2]).

So what you see in this example is that you might have a tanh activation function for the hidden layer, and sigmoid for the output layer. So the activation functions can be different for different layers, and sometimes, to denote that the activation functions are different for different layers, we might use these square bracket superscripts as well, to indicate that g^[1] may be different from g^[2], where the square bracket 1 superscript refers to this layer, and superscript square bracket 2 refers to the output layer.
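A minimal sketch of this forward pass (toy weights and function names of my own choosing, not from the lecture): tanh for the hidden layer, sigmoid for the output layer, so y hat lands between 0 and 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    # Hidden layer: g[1] = tanh
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    a1 = [math.tanh(z) for z in z1]
    # Output layer: g[2] = sigmoid, so the output is between 0 and 1
    z2 = sum(w * a for w, a in zip(W2, a1)) + b2
    return sigmoid(z2)

# Purely illustrative numbers for a 2-input, 2-hidden-unit, 1-output network
y_hat = forward(x=[1.0, -2.0],
                W1=[[0.5, -0.3], [0.1, 0.8]], b1=[0.0, 0.1],
                W2=[1.2, -0.7], b2=0.05)
```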

Now, one of the downsides of both the sigmoid function and the tanh function is that if z is either very large or very small, then the gradient, or the derivative, or the slope of the function becomes very small. So if z is very large or z is very small, the slope of the function ends up being close to zero, and this can slow down gradient descent. So one other choice that is very popular in machine learning is what's called the rectified linear unit. The ReLU function looks like this, and the formula is a = max(0, z). So the derivative is 1 so long as z is positive, and the derivative, or the slope, is 0 when z is negative. If you're implementing this, technically the derivative when z is exactly 0 is not well defined, but when you implement this on a computer, the odds that you get z exactly equal to 0.000000000000 are very small, so you don't need to worry about it in practice. You can pretend the derivative when z is equal to 0 is either 1 or 0, and it works just fine, despite the fact that the function is not differentiable there.
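As a sketch (my own function names, not from the lecture), the ReLU and its slope, with the convention just described for z exactly equal to 0:

```python
def relu(z):
    # a = max(0, z)
    return max(0.0, z)

def relu_slope(z):
    # Slope is 1 for z > 0 and 0 for z < 0; at exactly z == 0 the derivative
    # is undefined, but by convention we pick 0 here (1 works just as well).
    return 1.0 if z > 0 else 0.0
```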

So here are some rules of thumb for choosing activation functions. If your output is a 0/1 value, if you're doing binary classification, then the sigmoid activation function is a very natural choice for the output layer. And then for all other units, the ReLU, or the rectified linear unit, is increasingly the default choice of activation function. So if you're not sure what to use for your hidden layer, I would just use the ReLU activation function. That's what you see most people using these days, although sometimes people also use the tanh activation function.

One disadvantage of the ReLU is that the derivative is equal to zero when z is negative. In practice this works just fine, but there is another version of the ReLU, called the leaky ReLU. I'll give you the formula on the next slide, but instead of being zero when z is negative, it just takes a slight slope, like so. So this is called the leaky ReLU. This usually works better than the ReLU activation function, although it's just not used as much in practice. Either one should be fine,

although if you had to pick one, I usually just use the ReLU. The advantage of both the ReLU and the leaky ReLU is that for a lot of the space of z, the derivative of the activation function, the slope of the activation function, is very different from zero. And so in practice, using the ReLU activation function, your neural network will often learn much faster than when using the tanh or the sigmoid activation function, and the main reason is that there's less of this effect of the slope of the function going to zero, which slows down learning. And I know that for half of the range of z, the slope of the ReLU is zero, but in practice, enough of your hidden units will have z greater than zero, so learning can still be quite fast for most training examples.

So let's just quickly recap the pros and cons of different activation functions. Here's the sigmoid activation function: I would say never use this, except for the output layer if you're doing binary classification, or maybe almost never use it. And the reason I almost never use it is that tanh is pretty much strictly superior. So the tanh activation function is this. And then the default, the most commonly used activation function, is the ReLU, which is this. So if you're not sure what else to use, use this one. And maybe, you know, feel free also to try the leaky ReLU, where a might be max(0.01z, z). So a is the max of 0.01 times z and z, and that gives you this bend in the function. And you might say, you know, why is that constant 0.01? Well, you can also make that another parameter of the learning algorithm, and some people say that works even better, but I hardly ever see people do that. But if you feel like trying it in your application, please feel free to do so, and you can see how well it works, and stick with it if it gives you good results.
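As a one-line sketch of that formula (my own function name, not from the lecture): the 0.01 is exposed as a keyword argument, since the lecture notes it can also be treated as a parameter of the learning algorithm.

```python
def leaky_relu(z, slope=0.01):
    # a = max(slope * z, z): a slight slope for negative z instead of flat 0.
    # Treating `slope` as a learnable parameter is the variant some people use.
    return max(slope * z, z)
```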

So I hope that gives you a sense of some of the choices of activation functions you can use in your network. One of the themes we'll see in deep learning is that you often have a lot of different choices in how you build your neural network, ranging from the number of hidden units, to the choice of activation function, to how you initialize the weights, which we'll see later. There are a lot of choices like that, and it turns out that it's sometimes difficult to get good guidelines for exactly what will work best for your problem. So throughout these three courses, I'll keep on giving you a sense of what I see in industry, in terms of what's more or less popular, but for your application, with your application's idiosyncrasies, it's actually very difficult to know in advance exactly what will work best.

So a piece of concrete advice would be: if you're not sure which one of these activation functions works best, try them all, and then evaluate on, like, a holdout validation set, or a development set, which we'll talk about later, and see which one works better, and then go with that. And I think that by testing these different choices for your application, you'd be better at future-proofing your neural network architecture against the idiosyncrasies of your problem, as well as against evolutions of the algorithms, rather than, you know, if I were to tell you to always use a ReLU activation and never use anything else, that just may or may not apply for whatever problem you end up working on, either in the near future or in the distant future.

Â activation functions you've seen the

Â most popular activation functions

Â there's one other question that

Â sometimes is ask which is why do you

Â even need to you

Â activation function at all why not just

Â do away with that so let's talk about

Â that

Â in the next video and when you see why

Â new network do means some sort of

Â nonlinear activation function

Â