0:00

Very, very deep neural networks are difficult to train because of vanishing and exploding gradient problems. In this video, you'll learn about skip connections, which allow you to take the activation from one layer and suddenly feed it to another layer much deeper in the neural network. And using that, you'll build ResNets, which enable you to train very, very deep networks, sometimes even networks of over 100 layers. Let's take a look.

ResNets are built out of something called a residual block, so let's first describe what that is. Here are two layers of a neural network, where you start off with some activations in layer a[l], then you go to a[l+1], and then the activation two layers later is a[l+2].

So let's go through the steps in this computation. You have a[l], and the first thing you do is apply this linear operator to it, which is governed by this equation. So you go from a[l] to compute z[l+1] by multiplying by the weight matrix and adding the bias vector. After that, you apply the ReLU nonlinearity to get a[l+1], and that's governed by this equation, where a[l+1] is g(z[l+1]). Then in the next layer, you apply this linear step again, so it's governed by that equation, which is quite similar to the equation we saw on the left. And then finally, you apply another ReLU operation, which is governed by that equation, where g here is the ReLU nonlinearity. And this gives you a[l+2].

So in other words, for information from a[l] to flow to a[l+2], it needs to go through all of these steps, which I'm going to call the main path of this set of layers.
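As a quick sketch, the main path described above can be written out in a few lines of NumPy. The layer width, random parameters, and variable names here are purely illustrative assumptions, not taken from the video:

```python
import numpy as np

def relu(z):
    # g, the ReLU nonlinearity
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
n = 4  # illustrative width; every layer kept the same size here
a_l = rng.standard_normal(n)                       # a[l]
W1, b1 = rng.standard_normal((n, n)), np.zeros(n)  # parameters of layer l+1
W2, b2 = rng.standard_normal((n, n)), np.zeros(n)  # parameters of layer l+2

# The main path: linear step, ReLU, linear step, ReLU.
z1 = W1 @ a_l + b1   # z[l+1] = W[l+1] a[l] + b[l+1]
a1 = relu(z1)        # a[l+1] = g(z[l+1])
z2 = W2 @ a1 + b2    # z[l+2] = W[l+2] a[l+1] + b[l+2]
a2 = relu(z2)        # a[l+2] = g(z[l+2])
```

Note how a[l] can only reach a[l+2] by passing through both linear steps and both ReLUs.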

In a residual net, we're going to make a change to this. We're going to take a[l] and fast-forward it, copy it, much further into the neural network to here, and just add a[l] before applying the nonlinearity, the ReLU nonlinearity. And I'm going to call this the shortcut. So rather than needing to follow the main path, the information from a[l] can now follow a shortcut to go much deeper into the neural network. And what that means is that this last equation goes away, and we instead have that the output a[l+2] is the ReLU nonlinearity g applied to z[l+2] as before, but now plus a[l]. So the addition of this a[l] here is what makes this a residual block.

And in pictures, you can also modify this picture on top by drawing this shortcut to go here. And we are going to draw it as going into this second layer here, because the shortcut is actually added before the ReLU nonlinearity. So each of these nodes here applies a linear function and a ReLU, and a[l] is being injected after the linear part but before the ReLU part. And sometimes, instead of the term shortcut, you'll also hear the term skip connection, and that refers to a[l] just skipping over a layer, or kind of skipping over almost two layers, to pass information deeper into the neural network.
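Here is a minimal NumPy sketch of that change, with illustrative names and dimensions: the only difference from the plain main path is that a[l] is added to z[l+2] before the final ReLU. This simple form assumes a[l] and z[l+2] have the same dimension:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(a_l, W1, b1, W2, b2):
    """Two layers with a skip connection: a[l+2] = g(z[l+2] + a[l])."""
    z1 = W1 @ a_l + b1     # linear step of layer l+1
    a1 = relu(z1)          # a[l+1] = g(z[l+1])
    z2 = W2 @ a1 + b2      # linear step of layer l+2
    return relu(z2 + a_l)  # shortcut injected before the ReLU, not after

# Illustrative usage with random parameters.
rng = np.random.default_rng(0)
n = 4
a_l = rng.standard_normal(n)
W1, b1 = rng.standard_normal((n, n)), np.zeros(n)
W2, b2 = rng.standard_normal((n, n)), np.zeros(n)
a_out = residual_block(a_l, W1, b1, W2, b2)
```

One consequence of placing the addition before the ReLU: if both linear steps output zero, the block simply passes relu(a[l]) through, so information from a[l] survives unchanged.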

So what the inventors of ResNet (that would be Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun) found was that using residual blocks allows you to train much deeper neural networks. And the way you build a ResNet is by taking many of these residual blocks, blocks like these, and stacking them together to form a deep network.

So let's look at this network. This is not the residual network; this is called a plain network, in the terminology of the ResNet paper. To turn this into a ResNet, what you do is add all those skip connections, all those shortcut connections, like so. So every two layers get the additional change that we saw on the previous slide, turning each pair into a residual block. This picture shows five residual blocks stacked together, and this is a residual network.
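Stacking just means feeding each residual block's output into the next block. A minimal sketch, assuming all blocks share the same width; `residual_block` and the parameter layout here are hypothetical helpers for illustration, not code from the course exercise:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(a, W1, b1, W2, b2):
    # a_out = g(z2 + a): the shortcut is added before the final ReLU
    return relu(W2 @ relu(W1 @ a + b1) + b2 + a)

def resnet_forward(x, blocks):
    """Run x through a stack of residual blocks.

    blocks is a list of (W1, b1, W2, b2) tuples, one per residual block.
    """
    a = x
    for W1, b1, W2, b2 in blocks:
        a = residual_block(a, W1, b1, W2, b2)
    return a

# Five residual blocks stacked together, as in the picture.
rng = np.random.default_rng(0)
n = 4
blocks = [
    (rng.standard_normal((n, n)) * 0.1, np.zeros(n),
     rng.standard_normal((n, n)) * 0.1, np.zeros(n))
    for _ in range(5)
]
out = resnet_forward(rng.standard_normal(n), blocks)
```

Because each block adds its input back in, a block whose weights are all zero simply passes a nonnegative activation through unchanged, which hints at why depth hurts less here than in a plain network.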

And it turns out that if you use your standard optimization algorithm, such as gradient descent or one of the fancier optimization algorithms, to train a plain network, so without all the extra shortcuts or skip connections I just drew in, then empirically you find that as you increase the number of layers, the training error will tend to decrease for a while, but then it will tend to go back up.

And in theory, as you make a neural network deeper, it should only do better and better on the training set. Right? So in theory, having a deeper network should only help. But in practice, having a plain network, so no ResNet, that is very deep means that your optimization algorithm just has a much harder time training, and so, in reality, your training error gets worse if you pick a network that's too deep. But what happens with ResNets is that even as the number of layers gets deeper, you can have the training error keep on going down.

Even if you train a network with over a hundred layers. And some people are now experimenting with networks of over a thousand layers, although I don't see that used much in practice yet. But by taking these activations, these intermediate activations, and allowing them to go much deeper into the neural network, this really helps with the vanishing and exploding gradient problems, and allows you to train much deeper neural networks without really appreciable loss in performance. Maybe at some point this will plateau, this will flatten out, and it won't help that much to go deeper and deeper. But ResNets really are effective at helping train very deep networks.

So you've now gotten an overview of how ResNets work. And in fact, in this week's programming exercise, you get to implement these ideas and see them work for yourself. But next, I want to share with you even more intuition about why ResNets work so well. Let's go on to the next video.