0:30

So let's say convolving with the first filter gives this first 4 by 4 output, and convolving with this second filter gives a different 4 by 4 output. The final thing to turn this into a convolutional neural net layer is that for each of these we're going to add a bias, so this is going to be a real number. And with Python broadcasting, you add the same number to every one of these 16 elements. And then you apply a non-linearity, which for this illustration is a ReLU non-linearity, and this gives you a 4 by 4 output, all right?

After applying the bias and the non-linearity. And then for this thing at the bottom as well, you add some different bias, again, this is a real number. So you add the single number to all 16 numbers, and then apply some non-linearity, a ReLU non-linearity, and this gives you a different 4 by 4 output. Then, same as we did before, if we take these and stack them up as follows, we end up with a 4 by 4 by 2 output. Then this computation, where you go from a 6 by 6 by 3 input to a 4 by 4 by 2 output, is one layer of a convolutional neural network.
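The whole computation just described can be sketched in NumPy. This is only an illustration, not the course's code: the loop-based `conv2d_single` helper and the random input and filter values are made up for the example; only the shapes (a 6 by 6 by 3 input, two 3 by 3 by 3 filters, a 4 by 4 by 2 output) come from the lecture.

```python
import numpy as np

def conv2d_single(image, filt):
    """Valid 'convolution' (really cross-correlation, as usual in deep learning)
    of an (n, n, nc) volume with an (f, f, nc) filter -> (n-f+1, n-f+1)."""
    n = image.shape[0]
    f = filt.shape[0]
    out = n - f + 1
    z = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            # each output entry is a sum over 3*3*3 = 27 elementwise products
            z[i, j] = np.sum(image[i:i + f, j:j + f, :] * filt)
    return z

rng = np.random.default_rng(0)
a0 = rng.standard_normal((6, 6, 3))          # the 6x6x3 input, a[0]
filters = rng.standard_normal((2, 3, 3, 3))  # two 3x3x3 filters
biases = rng.standard_normal(2)              # one real-number bias per filter

# For each filter: convolve, add the (broadcast) bias, apply the ReLU.
maps = [np.maximum(conv2d_single(a0, filters[k]) + biases[k], 0)
        for k in range(2)]
a1 = np.stack(maps, axis=-1)                 # stack into the 4x4x2 output
print(a1.shape)                              # (4, 4, 2)
```

Stacking along the last axis is what turns the two separate 4 by 4 maps into one 4 by 4 by 2 volume.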

So to map this back to one layer of forward propagation in a standard neural network, a non-convolutional neural network. Remember that one step of forward prop was something like this, right? z[1] = W[1] times a[0], where a[0] was also equal to x, and then plus b[1]. And then you apply the non-linearity to get a[1], so that's g(z[1]). So this input here, in this analogy, plays the role of a[0], that is, of x.

2:44

And these filters here, these play a role similar to W[1]. And you remember, during the convolution operation you were taking these 27 numbers, or really 27 times 2, because you have two filters. You're taking all of these numbers and multiplying them, so you're really computing a linear function to get this 4 by 4 matrix. So that 4 by 4 matrix, the output of the convolution operation, plays a role similar to W[1] times a[0]. That's really the output of this 4 by 4 as well as that 4 by 4. And then the other thing you do is add the bias. So this thing here, before applying the ReLU, plays a role similar to z[1]. And then finally, applying the non-linearity, this output really becomes your activation at the next layer. So this is how you go from a[0] to a[1]: first the linear operation, where the convolution carries out all these multiplications, so the convolution is really applying a linear operation, and then you add the biases and apply the ReLU operation. And you've gone from a 6 by 6 by 3 dimensional a[0], through one layer of a neural network, to, I guess, a 4 by 4 by 2 dimensional a[1]. And so 6 by 6 by 3 has gone to 4 by 4 by 2, and that is one layer of a convolutional net.

4:33

Now in this example we have two filters, so we had two features, if you will, which is why we wound up with an output that is 4 by 4 by 2. But if, for example, we instead had 10 filters instead of 2, then we would have wound up with a 4 by 4 by 10 dimensional output volume. Because we'd be taking 10 of these maps, not just two of them, and stacking them up to form a 4 by 4 by 10 output volume, and that's what a[1] would be. So, to make sure you understand this, let's go through an exercise. Let's suppose you have 10 filters, not just two filters, that are 3 by 3 by 3 in one layer of a neural network. How many parameters does this layer have?

5:21

Well, let's figure this out. Each filter is a 3 by 3 by 3 volume, so 3 times 3 times 3, so each filter has 27 parameters, all right? There are 27 numbers to be learned, plus the bias, so that's 28 parameters.

5:50

And then, if you imagine that on the previous slide we had drawn two filters, now imagine that you actually have ten of these, right? 1, 2, ..., 10 of these. Then all together you'll have 28 times 10, so that will be 280 parameters.
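As a quick check of that arithmetic, a couple of lines of Python reproduce the count (the variable names here are just for illustration):

```python
f, nc_prev, num_filters = 3, 3, 10       # 3x3x3 filters, 10 of them
params_per_filter = f * f * nc_prev + 1  # 27 weights plus 1 bias = 28
total_params = params_per_filter * num_filters
print(total_params)                      # 280
```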

Notice one nice thing about this: no matter how big the input image is, the input image could be 1,000 by 1,000 or 5,000 by 5,000, the number of parameters you have still remains fixed at 280. And you can use these ten filters to detect features, vertical edges, horizontal edges, maybe other features, anywhere even in a very, very large image, with just a very small number of parameters.

6:40

So this is really one property of convolutional neural networks that makes them less prone to overfitting. So once you've learned 10 feature detectors that work, you could apply this even to very large images, and the number of parameters still remains fixed and relatively small, as 280 in this example.

All right, so to wrap up this video, let's just summarize the notation we are going to use to describe one layer, a convolutional layer, in a convolutional neural network. So if layer l is a convolution layer, I'm going to use f superscript [l] to denote the filter size. So previously we've been saying the filters are f by f, and now this superscript square bracket [l] just denotes that this is the size of an f by f filter in layer l. And as usual, the superscript square bracket [l] is the notation we're using to refer to a particular layer l.

7:39

And I'm going to use p[l] to denote the amount of padding. And again, the amount of padding can also be specified just by saying that you want a valid convolution, which means no padding, or a same convolution, which means you choose the padding so that the output size has the same height and width as the input size. And s[l] will denote the stride.

8:03

Now, the input to this layer is going to be some dimension. It's going to be some n by n by nc, the number of channels in the previous layer. Now, I'm going to modify this notation a little bit. I'm going to use superscript [l-1], because that's the activation from the previous layer: n[l-1] by n[l-1] by nc[l-1]. And in the examples so far, we've been just using images of the same height and width. But in case the height and width differ, I'm going to use superscript h and superscript w to denote the height and width of the input from the previous layer, all right?

So the input to layer l will be a volume of size nH[l-1] by nW[l-1] by nC[l-1]. It's just that in layer l, the input to this layer is whatever you had from the previous layer, so that's why you have l-1 there. And then this layer of the neural network will itself output a volume. That will be nH[l] by nW[l] by nC[l]; that will be the size of the output.

And so, whereas we saw earlier that the output volume size, or at least the height and width, is given by this formula, (n + 2p - f) over s, plus 1, and then you take the floor of that and round it down. In this new notation, what we have is that the output volume size in layer l is given by the dimension from the previous layer, plus the padding we're using in this layer l, minus the filter size we're using in this layer l, and so on. And technically, this is true for the height, right? So the height of the output volume is given by this, and you can compute it with this formula on the right. And the same is true for the width as well: you cross out h and write in w, and the same formula, with either the height or the width plugged in, computes the height or width of the output volume.
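Putting that formula in one place: nH[l] = floor((nH[l-1] + 2 p[l] - f[l]) / s[l]) + 1, and likewise for nW[l]. A small helper, written here just as a sketch:

```python
from math import floor

def conv_output_size(n, f, p=0, s=1):
    """floor((n + 2p - f)/s) + 1: height or width of a conv layer's output."""
    return floor((n + 2 * p - f) / s) + 1

print(conv_output_size(6, 3))        # 4: the 6x6 -> 4x4 example above
print(conv_output_size(6, 3, p=1))   # 6: a "same" convolution with p = (f-1)/2
```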

10:36

So that's how nH[l-1] relates to nH[l], and nW[l-1] relates to nW[l]. Now, how about the number of channels, where do those numbers come from? Let's take a look. If the output volume has this depth, well, we know from the previous examples that that's equal to the number of filters we have in that layer, right? So we had two filters, and the output volume was 4 by 4 by 2, so that last dimension was 2. And if you had 10 filters, then your output volume would be 4 by 4 by 10. So this, the number of channels in the output volume, is just the number of filters we're using in this layer of the neural network. Next, how about the size of each filter? Well, each filter is going to be f[l] by f[l] by some number, right? So what is this last number? Well, we saw that you needed to convolve a 6 by 6 by 3 image with a 3 by 3 by 3 filter.

11:43

And so the number of channels in your filter must match the number of channels in your input, so this number should match that number, right? Which is why each filter is going to be f[l] by f[l] by nc[l-1]. And the output of this layer, after applying the bias and the non-linearity, is going to be the activations of this layer, a[l]. And that, we've already seen, will be this dimension, right? a[l] will be a 3D volume that's nH[l] by nW[l] by nC[l].

And when you are using a vectorized implementation, or batch gradient descent or mini-batch gradient descent, then you'd actually output A[l], which is a set of m activations, if you have m examples. So that would be m by nH[l] by nW[l] by nC[l], right? If, say, you're using batch gradient descent, in the programming exercises this will be the ordering of the variables: we have the index over the training examples first, and then these three dimensions.

Next, how about the weights, or the parameters, the W parameter? Well, we saw already what the filter dimensions are. So the filters are going to be f[l] by f[l] by nc[l-1], but that's the dimension of one filter. How many filters do we have? Well, nc[l] is the total number of filters, so the weights, really all of the filters put together, will have a dimension given by this times the total number of filters: f[l] by f[l] by nc[l-1] by nc[l]. Because this last quantity is the number of filters in layer l.
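As a sketch of those weight dimensions in NumPy (the numbers just reuse the hypothetical 10-filter example from earlier):

```python
import numpy as np

f, nc_prev, nc = 3, 3, 10          # f[l], nc[l-1], nc[l]
W = np.zeros((f, f, nc_prev, nc))  # all filters stacked: f[l] x f[l] x nc[l-1] x nc[l]
print(W.size)                      # 270 weights; with the 10 biases, 280 parameters
```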

13:45

And then finally, you have the bias parameters, and you have one bias parameter, one real number, for each filter. So the bias will have nc[l] variables; it's just a vector of this dimension. Although later on, we'll see that in code it will be more conveniently represented as a 1 by 1 by 1 by nc[l] four-dimensional matrix, or four-dimensional tensor.
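The reason for that 1 by 1 by 1 by nc[l] shape is NumPy-style broadcasting: with the batched A[l] laid out as (m, nH, nW, nC), a bias of that shape adds one number per filter across all positions and all examples. A small sketch, with made-up sizes:

```python
import numpy as np

Z = np.zeros((5, 4, 4, 10))               # (m, n_H[l], n_W[l], n_C[l]): made-up batch of 5
b = np.arange(10.0).reshape(1, 1, 1, 10)  # one real number per filter
Z_plus_b = Z + b                          # b broadcasts over m, n_H, and n_W
print(Z_plus_b.shape)                     # (5, 4, 4, 10)
```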

14:16

So I know that was a lot of notation, and this is the convention I'll use for the most part. I just want to mention, in case you search online and look at open source code, there isn't a completely universal standard convention about the ordering of height, width, and channel. So if you look at source code on GitHub, or at these open source implementations, you'll find that some authors use the other order instead, where you put the channel first, and you sometimes see that ordering of the variables. And in fact, in multiple common frameworks, there's actually a variable or a parameter for whether you want to list the number of channels first or list the number of channels last when indexing into these volumes. I think both of these conventions work okay, so long as you're consistent. And unfortunately, maybe this is one piece of notation where there isn't consensus in the deep learning literature, but I'm going to use this convention for these videos.

15:24

Where we list the height and width, and then the number of channels last.
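For instance, converting between the two conventions is just a transpose; the sizes here are the hypothetical ones from earlier:

```python
import numpy as np

a_nhwc = np.zeros((5, 4, 4, 10))             # channels-last (m, n_H, n_W, n_C), as in these videos
a_nchw = np.transpose(a_nhwc, (0, 3, 1, 2))  # channels-first (m, n_C, n_H, n_W)
print(a_nchw.shape)                          # (5, 10, 4, 4)
```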

So I know there was certainly a lot of new notation, and maybe you're thinking, wow, that's a lot of notation, how do I remember all of this? Don't worry about it, you don't need to remember all of this notation, and through this week's exercises you'll become more familiar with it. But the key point I hope you take away from this video is just how one layer of a convolutional neural network works, and the computations involved in taking the activations of one layer and mapping those to the activations of the next layer.

And next, now that you know how one layer of a convolutional neural network works, let's stack a bunch of these together to actually form a deeper convolutional neural network. Let's go on to the next video to see how.