0:00

Other than convolutional layers, ConvNets often also use pooling layers to reduce the size of the representation, to speed up computation, and to make some of the features they detect a bit more robust. Let's take a look. Let's go through an example of pooling, and then we'll talk about why you might want to do this.

Suppose you have a four by four input, and you want to apply a type of pooling called max pooling. The output of this particular implementation of max pooling will be a two by two output. The way you do that is quite simple: take your four by four input and break it into different regions, and I'm going to color the four regions as follows. Then, in the output, which is two by two, each of the outputs will just be the max from the correspondingly shaded region.

So in the upper left, the max of those four numbers is nine. In the upper right, the max of the blue numbers is two. In the lower left, the biggest number is six, and in the lower right, the biggest number is three. So to compute each of the numbers on the right, we took the max over a two by two region.

This is as if you applied a filter of size two, because you're taking two by two regions, and you're using a stride of two. These are actually the hyperparameters of max pooling. We start from this filter size, the two by two region that gives you the nine. Then you step over two steps to look at this region, which gives you the two. Then for the next row, you step down two steps to get the six, and step to the right by two steps to get the three. So because the squares are two by two, f is equal to two, and because you stride by two, s is equal to two.
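To make the mechanics concrete, here is a minimal NumPy sketch of this two by two, stride-two max pooling. The input values are my own illustrative choice (an assumption, not from the lecture), picked so the four shaded regions have maxima nine, two, six, and three as in the example:

```python
import numpy as np

# Illustrative 4x4 input (the values are an assumption, chosen so the
# four quadrant maxima are 9, 2, 6, and 3 as in the example).
X = np.array([[1., 3., 2., 1.],
              [2., 9., 1., 1.],
              [1., 3., 2., 3.],
              [6., 6., 1., 2.]])

def max_pool(a, f=2, s=2):
    """Max pooling with filter size f and stride s (no padding)."""
    n_h, n_w = a.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Take the max over one f-by-f region.
            out[i, j] = a[i * s:i * s + f, j * s:j * s + f].max()
    return out

print(max_pool(X))  # [[9. 2.], [6. 3.]]
```

With f equal to two and s equal to two the regions don't overlap, which is why the output is exactly half the height and width of the input.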

So here's the intuition behind what max pooling is doing. If you think of this four by four region as some set of features, the activations in some layer of the neural network, then a large number means that it has maybe detected a particular feature. So the upper left-hand quadrant has this particular feature; it may be a vertical edge, or maybe an eye or a whisker if you're trying to detect a cat. Clearly, that feature exists in the upper left-hand quadrant. Whereas this feature, maybe the cat eye detector, doesn't really exist in the upper right-hand quadrant. So what the max operation does is, if a feature is detected anywhere in one of these quadrants, it remains preserved in the output of max pooling.

So what the max operation really does is say: if this feature is detected anywhere in this filter, then keep a high number. But if this feature is not detected, so maybe this feature doesn't exist in the upper right-hand quadrant, then the max of all those numbers is still quite small. So maybe that's the intuition behind max pooling. But I have to admit, I think the main reason people use max pooling is that it's been found in a lot of experiments to work well, and the intuition I just described, despite being often cited, I don't know that anyone fully knows whether that's the real underlying reason that max pooling works well in ConvNets.

One interesting property of max pooling is that it has a set of hyperparameters but no parameters to learn. There's actually nothing for gradient descent to learn. Once you fix f and s, it's just a fixed computation, and gradient descent doesn't change anything.

Let's go through an example with some different hyperparameters. Here, suppose you have a five by five input, and we're going to apply max pooling with a filter size that's three by three, so f is equal to three, and let's use a stride of one. In this case, the output size is going to be three by three. The formula we developed in the previous videos for figuring out the output size of a conv layer, (n + 2p - f)/s + 1, also works for figuring out the output size of max pooling.
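As a quick sanity check of that formula, here's a small sketch (the function name is my own, not from the lecture; integer division plays the role of the floor):

```python
def pool_output_size(n, f, s, p=0):
    # floor((n + 2p - f) / s) + 1, the same formula as for conv layers.
    return (n + 2 * p - f) // s + 1

print(pool_output_size(5, f=3, s=1))  # 3  (the five by five example)
print(pool_output_size(4, f=2, s=2))  # 2  (the four by four example)
```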

But in this example, let's compute each of the elements of this three by three output. For the upper left-hand element, we look over that region. Notice this is a three by three region, because the filter size is three, and we take the max there. So that will be nine. Then we shift over by one, because we're using a stride of one, so the max in the blue box is nine. Shift it over again, and the max of the blue box is five. Then let's go on to the next row with a stride of one, so we're just stepping down by one step. The max in that region is nine, the max in that region is nine, and the max in that region, now with the two fives, is five. And then finally, the max in the next region is eight, the max in the one after is six, and the max in the last region is [inaudible]. Okay, so this set of hyperparameters, f equals three, s equals one, gives that output.

Now, so far I've shown max pooling on a 2D input. If you have a 3D input, then the output will have the same number of channels. So for example, if you have five by five by two, then the output will be three by three by two, and the way you compute max pooling is you perform the computation we just described on each of the channels independently. So the first channel, which is shown here on top, is still the same, and then for the second channel, the one I just drew at the bottom, you do the same computation on that slice of the volume, and that gives you the second slice. And more generally, if this were five by five by some number of channels, the output would be three by three by that same number of channels. The max pooling computation is done independently on each of the N_C channels.
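A sketch of how that per-channel computation might look in NumPy (a minimal illustration, assuming no padding):

```python
import numpy as np

def max_pool_3d(a, f, s):
    """Max pooling on an (n_H, n_W, n_C) volume: each channel is pooled
    independently, so the channel count is unchanged."""
    n_h, n_w, n_c = a.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w, n_c))
    for c in range(n_c):  # each channel independently
        for i in range(out_h):
            for j in range(out_w):
                out[i, j, c] = a[i * s:i * s + f, j * s:j * s + f, c].max()
    return out

A = np.random.randn(5, 5, 2)
print(max_pool_3d(A, f=3, s=1).shape)  # (3, 3, 2)
```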

So, that's max pooling. There is one other type of pooling that isn't used very often, but I'll mention it briefly, which is average pooling. It does pretty much what you'd expect, which is, instead of taking the max within each filter, you take the average. So in this example, the average of the numbers in purple is 3.75, then there is 1.25, and four and two. And so this is average pooling with hyperparameters f equals two, s equals two; we can choose other hyperparameters as well.
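Average pooling only swaps the max for a mean. A minimal sketch, with the same caveat that the input values are my own assumption, chosen so the four region averages come out to the 3.75, 1.25, 4, and 2 mentioned in the example:

```python
import numpy as np

# Assumed illustrative 4x4 input (quadrant averages: 3.75, 1.25, 4, 2).
X = np.array([[1., 3., 2., 1.],
              [2., 9., 1., 1.],
              [1., 3., 2., 3.],
              [6., 6., 1., 2.]])

def avg_pool(a, f=2, s=2):
    """Average pooling with filter size f and stride s (no padding)."""
    out_h = (a.shape[0] - f) // s + 1
    out_w = (a.shape[1] - f) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Average over one f-by-f region instead of taking the max.
            out[i, j] = a[i * s:i * s + f, j * s:j * s + f].mean()
    return out

print(avg_pool(X))  # [[3.75 1.25], [4. 2.]]
```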

So these days, max pooling is used much more often than average pooling, with one exception, which is sometimes very deep in a neural network. You might use average pooling to collapse your representation from, say, 7 by 7 by 1,000; averaging over all the spatial positions, you get 1 by 1 by 1,000. We'll see an example of this later. But you see max pooling used much more in neural networks than average pooling.
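That deep-in-the-network use of average pooling, collapsing 7 by 7 by 1,000 down to 1 by 1 by 1,000, amounts to averaging each channel over all its spatial positions:

```python
import numpy as np

A = np.random.randn(7, 7, 1000)              # an activation volume
pooled = A.mean(axis=(0, 1), keepdims=True)  # average over height and width
print(pooled.shape)  # (1, 1, 1000)
```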

So just to summarize, the hyperparameters for pooling are f, the filter size, and s, the stride. A common choice of hyperparameters is f equals two, s equals two; this is used quite often, and it has the effect of shrinking the height and width of the representation by a factor of two. I've also seen f equals three, s equals two used, and then the other hyperparameter is just like a binary bit that says whether you're using max pooling or average pooling.

If you want, you can add an extra hyperparameter for the padding, although this is very, very rarely used. When you do max pooling, you usually do not use any padding, although there is one exception that we'll see next week. But for the most part, max pooling does not use any padding, so the most common value of p by far is p equals zero.

And as for the shapes: max pooling takes an input volume of size N_H by N_W by N_C, and it outputs a volume of size, assuming no padding, (N_H - f)/s + 1 by (N_W - f)/s + 1 by N_C. So the number of input channels is equal to the number of output channels, because pooling applies to each of your channels independently.
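Putting those shapes together in one place (a small helper of my own, assuming integer division plays the role of the floor):

```python
def pool_output_shape(n_h, n_w, n_c, f, s, p=0):
    # Height and width shrink by the usual formula; the channel count
    # passes through untouched, since pooling acts per channel.
    out_h = (n_h + 2 * p - f) // s + 1
    out_w = (n_w + 2 * p - f) // s + 1
    return (out_h, out_w, n_c)

print(pool_output_shape(5, 5, 2, f=3, s=1))  # (3, 3, 2)
print(pool_output_shape(4, 4, 1, f=2, s=2))  # (2, 2, 1)
```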

One thing to note about pooling is that there are no parameters to learn. So when you implement backprop, you'll find that there are no parameters for backprop to adapt through max pooling. Instead, there are just these hyperparameters that you set once, maybe by hand or using cross-validation, and beyond that, you are done. It's just a fixed function that the neural network computes in one of the layers, and there is actually nothing to learn; it's just a fixed function.

So, that's it for pooling. You now know how to build convolutional layers and pooling layers. In the next video, let's see a more complex example of a ConvNet, one that will also allow us to introduce fully connected layers.