0:00

When designing a layer for a ConvNet, you might have to pick,

do you want a 1 by 1 filter,

or 3 by 3, or 5 by 5,

or do you want a pooling layer?

What the inception network does is it says,

why not do them all?

And this makes the network architecture more complicated,

but it also works remarkably well.

Let's see how this works.

Let's say for the sake of example that you have as input a

28 by 28 by 192 dimensional volume.

So what the inception network, or an inception layer, says is,

instead of choosing what filter size you want in a Conv layer,

or even whether you want a convolutional layer or a pooling layer,

let's do them all. So what if you use a 1 by 1 convolution,

and that will output a 28 by 28 by something.

Let's say 28 by 28 by 64 output,

and you just have a volume there.

But maybe you also want to try a 3 by 3, and that might output a 28 by 28 by 128.

And then what you do is just stack up this second volume next to the first volume.

And to make the dimensions match up,

let's make this a same convolution.

So the output dimension is still 28 by 28,

same as the input dimension in terms of height and width.

But 28 by 28 by, in this example, 128.

And maybe you might say, well, I want to hedge my bets.

Maybe a 5 by 5 filter works better.

So let's do that too and have that output a 28 by 28 by 32.

And again you use the same convolution to keep the dimensions the same.

And maybe you don't want a convolutional layer.

Let's apply pooling, and that has some other output, and let's stack that up as well.

And here pooling outputs 28 by 28 by 32.

Now in order to make all the dimensions match,

you actually need to use padding for max pooling.

So this is an unusual form of pooling, because if you want

the input height and width of 28 by 28,

and want the output to match that, 28 by 28,

then you need to use same padding, as well as a stride of one, for pooling.

So this detail might seem a bit funny to you now,

but let's keep going,

and we'll make this all work later.
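The padding detail follows from the standard output-size formula for convolutions and pooling. Here is a quick pure-Python sketch (the function name is my own) checking that same padding with stride 1 keeps the 28 by 28 spatial size:

```python
def output_size(n, f, p, s):
    # Standard conv/pool output formula: floor((n + 2p - f) / s) + 1
    return (n + 2 * p - f) // s + 1

# "Same" padding for an odd filter size f is p = (f - 1) // 2.
# With stride 1, this keeps the spatial dimension unchanged, which is
# why the pooling branch needs same padding and a stride of one.
for f in (3, 5):
    p = (f - 1) // 2
    print(output_size(28, f, p, s=1))  # -> 28 in both cases
```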

But with an inception module like this,

you can input some volume and get an output.

In this case, I guess, if you add up all these numbers,

32 plus 32 plus 128 plus 64,

that's equal to 256.

So you will have one inception module input 28 by 28 by 192,

and output 28 by 28 by 256.
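As a sanity check on the channel arithmetic, here is a small sketch (the variable names are my own) of how the four branch outputs combine:

```python
# Channel bookkeeping for the inception module in this example: every branch
# keeps the 28 x 28 spatial size (same convolutions, padded pooling), so the
# outputs can be stacked, i.e. concatenated, along the channel dimension.
branch_channels = {
    "conv_1x1": 64,
    "conv_3x3": 128,
    "conv_5x5": 32,
    "max_pool": 32,
}

total_channels = sum(branch_channels.values())
print(total_channels)  # -> 256, so the module outputs 28 x 28 x 256
```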

And this is the heart of the inception network, which is due

to Christian Szegedy, Wei Liu,

Yangqing Jia, Pierre Sermanet,

Scott Reed, Dragomir Anguelov, Dumitru Erhan,

Vincent Vanhoucke and Andrew Rabinovich.

And the basic idea is that instead of

needing to pick one of these filter sizes or pooling layers and committing to that,

you can do them all and just concatenate all the outputs,

and let the network learn whatever parameters it wants to use,

whatever combinations of these filter sizes it wants.

Now it turns out that there is a problem

with the inception layer as we've described it here,

which is computational cost.

On the next slide,

let's figure out what's the computational cost of this 5 by 5 filter resulting

in this block over here.

So just focusing on the 5 by 5 part on the previous slide,

we had as input a 28 by 28 by 192 block,

and you implement a 5 by 5 same convolution with 32 filters to output 28 by 28 by 32.

On the previous slide I had drawn this as a thin purple slice.

So I'm just going to draw this as a more normal looking blue block here.

So let's look at the computational cost of outputting this 28 by 28 by 32.

So you have 32 filters, because the output has 32 channels,

and each filter is going to be 5 by 5 by 192.

And so the output size is 28 by 28 by 32,

and so you need to compute 28 by 28 by 32 numbers.

And for each of them you need to do these many multiplications, right?

5 by 5 by 192.

So the total number of multiplies you need

is the number of multiplies you need to compute each

of the output values, times the number of output values you need to compute.

And if you multiply all of these numbers,

this is equal to 120 million.

And so, while you can do 120 million multiplies on a modern computer,

this is still a pretty expensive operation.
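The arithmetic above can be checked in a couple of lines (variable names are my own):

```python
# Multiply count for the direct 5 x 5 same convolution:
# one multiply per filter weight, per output value.
out_values = 28 * 28 * 32          # output volume is 28 x 28 x 32
mults_per_value = 5 * 5 * 192      # each filter is 5 x 5 x 192
total_mults = out_values * mults_per_value
print(total_mults)  # -> 120422400, about 120 million
```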

On the next slide you'll see how, using the idea of 1 by 1 convolutions,

which you learned about in the previous video,

you'll be able to reduce the computational cost by about a factor of 10,

to go from about 120 million multiplies to about one tenth of that.

So please remember the number 120, so you can compare it

with what you see on the next slide, 120 million.

Here is an alternative architecture for inputting 28 by 28 by 192,

and outputting 28 by 28 by 32, which is as follows.

You are going to input the volume,

use a 1 by 1 convolution to reduce the volume to 16 channels instead of 192 channels,

and then on this much smaller volume,

run your 5 by 5 convolution to give you your final output.

So notice the input and output dimensions are still the same.

You input 28 by 28 by 192 and output 28 by 28 by 32,

same as the previous slide.

But what we've done is we've taken this huge volume we had on the left,

and we shrunk it to this much smaller intermediate volume,

which only has 16 instead of 192 channels.

Sometimes this is called a bottleneck layer, right?

Â 6:53

I guess because a bottleneck is usually the smallest part of something, right?

So I guess if you have a glass bottle that looks like this,

then you know this is, I guess, where the cork goes.

And then the bottleneck is the smallest part of this bottle.

So in the same way, the bottleneck layer is the smallest part of this network.

We shrink the representation before increasing the size again.

Now let's look at the computational costs involved.

To apply this 1 by 1 convolution,

we have 16 filters.

Each of the filters is going to be of dimension 1 by 1 by 192;

this 192 matches that 192.

And so the cost of computing this 28 by 28

by 16 volume is going to be, well,

you need these many outputs,

and for each of them you need to do 192 multiplications.

I could have written 1 times 1 times 192, right?

Which is this. And if you multiply this out,

this is about 2.4 million.

So that's the cost of this first convolutional layer.

How about the second?

The cost of this second convolutional layer would be that, well,

you have these many outputs,

so 28 by 28 by 32.

And then for each of the outputs you have to apply a 5 by 5 by 16 dimensional filter.

So that's 28 by 28 by 32, times 5 by 5 by 16.

And if you multiply that out, it's equal to 10 million.

And so the total number of multiplications you need to do is the sum of those,

which is 12.4 million multiplications.

And if you compare this with what we had on the previous slide,

you've reduced the computational cost from about 120 million multiplies,

down to about one tenth of that,

to 12.4 million multiplications.

And the number of additions you need to do is

very similar to the number of multiplications you need to do.

So that's why I'm just counting the number of multiplications.
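The two-stage multiply count works out like this (variable names are my own):

```python
# Multiply counts for the bottleneck version of the same computation:
# outputs to compute, times multiplies per output.
cost_1x1 = (28 * 28 * 16) * (1 * 1 * 192)  # stage 1: 1x1 conv, 192 -> 16 channels
cost_5x5 = (28 * 28 * 32) * (5 * 5 * 16)   # stage 2: 5x5 conv, 16 -> 32 channels

print(cost_1x1)              # -> 2408448, about 2.4 million
print(cost_5x5)              # -> 10035200, about 10 million
print(cost_1x1 + cost_5x5)   # -> 12443648, about 12.4 million
```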

So to summarize, if you are building a layer

of a neural network and you don't want to have to decide,

do you want a 1 by 1,

or 3 by 3, or 5 by 5, or a pooling layer,

the inception module lets you say, let's do them all,

and let's concatenate the results.

And then we ran into the problem of computational cost.

And what you saw here was how, using a 1 by 1 convolution,

you can create this bottleneck layer,

thereby reducing the computational cost significantly.

Now you might be wondering,

does shrinking down the representation size so dramatically,

does it hurt the performance of your neural network?

It turns out that so long as you implement this bottleneck layer within reason,

you can shrink down the representation size significantly,

and it doesn't seem to hurt the performance,

but saves you a lot of computation.

So these are the key ideas of the inception module.

Let's put them together, and in

the next video I'll show you what the full inception network looks like.