[SOUND] In this video we will talk about a trick called residual learning that allows building deeper models in vision. So why do we need deeper models in vision? Because practice shows that they achieve better results in visual recognition. The motivation for having a deeper model is that a deep internal representation is able to capture the hierarchy of features that exists in the real world. Over the years, deeper models have achieved greater success in the ImageNet Large Scale Visual Recognition Challenge, with depth growing from 8 layers to 20 and even more. Residual learning allows stacking more layers without any significant loss in performance.

When deeper networks are able to start converging, a degradation problem is exposed. As the network depth increases, accuracy first gets saturated, which might not be that surprising, but then it degrades rapidly. Unexpectedly, this degradation is not caused by overfitting, so we cannot fix it with regularization: adding more layers to a suitably deep model leads to higher training error. That is what happens when we just try to stack more convolutional layers on top of each other. However, we can tackle this distressing problem with a deep residual learning framework.

Recall that eventually we want a network to fit some mapping from the pixel space to the space of labels, decomposed into smaller mappings implemented by layers. Instead of hoping that each stack of a few layers directly fits a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping F(x) = H(x) - x, so that the original mapping is recast into F(x) + x. The hypothesis is that it is easier to optimize the residual mapping than to optimize the original mapping. In the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers. A code sketch of such a residual block is given below.

Residual connections can be incorporated not only into plain architectures such as VGG, where a shortcut skips from one stack of convolutional layers to the next, but also into more sophisticated convolutional blocks. One example is the Inception architecture, which consists of Inception blocks and has been shown to achieve very good performance at a relatively low computational cost. The introduction of residual connections in conjunction with a more traditional architecture yielded state-of-the-art performance in the 2015 Large Scale Visual Recognition Challenge, and its performance was similar to that of the latest-generation Inception network. This raises the question of whether there really are any benefits in combining the Inception architecture with residual connections. In fact, there are: when residual connections were introduced in conjunction with Inception-v4, they yielded a new state of the art in the following year's 2016 Large Scale Visual Recognition Challenge. The resulting network is code-named Inception-ResNet-v2, and it is among the most advanced convolutional architectures for vision. Still, there are further refinements of convolutional architectures that people have been looking at.
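To make the residual idea concrete, here is a minimal sketch of a residual block. PyTorch and the specific channel counts are my own illustrative assumptions, not code from this course or from any particular paper.

```python
# A minimal residual block sketch: the stacked layers learn F(x) = H(x) - x,
# and the shortcut adds x back, so the block outputs F(x) + x.
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Two 3x3 convolutions implement the residual function F(x).
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))
        residual = self.bn2(self.conv2(residual))
        # The shortcut: H(x) is recast as F(x) + x. If F is pushed to zero,
        # the block reduces to an identity mapping.
        return self.relu(residual + x)


# Example: the block keeps spatial size and channel count unchanged.
block = ResidualBlock(channels=64)
out = block(torch.randn(1, 64, 32, 32))  # -> shape (1, 64, 32, 32)
```

The only change with respect to a plain stack of convolutions is the `residual + x` line: if the optimal transformation is close to the identity, the convolutions only have to drive F(x) towards zero.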
If you look, for example, at the mathematical expressions for convolution and p-norm sub-sampling, which reduces to max-pooling for p equal to infinity, you will see that their forms are effectively equivalent. One can therefore ask whether and why any special layers such as max-pooling really need to be introduced into the network in the first place. A complete answer to this question is not easy to give. Nevertheless, there are, in general, three possible explanations for why pooling can help in CNNs. The first is that the p-norm is capable of making the representation in a convolutional neural network more invariant. The second is that the spatial dimensionality reduction performed by pooling allows higher layers to cover larger parts of the input. The third is that the feature-wise nature of the pooling operation, as opposed to convolutional layers, where features get mixed, could make optimization easier.

Research has been conducted in which the authors sought to replace max-pooling with a convolution of stride greater than one. It turned out that removing complicated activations, response normalization and max-pooling resulted in an all-convolutional architecture that was just as effective as traditional architectures that extensively used max-pooling after each convolutional layer. This suggests that effective convolutional architectures can be implemented without seemingly redundant layers such as max-pooling at all; this replacement is sketched in the code below.

Another attempt to extend established models for computer vision was the stochastic depth training algorithm. Deep networks with stochastic depth are based on the seemingly contradictory insight that, ideally, we would like a deep network at test time, because it would perform better, but a short network during training, because it would train faster and be less prone to the vanishing gradient problem in both the forward and backward directions of computation. Therefore, we can create a deep residual network with as many as 1,000 layers, which means enough capacity for fitting arbitrarily complex functions, but during training we randomly remove about one quarter of its layers, independently for each mini-batch of training examples. That gives us a lower expected depth of the network during training and, consequently, a 25% training speed-up and a 25% relative improvement in error rate, as was shown in experiments. This makes the stochastic depth procedure suitable for deep networks with an extremely large number of layers, such as a 1,000-layer ResNet; a sketch of this training-time trick is also given below.

In fact, convolutional networks can be substantially deeper, more accurate and more efficient to train if they contain shorter connections between layers close to the input and those close to the output. The densely connected convolutional network, or DenseNet, connects each layer to every other layer in a feed-forward fashion. Whereas a traditional convolutional network with L layers has L connections, one between each layer and its subsequent layer, DenseNet has L(L + 1) / 2 direct connections, one between every pair of layers. DenseNet is parameter-efficient, which is a possibly counterintuitive effect of the dense connectivity pattern: in fact, it requires fewer parameters than traditional convolutional networks, as there is no need to relearn redundant feature maps. A sketch of a dense block is given below as well.
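As an illustration of the all-convolutional idea, here is a small sketch, again with PyTorch and arbitrary channel counts as assumptions, showing that a convolution with stride two produces a feature map of the same spatial size as a convolution followed by 2x2 max-pooling, so the separate pooling layer can simply be dropped.

```python
# Replacing max-pooling with a strided convolution (illustrative sketch only).
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)  # a dummy feature map

# Traditional down-sampling: 3x3 convolution followed by 2x2 max-pooling.
conv_then_pool = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

# All-convolutional alternative: the same convolution with stride 2 performs
# the spatial reduction itself, so no separate pooling layer is needed.
strided_conv = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
)

print(conv_then_pool(x).shape)  # torch.Size([1, 64, 16, 16])
print(strided_conv(x).shape)    # torch.Size([1, 64, 16, 16])
```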
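The next sketch is a rough simplification of stochastic depth, not the authors' implementation: during training each residual block is skipped at random, which lowers the expected depth of the network, while at test time every block is kept and its residual branch is scaled by its survival probability. The survival probability of 0.75 below is only meant to mirror the "remove about one quarter of the layers" intuition.

```python
# A simplified stochastic-depth residual block (illustrative sketch only).
import torch
import torch.nn as nn


class StochasticDepthBlock(nn.Module):
    def __init__(self, channels, survival_prob=0.75):
        super().__init__()
        self.survival_prob = survival_prob
        self.f = nn.Sequential(  # the residual function F(x)
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        if self.training:
            # With probability 1 - survival_prob the whole block is skipped
            # for this mini-batch, leaving only the identity shortcut.
            if torch.rand(1).item() > self.survival_prob:
                return x
            return torch.relu(x + self.f(x))
        # At test time the full-depth network is used, with F(x) scaled by
        # its survival probability to match the training-time expectation.
        return torch.relu(x + self.survival_prob * self.f(x))
```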
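Finally, here is a minimal sketch of a dense block in the DenseNet spirit, again with PyTorch and illustrative sizes as assumptions: each layer receives the concatenation of all preceding feature maps and contributes only a small number of new feature maps of its own, the growth rate.

```python
# A minimal dense block sketch: every layer sees all earlier feature maps.
import torch
import torch.nn as nn


class DenseBlock(nn.Module):
    def __init__(self, in_channels, num_layers=4, growth_rate=12):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            # Each layer takes everything produced so far and adds only
            # `growth_rate` new feature maps.
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                          kernel_size=3, padding=1),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Concatenate all previous feature maps along the channel axis.
            out = layer(torch.cat(features, dim=1))
            features.append(out)
        return torch.cat(features, dim=1)


block = DenseBlock(in_channels=16, num_layers=4, growth_rate=12)
y = block(torch.randn(1, 16, 32, 32))  # -> shape (1, 16 + 4 * 12, 32, 32)
```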
In fact, each convolutional layer in a dense block adds as few as 12 new feature maps, which are simply concatenated to the previous feature maps. Besides better parameter efficiency, one big advantage of DenseNets is their improved flow of information and gradients throughout the network, which makes them easy to train.

To summarize this fragment: residual connections help backpropagate errors in very deep networks, leading to better generalization. Some research has shown that we don't really need pooling or max-pooling operations in a network if we set the stride of the convolutional layers to be greater than one, so max-pooling doesn't always improve the performance of convolutional neural networks. We can use procedures such as stochastic depth to train very deep networks, because the expected depth of the network is reduced during training while the full depth is maintained at inference. Finally, connectivity patterns such as those introduced by DenseNet are one way to build parameter-efficient architectures for recognition while maintaining or even improving accuracy and training speed. [SOUND]