In this video I will talk about the application of deep learning to optical flow estimation. The straightforward way is to create a neural network that takes two frames as input and produces an optical flow map as output, simple as that. This approach was successfully proposed in the FlowNet paper in 2015. Two versions of the FlowNet architecture were proposed. In the first version (FlowNetS), the two color frames are concatenated into one image with six channels instead of three. This image is then passed to a fully convolutional network. In the second version (FlowNetC), a Siamese network is used. The two frames are separately passed through convolutional layers for feature computation. Those features are then passed to a special correlation layer which implements patch comparison. It works similarly to convolution, but without learnable parameters. In this layer, the dot product between the feature vector for a pixel in the first image and the feature vector for a pixel in the second image is computed. The results are then passed to a series of convolutional layers to estimate optical flow. Obviously, current optical flow datasets provide insufficient data to train such a complex model from scratch. So the authors of FlowNet solved this problem by creating a new synthetic dataset called Flying Chairs. This dataset is rather simple compared to all previous ones: various chairs fly on top of a background. The scenes look very unrealistic, but this setup provides a way to generate a lot of data. And it proves to be enough for training FlowNet to demonstrate accuracy similar to the top optical flow estimation methods. As expected, FlowNet outperforms other existing methods on the test part of the Flying Chairs dataset, but its results are lower on other datasets. However, the validity of this simple approach is proven in this work. Probably, if enough good training data becomes available someday, this approach will yield a flow estimation method that is both fast and accurate.
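To make the correlation layer concrete, here is a minimal NumPy sketch of the idea: for every pixel of the first feature map, compute the dot product with the second feature map at every displacement within a small window. This is only an illustrative, unoptimized version; the function name, the single-pixel (rather than patch) comparison, and the `max_disp` parameter are my simplifications, not the exact FlowNet implementation.

```python
import numpy as np

def correlation_layer(feat1, feat2, max_disp=2):
    """Compare features of the two frames, in the spirit of FlowNetC.

    feat1, feat2: (H, W, C) feature maps from the shared Siamese branch.
    For every pixel of feat1, compute the dot product with feat2 at all
    displacements in [-max_disp, max_disp] along both axes.
    Returns an (H, W, (2*max_disp + 1)**2) cost volume.
    Note: no learnable parameters anywhere in this layer.
    """
    H, W, C = feat1.shape
    d = 2 * max_disp + 1
    out = np.zeros((H, W, d * d), dtype=feat1.dtype)
    # zero-pad the second map so every displacement stays in bounds
    padded = np.pad(feat2, ((max_disp, max_disp), (max_disp, max_disp), (0, 0)))
    k = 0
    for dy in range(d):
        for dx in range(d):
            shifted = padded[dy:dy + H, dx:dx + W, :]
            out[:, :, k] = np.sum(feat1 * shifted, axis=2)  # per-pixel dot product
            k += 1
    return out
```

The channel index of the output encodes the displacement, so the subsequent convolutional layers can read the cost volume like an ordinary multi-channel feature map.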
If we can't get a good solution with one neural network model for the whole problem, then we can integrate convolutional neural network models into more elaborate frameworks. As I have mentioned previously, optical flow estimation is essentially a pixel correspondence estimation problem, or a matching problem. And CNN models have been demonstrated to be a powerful method for pixel matching. In "Computing the Stereo Matching Cost with a Convolutional Neural Network", a CNN model has been proposed for dense stereo matching. Nine-by-nine pixel windows are fed into a Siamese network for comparison. Due to the small window size, we can collect a lot of training examples from each pair of images. We can create a positive matching pair from a window around a pixel in the first image and a window around a pixel in the second image which is close to the true correspondence. Any pixel in the second image which is farther from the correspondence gives us a negative matching pair. The dense matching of the original input images can be broken down into a set of simpler problems with multi-scale matching. First, we downscale the input images. Then we apply dense matching to the downscaled versions of the input images, which is a simpler problem: for the downscaled images, pixel displacements between the images are shorter, so the search space is smaller. The number of pixels in the downscaled images is also much smaller, so we can apply complicated matching procedures that require global optimization methods. The optical flow from the downscaled images can then be used as a starting point for optical flow estimation on the original input images, which is much simpler than trying to estimate optical flow from the original images directly. Many recent state-of-the-art optical flow estimation methods use this approach. For example, in EpicFlow, edge-preserving guided interpolation is applied to sparse matches between images. As of July 2017, the best optical flow estimation methods combine pixel matching by a CNN model with guided upscaling.
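The coarse-to-fine idea described above can be sketched in a few lines. This is a generic pyramid skeleton under my own assumptions, not any particular paper's code: `match_fn` stands in for whatever matcher is used at each level (a CNN-based one in the methods discussed here), and it is assumed to search only a small window around the initial flow.

```python
import numpy as np

def downscale(img, factor=2):
    # simple box-filter downscaling; assumes H and W are divisible by factor
    H, W = img.shape
    return img.reshape(H // factor, factor, W // factor, factor).mean(axis=(1, 3))

def coarse_to_fine_flow(img1, img2, match_fn, levels=3):
    """Coarse-to-fine matching sketch.

    match_fn(img1, img2, init_flow) -> flow of shape (H, W, 2); it is assumed
    to refine init_flow within a small search window. Matching starts at the
    coarsest level, where displacements are short, and each result seeds the
    next finer level.
    """
    pyr1, pyr2 = [img1], [img2]
    for _ in range(levels - 1):
        pyr1.append(downscale(pyr1[-1]))
        pyr2.append(downscale(pyr2[-1]))
    flow = np.zeros(pyr1[-1].shape + (2,))  # start from zero flow at the top
    for a, b in zip(reversed(pyr1), reversed(pyr2)):
        if flow.shape[:2] != a.shape:
            # upsample the flow to the finer level; displacements double
            flow = 2 * np.repeat(np.repeat(flow, 2, axis=0), 2, axis=1)
        flow = match_fn(a, b, flow)
    return flow
```

Note the factor of two when upsampling: a displacement of one pixel at a coarse level corresponds to two pixels at the next finer level, which is exactly why the search space shrinks at coarse scales.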
In this method, pixels are first mapped to feature vectors by a CNN model. This model is similar to the model for stereo matching which I have described previously. The authors have demonstrated that for reasonably high quality it is enough to map a pixel to only a ten-dimensional vector. Due to the small size of each vector, you can directly compute the matching cost volume by the dot product between the normalized feature vector of a pixel in the first image and the feature vectors at all possible displacements in the second image. From this cost volume, the matches are extracted with semi-global minimization, and then guided upscaling is applied. As can be seen from the example in this slide, the semi-dense matches obtained from the cost volume are already close to the ground truth data. Guided upscaling of the optical flow map can be performed by EpicFlow, but in this paper the authors have additionally used homography-based interpolation. For this, the image is first segmented, and a homography transformation is fit to the semi-dense matches in each segment. If the homography fits the pixels of the segment well, then the segment has a simple shape and upscaling is performed with this homography. Otherwise, EpicFlow is used for the segment. In the slide, a comparison between EpicFlow interpolation and the new interpolation is demonstrated. Experiments show that the new interpolation improves the optical flow; see the parts of the error maps near the lower border. As a result, the described approach outperforms existing state-of-the-art methods in both speed and accuracy. A number of further state-of-the-art methods also rely on pixel matching by CNN models on downscaled images, followed by guided upscaling.
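The per-segment homography test can be illustrated as follows. This is a generic sketch, not the paper's actual procedure: it fits a homography to a segment's matches with the standard direct linear transform and then accepts or rejects it by reprojection error. The function names and the `tol` threshold are my own illustrative choices.

```python
import numpy as np

def fit_homography(src, dst):
    """Fit a 3x3 homography mapping src -> dst (N >= 4 point pairs)
    via the direct linear transform, solved with SVD."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    return Vt[-1].reshape(3, 3)  # null-space vector, reshaped to 3x3

def homography_fits_well(src, dst, H, tol=1.0):
    """Decide whether a segment can be upscaled with H alone:
    reproject src through H and check the worst pixel error."""
    pts = np.hstack([np.asarray(src, float), np.ones((len(src), 1))])
    proj = pts @ H.T
    proj = proj[:, :2] / proj[:, 2:3]  # back from homogeneous coordinates
    return np.max(np.linalg.norm(proj - np.asarray(dst, float), axis=1)) < tol
```

Segments where the check fails are the ones with non-planar or complex motion, and those fall back to the edge-preserving interpolation of EpicFlow.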