In this video, I will talk about how we can reduce the problem of object detection to image classification. Image classification gives us an accurate way of classifying images. For example, you can train a pedestrian classifier to say whether a particular image contains a pedestrian or not. But how can you say where exactly the pedestrian is in the image, and how many pedestrians there are in the image?
The main technique for reducing detection to image classification is the sliding window. Consider a fixed-size rectangular window. If its size is chosen correctly, a pedestrian will occupy most of the window. You can look through all possible regions by scanning the image left to right, top to bottom with this window. You classify each window independently using the "is pedestrian?" classifier, and we mark all windows for which the classifier has said yes.
There are several inherent problems with sliding windows. How do we handle various object sizes? How do we handle various aspect ratios of objects? A lot of windows will have a good overlap with the same object, so for each object we will get multiple positive responses. What do we do if an object is partially overlapped by other objects, or its shape is poorly approximated by a rectangular window? In this last case, the classifier should be really powerful to detect partially overlapped objects, or objects that occupy only a small portion of the window.
One way to solve the problem of object size is to use several windows of different sizes and scan the image several times with the different windows. Alternatively, we can downscale the image several times and create a multi-scale pyramid. Then we can scan all scales with a window of the same size. Depending on the generalization ability of the classifier, the number of scales can differ greatly. Until recently, quite a lot of scales were needed to achieve reliable detection. With powerful deep learning techniques, the number of scales has been reduced significantly.
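To make the scanning procedure concrete, here is a minimal Python sketch of a sliding window run over a multi-scale pyramid. It is an illustration under stated assumptions, not a reference implementation: the image is a plain list of rows of grayscale values, the pyramid simply halves the resolution at each level, and `mean_brightness_classifier` is a hypothetical stand-in for a trained "is pedestrian?" classifier. All function and parameter names are made up for this example.

```python
def sliding_windows(width, height, win_w, win_h, stride):
    """Yield top-left corners (x, y), scanning left to right, top to bottom."""
    for y in range(0, height - win_h + 1, stride):
        for x in range(0, width - win_w + 1, stride):
            yield x, y


def downscale_by_two(image):
    """Halve the resolution by averaging each 2x2 block of pixels."""
    h, w = len(image) // 2, len(image[0]) // 2
    return [
        [
            (image[2 * y][2 * x] + image[2 * y][2 * x + 1]
             + image[2 * y + 1][2 * x] + image[2 * y + 1][2 * x + 1]) / 4.0
            for x in range(w)
        ]
        for y in range(h)
    ]


def build_pyramid(image, win_w, win_h):
    """Return [(scale, image), ...] until the window no longer fits."""
    pyramid = []
    scale = 1.0
    while len(image) >= win_h and len(image[0]) >= win_w:
        pyramid.append((scale, image))
        image = downscale_by_two(image)
        scale *= 0.5
    return pyramid


def mean_brightness_classifier(patch):
    """Hypothetical stand-in for a real classifier: mean pixel value as score."""
    total = sum(sum(row) for row in patch)
    count = sum(len(row) for row in patch)
    return total / count


def detect_multiscale(image, classify, win_w, win_h, stride, threshold):
    """Classify every window at every scale; return boxes in original coordinates."""
    detections = []
    for scale, level in build_pyramid(image, win_w, win_h):
        for x, y in sliding_windows(len(level[0]), len(level), win_w, win_h, stride):
            patch = [row[x:x + win_w] for row in level[y:y + win_h]]
            score = classify(patch)
            if score >= threshold:
                # Dividing by the scale maps the window back to the original image.
                detections.append((x / scale, y / scale,
                                   win_w / scale, win_h / scale, score))
    return detections


# Toy 16x16 image: a bright 8x8 square on a dark background.
image = [[1 if 4 <= x < 12 and 4 <= y < 12 else 0 for x in range(16)]
         for y in range(16)]
detections = detect_multiscale(image, mean_brightness_classifier,
                               win_w=4, win_h=4, stride=2, threshold=0.9)
```

On this toy image the bright square is found by several overlapping windows at full resolution and by one window at half resolution, which illustrates why a single object produces multiple positive responses.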
If the object aspect ratio can differ significantly, as for frontal and profile views of a dog, we need to use windows of different aspect ratios. So the number of images to be scanned equals the product of the number of scales and the number of aspect ratios.
The score map for a specific scale can contain multiple responses of various strength. To obtain the final detections, we usually select the points of local maxima. This is non-maximum suppression, similar to how we detect edges in image gradient maps. In multi-scale detection, we should select 3D local maxima. Of course, elaborate learning-based techniques exist to perform non-maximum suppression in detectors.
Modern detectors are very powerful and can reliably detect a lot of single objects. Detector failures happen mostly for overlapping objects or for small objects.
In conclusion, I can say that currently the sliding window is the main approach to object detection. Multiple scales and aspect ratios are handled either by search windows of different sizes and aspect ratios, or by scaling the image and creating a multi-scale image pyramid.
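The selection of local maxima in a score map, mentioned above as non-maximum suppression, can be sketched in a few lines of Python. This is a minimal illustration for a single scale only, under assumptions of my own: the score map is a list of rows of classifier scores, the names `local_maxima`, `threshold` and `radius` are invented for this example, and ties are broken by scan order. For multi-scale detection, the same comparison would also include neighbours at adjacent scales.

```python
def local_maxima(score_map, threshold=0.5, radius=1):
    """Return [(x, y, score)] for cells that dominate their neighbourhood.

    A cell is kept if its score reaches the threshold and no neighbour within
    `radius` has a higher score. Equal scores are resolved in favour of the
    cell that comes first in scan order, so a flat peak yields one detection.
    """
    height, width = len(score_map), len(score_map[0])
    peaks = []
    for y in range(height):
        for x in range(width):
            score = score_map[y][x]
            if score < threshold:
                continue  # too weak to count as a detection at all
            suppressed = False
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    if (ny, nx) == (y, x):
                        continue
                    other = score_map[ny][nx]
                    if other > score or (other == score and (ny, nx) < (y, x)):
                        suppressed = True
            if not suppressed:
                peaks.append((x, y, score))
    return peaks


# Toy score map with two clusters of positive responses.
score_map = [
    [0.1, 0.2, 0.1, 0.0, 0.0],
    [0.2, 0.9, 0.6, 0.0, 0.0],
    [0.1, 0.6, 0.3, 0.0, 0.7],
    [0.0, 0.0, 0.0, 0.7, 0.8],
]
peaks = local_maxima(score_map)
```

Each cluster of responses collapses to its strongest cell, so every object is reported once instead of once per overlapping window.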