In this video, I will talk about visual tracking algorithms. There are a lot of different methods. Compared, for example, to object detection, it is difficult to point to similarly prominent tracking methods, such as the [inaudible] detector or the Faster R-CNN scheme in detection. So I have selected several examples to illustrate the key ideas.

Let me remind you that visual trackers are model-free: they don't know anything about the object to track except its bounding box in the first frame, so we cannot train an object detector for this task beforehand. In this regard, the visual tracking problem is similar to image retrieval. In image retrieval, images in a gallery are ranked according to their visual similarity to a probe image. In visual tracking, we search the next frame for a region that is visually similar to the specified region in the first frame.

Object tracking works in an iterative fashion. First, an object model is initialized in the first frame. Object models can be very different, but usually the object is represented by a feature vector that describes its visual appearance, as is done in image retrieval. Then, starting from the position in the previous frame, we search the current frame for a region that is visually similar to the region in the previous frame. This can be done by sampling candidate regions in the neighborhood, computing a feature vector for each candidate region, and computing the distance between each candidate region and the object model using these feature vectors. The sampling of candidate windows is similar to sliding windows in object detection. Measuring the visual similarity between images is similar to image retrieval.

What types of models can be used for object representation in visual tracking? Several types can be identified. The first is an object template, that is, an example or instance of the object of interest. Second, the object can be represented as a set of object parts, fragments, or key points. Third is a vector of appearance features; a color histogram is an example of such an appearance vector.

The most basic example of an object template is a grayscale image of the object instance. You can search for the new location of the template in the next image by scanning the image with the template, as in sliding windows. For each position of the sliding window, we make a per-pixel comparison of the template and the candidate window with an image distance metric. Sum of squared distances and normalized cross-correlation are two examples of such metrics. For example, you can build a very simple TV remote control system by tracking a human palm with template matching. For each candidate location, a score is computed by the image distance metric, and the location with the maximum score value is selected as the position of the object in the current frame.

Visual tracking by template matching with normalized cross-correlation is a baseline visual tracking method. Despite its simplicity, such methods can be efficiently used for short-term tracking in practical applications. For example, we can use this tracking method to track people's faces between detections.
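As a rough illustration of such template tracking, here is a minimal sketch using OpenCV's normalized cross-correlation; the function name, the bounding-box convention, and the search margin are my own illustrative choices, not code from the lecture.

```python
import cv2

def track_by_template(prev_frame, bbox, cur_frame, search_margin=32):
    """Locate the template taken from prev_frame inside a search window of cur_frame.

    bbox is (x, y, w, h) in the previous frame; search_margin restricts the
    sliding-window search to a neighborhood around the previous position.
    """
    x, y, w, h = bbox
    template = prev_frame[y:y + h, x:x + w]

    # Restrict the search to the neighborhood of the previous position.
    x0 = max(0, x - search_margin)
    y0 = max(0, y - search_margin)
    x1 = min(cur_frame.shape[1], x + w + search_margin)
    y1 = min(cur_frame.shape[0], y + h + search_margin)
    search_window = cur_frame[y0:y1, x0:x1]

    # Normalized cross-correlation between the template and every candidate
    # window position inside the search window.
    scores = cv2.matchTemplate(search_window, template, cv2.TM_CCORR_NORMED)
    _, max_score, _, max_loc = cv2.minMaxLoc(scores)

    # The location of the maximum score is the new object position.
    new_x = x0 + max_loc[0]
    new_y = y0 + max_loc[1]
    return (new_x, new_y, w, h), max_score
```

The returned score can also be thresholded as a simple failure check, for example when the object leaves the search window.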
The second example is color-based tracking. Color is a powerful feature: we can track the object of interest using color information if the object's color is different from its immediate surroundings. In some cases, as for soccer players, the color of the players is quite prominent compared to the field, so tracking by color alone could be enough.

To use color as a feature for tracking, you should estimate the likelihood that a color sample belongs to the object or to the background. First, compute the color histogram for the object region inside the bounding box of the object. Second, compute the color histogram for the object's neighborhood. Then, for each color value, compute the likelihood ratio of a pixel belonging to the object versus the background. This is the object-background model. We then apply this model to each pixel in the current frame and either segment a region of high-likelihood pixels, or select the candidate region that maximizes the object likelihood, that is, the sum of the likelihoods of the pixels in this region.

Such color-based object tracking is also regarded as a basic baseline method, and it is usually outperformed by more elaborate methods. However, as shown in the paper "In Defense of Color-based Model-free Tracking", rather simple modifications can significantly improve the performance of color-based trackers.

One of the key problems of color-based tracking is that the object of interest can have an appearance similar to other objects in the scene. The visual tracker tends to drift and can switch to tracking one of these similar objects. For example, the color model of the athlete of interest is similar to that of all the other athletes in the image. Objects of similar appearance are distractors for visual tracking. You can try to handle them explicitly. First, we need to detect the distractors, so we apply the object color model to the current image; regions of high-likelihood pixels are marked as distractors. Then we compute the color histogram for the distractor regions. After that, we can estimate the object likelihood at position x with respect to the object color relative to the distractors. This is the object-distractor model. We combine the object-background model with the object-distractor model in a linear combination. As seen in this example, applying the combined object model yields higher likelihood scores for discriminative object pixels while simultaneously decreasing the impact of distractor regions. The person of interest stands out against the background similarly to the other athletes, but our object-distractor model allows us to discriminate between object pixels and distractor pixels.

Color distributions rely on pixel values to discriminate the target from the background; no location information is preserved. Thus, such features are robust to object shape changes but sensitive to blur and poor illumination. Template models rely on spatial configuration, so they are robust to blur and poor illumination but sensitive to changes in shape. By combining both features in a single model, you can make a tracker robust to both shape change and blur. Staple, or Sum of Template And Pixel-wise LEarners, is an example of such a method. As of July 2017, it is one of the fastest methods among the top-performing ones. Usually, the best-performing methods can process less than one frame per second, and Staple is more than ten times faster than that.
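Returning to the color model described above, here is a minimal sketch of the object-background likelihood model and its linear combination with an object-distractor model; the helper names, the 16-bin quantization, and the 0.5 mixing weight are illustrative assumptions, not values from the lecture or the cited paper.

```python
import numpy as np

def color_histogram(pixels, bins=16):
    """Joint RGB histogram with `bins` levels per channel, normalized to sum to 1.

    `pixels` is an (N, 3) array of uint8 color samples.
    """
    q = (pixels // (256 // bins)).astype(int)
    idx = (q[:, 0] * bins + q[:, 1]) * bins + q[:, 2]
    hist = np.bincount(idx, minlength=bins ** 3).astype(float)
    return hist / max(hist.sum(), 1.0)

def object_background_model(obj_pixels, bg_pixels, bins=16, eps=1e-6):
    """Per-color likelihood ratio h_obj / (h_obj + h_bg) built from the object
    histogram and the surrounding-background histogram."""
    h_obj = color_histogram(obj_pixels, bins)
    h_bg = color_histogram(bg_pixels, bins)
    return (h_obj + eps) / (h_obj + h_bg + 2 * eps)

def pixel_likelihood_map(frame, model, bins=16):
    """Apply a per-color model to every pixel of an (H, W, 3) uint8 frame."""
    q = (frame // (256 // bins)).astype(int)
    idx = (q[..., 0] * bins + q[..., 1]) * bins + q[..., 2]
    return model[idx]

def combined_model(obj_pixels, bg_pixels, distractor_pixels, weight=0.5, bins=16):
    """Linear combination of the object-background and object-distractor models
    (the 0.5 weight is an arbitrary illustrative choice)."""
    m_bg = object_background_model(obj_pixels, bg_pixels, bins)
    m_dist = object_background_model(obj_pixels, distractor_pixels, bins)
    return weight * m_bg + (1.0 - weight) * m_dist
```

In practice, the per-pixel likelihood map produced this way can be summed over candidate boxes to pick the region with the highest object score, as described earlier.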
Now, let's talk about convolutional neural networks. How can we apply neural networks to the visual tracking problem? Because modern CNNs have been introduced only recently, we probably don't know the best approach yet. I have selected two methods to demonstrate the current approaches.

The first can be seen as a natural evolution of visual trackers and object detectors. We train a convolutional neural network as an object-versus-background classifier and apply it to candidate regions sampled in the vicinity of the expected object position in the current frame. The score of the classifier is a measure of visual similarity between a candidate window and the object, and the window with the highest score is the output. The classifier is trained online on the target video and is regularly updated to handle changes of object appearance.

The second approach is generic visual tracking. The idea is to train a convolutional neural network that can regress the new position of any object that is positioned in the center of the previous frame. Such a network takes two images as input: the first is a crop of the previous frame, centered at the object of interest; the second is a crop of the current frame. The output is the new bounding box position in the current frame. Such a network should be very fast because it doesn't need online training.

An example of the first approach is the multi-domain convolutional neural network for visual tracking. The classifier is trained to distinguish object regions from background regions. Clearly, almost anything can be the object of interest, so we can't train this network beforehand for all videos and all types of objects of interest. But it's also impossible to train such a network from scratch based only on the target video. So we divide the neural network into shared and domain-specific layers and treat each video as a domain. We train the shared component offline on all available domains. Then we train the domain-specific component for the target video online, on the target video itself.

The authors of this work have selected the VGG architecture for their net. The first five layers of the net form the shared component. The output of the last shared layer is branched into domain-specific layers; each domain-specific layer is an object-versus-background classifier for a specific video. To improve the localization of the object, a bounding box regressor is also trained for each domain-specific branch. The domain-specific classifier is updated regularly during tracking to handle object appearance changes. Due to the complexity of the regressor, it is trained only once, on the first frame of the video.

The training process is divided into iterations. During each iteration, only one video is considered, so only one domain-specific branch is trained. As positive samples, candidate windows with IoU larger than 0.7 are sampled; as negative samples, candidate windows with IoU lower than 0.3 are sampled. To improve the training, a hard negative mining procedure is applied, the same as for object detectors: the current classifier is applied to the candidate windows, and the highest-scoring windows with low IoU are selected as negative samples. The classifier is then fine-tuned on the updated training dataset.

The tracking algorithm, referred to as MDNet, is the following. First, only the shared component of the net is retained, and a new domain-specific branch is randomly initialized. The bounding box regression model is then trained. Positive and negative samples are extracted from the first frame and form the training dataset. The domain-specific layer and the two fully connected layers are updated on this dataset. Then, for each new frame, a set of target candidate samples is drawn and scored with the net, and the highest-scoring sample is selected. If the score is larger than 0.5, then new positive and negative samples are extracted from this frame and added to the dataset. If the highest score is lower than 0.5, then the model is updated using the current dataset. Every 10 frames, the model is also updated using a long-term dataset. MDNet and its extensions are among the best-performing visual trackers in terms of robustness and accuracy according to the VOT challenge results.
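Before looking at examples, here is a schematic sketch of the candidate-sampling step of such a tracker; the Gaussian sampling parameters and the score_fn placeholder (which stands in for the CNN object-versus-background classifier) are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def track_frame(frame, prev_bbox, score_fn, n_candidates=256, sigma=(10.0, 10.0, 0.1)):
    """Draw candidate boxes around the previous position, score each one with the
    classifier, and return the best box together with its score.

    `score_fn(frame, bbox)` stands in for the CNN classifier; `prev_bbox` is (x, y, w, h).
    """
    x, y, w, h = prev_bbox
    candidates = []
    for _ in range(n_candidates):
        # Gaussian perturbation of position and scale around the previous box.
        dx = np.random.randn() * sigma[0]
        dy = np.random.randn() * sigma[1]
        ds = np.exp(np.random.randn() * sigma[2])
        candidates.append((x + dx, y + dy, w * ds, h * ds))

    scores = [score_fn(frame, box) for box in candidates]
    best = int(np.argmax(scores))

    # Following the schedule described above: harvest new training samples when
    # the best score exceeds 0.5, otherwise fine-tune the classifier on recent
    # samples; a long-term update is additionally performed every 10 frames.
    return candidates[best], scores[best]
```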
Several hard examples from the VOT dataset are shown on this slide. The results of MDNet are marked with red bounding boxes. You can see that it successfully tracks the object of interest in cases where other trackers fail. However, MDNet is very slow, so it can be used only when high accuracy is required and real-time performance is not an issue.

An example of the second approach is Generic Object Tracking Using Regression Networks, or simply GOTURN, which was proposed in the paper "Learning to Track at 100 FPS with Deep Regression Networks". As I have said previously, the idea is to train a convolutional neural network that regresses the new position of any object that is positioned in the center of the previous frame. Such a network is trained using a collection of videos and images with bounding box labels. For the target video, it is applied directly, without fine-tuning on the target video. By avoiding fine-tuning, this network can reach 100 frames per second if a GPU is available.

The architecture of this network is similar to the patch matching networks used for stereo matching and optical flow estimation, but much larger images are taken as input. First, both images are passed through a series of convolutional layers. Then the outputs are concatenated and passed through a set of fully connected layers. The output of the net is the bounding box of the object in the second frame (a rough sketch of such a two-branch network is given at the end of this section). Such a network can be trained using videos and separate images with an annotated object of interest. In the latter case, we have only one image, so to simulate the second frame, we apply a random shift and transformation to the input image. During mini-batch construction, we first randomly select a video, then randomly select a pair of frames. From each pair of frames, several crops are extracted to augment the dataset with additional examples. The experimental evaluation on the VOT challenge has demonstrated that GOTURN outperforms many state-of-the-art visual trackers in terms of overall rank, measured as an average of the accuracy rank and the robustness rank.

As a conclusion, I want to repeat that both methods are used for short-term tracking of an arbitrary object. As usual, we seek a balance between robustness and speed. Currently, the most accurate and robust methods are very slow, one frame per second or slower. The best-performing methods combine various image features or are based on convolutional neural networks. The latter tend to outperform all other approaches but require GPU acceleration. If no GPU is available, then non-CNN methods should be used and some accuracy should be sacrificed.
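To make the two-branch regression idea concrete, here is a rough PyTorch sketch of a GOTURN-style network; the layer sizes, the crop resolution, and the weight sharing between the two branches are illustrative assumptions, not the exact published architecture.

```python
import torch
import torch.nn as nn

class TwoBranchRegressor(nn.Module):
    """Regress the bounding box of the target in the current crop, given a crop
    of the previous frame centered on the target and a crop of the current frame."""

    def __init__(self):
        super().__init__()
        # Convolutional feature extractor applied to both crops (shared weights
        # here for brevity; layer sizes are illustrative).
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((6, 6)),
        )
        # Fully connected layers on the concatenated features of both crops.
        self.regressor = nn.Sequential(
            nn.Linear(2 * 256 * 6 * 6, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, 4),  # (x1, y1, x2, y2) of the target in the current crop
        )

    def forward(self, prev_crop, cur_crop):
        f_prev = self.features(prev_crop).flatten(1)
        f_cur = self.features(cur_crop).flatten(1)
        return self.regressor(torch.cat([f_prev, f_cur], dim=1))

# Example: a batch of eight 224x224 crop pairs produces eight box predictions.
# boxes = TwoBranchRegressor()(torch.rand(8, 3, 224, 224), torch.rand(8, 3, 224, 224))
```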