In this video, I will introduce the visual object tracking problem. Consider a video with some moving objects in it; it may be captured by a static or a moving camera. Object tracking is the process of locating a moving object, or multiple objects, over time in the video. The output of object tracking is an object track: the sequence of the object's locations in each frame of the video.

Visual object tracking considers the problem of tracking a single object in a video. The object to track is specified in the first frame. We know nothing about the object except its location in that frame, so this is model-free tracking. From this single annotated frame we can build a detector to find the object in the following frames. Because the object's appearance changes over time, we consider only short-term tracking on short video sequences. A visual object tracking method may use only previous frames, with no glimpses into the future.

There are several challenges in developing visual tracking methods. The first is computational load: each second of video contains dozens of frames to process. The second is the change of the object's appearance over time, due to object dynamics, viewpoint changes, lighting changes, and other factors. The third is object interaction in the video: other objects can occlude the object of interest, or can look similar to it, in which case the tracker has to distinguish between them.

Preparing ground-truth data for visual object tracking is also a complicated process, though easier than for optical flow estimation, because a human operator can usually track an object rather well, if not in real time. But it is significantly harder than annotation for image classification or object detection, because one example for object detection is a single image, while one example for object tracking is a whole video.
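The model-free, online setting described above can be summarized as a small interface sketch. Everything here is illustrative (the class and method names are assumptions, not from any real tracking library): the tracker is initialized with a single bounding box in the first frame, and is then updated one frame at a time, with no access to future frames.

```python
import numpy as np

class ModelFreeTracker:
    """Illustrative interface for model-free, short-term tracking:
    initialized from a single box in the first frame, then updated
    one frame at a time (no future frames available)."""

    def __init__(self, first_frame, box):
        # All we know about the target is its box in the first frame.
        self.box = box                      # (x, y, w, h)
        self.template = self._crop(first_frame, box)
        self.track = [box]                  # output: one box per frame

    @staticmethod
    def _crop(frame, box):
        x, y, w, h = box
        return frame[y:y + h, x:x + w]

    def update(self, frame):
        # A real tracker would search around the previous box and
        # adapt its appearance model; this stub keeps the box fixed.
        self.track.append(self.box)
        return self.box
```

After processing the whole video, `track` holds the object track: one location per frame, which is exactly the output described above.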
There are a number of datasets, but the amount of data is still limited, especially if we want to train CNN models from scratch. Currently, the main benchmark for evaluating visual object trackers is the Visual Object Tracking (VOT) challenge. Its evaluation is based on several key ideas. First, open-source implementations should be available for all methods, and a Matlab toolkit is provided for algorithm evaluation. Both accuracy and speed are evaluated. The dataset should be small but diverse, and should consist only of short videos of about 100 frames, because visual object tracking is short term in general. The videos are selected by first collecting a large number of videos, then clustering them into groups of similar videos, and then picking the best representatives from each cluster, which yields a small but diverse set of examples.

To reduce human annotation errors, a two-step annotation procedure is used. First, the target object region is segmented by semi-automatic image segmentation methods. Then a bounding box is fit automatically by optimizing a cost function. For each video frame, a number of attributes are marked, including object occlusion, object motion, object size change, illumination change, and camera motion. This allows evaluating the performance of an algorithm under various conditions.

To evaluate tracking accuracy, we can compute the average overlap between the ground-truth and predicted bounding boxes during successful tracking. The measure of robustness is the number of times a tracker drifts off target. In the VOT challenge, the expected average overlap is used to merge accuracy and robustness into one metric. When the tracker's overlap reaches zero, the tracker is re-initialized and tracking continues, so each drift off target penalizes the tracking accuracy. In this way, both robustness and accuracy are captured in one measure.
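The accuracy and robustness bookkeeping can be sketched as follows. This is a simplified illustration, not the VOT toolkit's actual code: boxes are assumed to be `(x, y, w, h)` tuples, overlap is intersection-over-union, and a frame with zero overlap counts as one failure (the point where the toolkit would re-initialize the tracker).

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def accuracy_and_robustness(gt_boxes, pred_boxes):
    """Accuracy: average overlap over frames where tracking succeeded.
    Robustness proxy: number of frames where overlap dropped to zero."""
    overlaps = [iou(g, p) for g, p in zip(gt_boxes, pred_boxes)]
    failures = sum(1 for o in overlaps if o == 0.0)
    successes = [o for o in overlaps if o > 0.0]
    accuracy = sum(successes) / len(successes) if successes else 0.0
    return accuracy, failures
```

Note that failed frames are excluded from the accuracy average, which is why a separate failure count is needed: otherwise a tracker could not be penalized for drifting off target.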
As a metric for tracking speed, the equivalent filter operation (EFO) is used. The idea is to reduce hardware bias by reporting tracking speed relative to the time required to perform a reference filtering operation. The reference is a MAX filter applied in a 30-by-30 window to all pixels of a 600-by-600 image; the tracking time is divided by the time of this MAX filter.

In 2016, almost 70 methods were evaluated in this challenge. The top-performing methods in terms of accuracy were all based on convolutional neural networks. Other methods are based on elaborate correlation filters, which also show good performance, but a little worse than convolutional neural networks. In terms of speed, however, the top-performing methods were the slowest ones.

It is interesting to look at the most and the least challenging examples in this challenge. The most challenging examples include tracking a person's head in the Matrix sequence, a white rabbit in white snow, and a butterfly among flowers. The least challenging are tracking a singer in a white dress in a dark scene, an octopus on the sandy sea floor, and a sheep in its herd.
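The speed normalization above can be sketched in a few lines. This is a rough illustration under stated assumptions: the MAX-filter implementation below (a separable sliding-window maximum via NumPy) is my own choice, not the toolkit's, and the conversion function simply divides per-frame tracking time by the reference filter time, as described above.

```python
import time
import numpy as np

def time_max_filter(size=600, window=30):
    """Time one MAX-filter pass with a window x window kernel over a
    random size x size image. A max filter is separable, so we apply
    it along rows, then along columns."""
    img = np.random.rand(size, size).astype(np.float32)
    start = time.perf_counter()
    rows = np.lib.stride_tricks.sliding_window_view(img, window, axis=1).max(-1)
    np.lib.stride_tricks.sliding_window_view(rows, window, axis=0).max(-1)
    return time.perf_counter() - start

def seconds_to_efo(seconds_per_frame, filter_seconds):
    """Normalized per-frame tracking cost in equivalent filter
    operations: tracking time divided by the reference filter time."""
    return seconds_per_frame / filter_seconds
```

A tracker spending 50 ms per frame on hardware where the reference filter takes 10 ms would thus cost 5 EFO units per frame, regardless of how fast that hardware is in absolute terms.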