Single object tracking (SOT) is one of the fundamental problems in computer vision and has received extensive attention from scholars and industry professionals worldwide due to its important applications in intelligent video surveillance, human-computer interaction, autonomous driving, military target analysis, and other fields. Given a video sequence, a SOT method must predict the location and size of the target in subsequent frames, accurately and in real time, based on the initial state of the target (usually represented by a bounding box) in the first frame. Unlike object detection, the tracked target is not restricted to any predefined category, and tracking scenes are complex and diverse, involving challenges such as scale changes, occlusion, motion blur, and target disappearance. Tracking targets in real time, accurately, and robustly is therefore an extremely challenging task.
Mainstream object tracking methods can be divided into three categories: discriminative correlation filter (DCF)-based methods, Siamese network-based methods, and Transformer-based methods. The accuracy and robustness of DCF methods fall far short of practical requirements. Meanwhile, with the advancement of deep learning hardware, the traditional advantage of DCF methods, namely real-time operation on mobile devices, no longer holds. In contrast, deep learning techniques have developed rapidly in recent years alongside continuous improvements in computing performance and dataset scale. In particular, deep learning theory, deep backbone networks, attention mechanisms, and self-supervised learning have powerfully driven the development of object tracking methods. Deep learning-based SOT methods can make full use of large-scale datasets for end-to-end offline training to achieve real-time, accurate, and robust tracking. We therefore provide an overview of deep learning-based object tracking methods.
Some reviews of tracking methods already exist, but they lack coverage of Transformer-based tracking methods. Building on this existing work, we introduce the latest achievements in the field. In contrast to prior surveys, we divide tracking methods into two categories according to architecture type, i.e., Siamese network-based two-stream tracking methods and Transformer-based one-stream tracking methods, and we provide a comprehensive and detailed analysis of these two basic architectures, focusing on their principles, components, limitations, and development directions. In addition, since datasets are the cornerstone of method training and evaluation, we summarize the current mainstream deep learning-based SOT datasets, elaborate on their evaluation protocols and metrics, and summarize the performance of various methods on them. Finally, we analyze the future development trends of object tracking methods from a macro perspective, so as to provide a reference for researchers.
Deep learning-based tracking methods can be divided into two categories according to architecture type, namely the Siamese network-based two-stream tracking method and the Transformer-based one-stream tracking method. The essential difference between the two architectures is that the two-stream method uses a Siamese backbone network for feature extraction and a separate module for feature fusion, while the one-stream method uses a single backbone network for both feature extraction and feature fusion.
The Siamese network-based two-stream tracking method formulates tracking as a similarity matching problem between the target template and the search region, and it consists of three basic modules: feature extraction, feature fusion, and a tracking head. The pipeline is as follows: a weight-shared two-stream backbone network extracts features from the target template and the search region respectively; the two feature maps are fused for information interaction; and the fused features are fed to the tracking head, which outputs the target position, as sketched below. Across subsequent improvements to this method, feature extraction has evolved from shallow to deep networks, feature fusion from coarse to fine, and the tracking head from complex to simple. In addition, performance in complex backgrounds has gradually improved.
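To make the two-stream pipeline concrete, the following is a minimal sketch in the spirit of SiamFC-style cross-correlation trackers; the backbone layers, input resolutions, and fusion by cross-correlation are illustrative assumptions, not the exact configuration of any published tracker.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseTracker(nn.Module):
    def __init__(self):
        super().__init__()
        # Weight-shared backbone: the same module embeds both inputs.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(128, 256, kernel_size=3, stride=1),
        )

    def forward(self, template, search):
        z = self.backbone(template)   # (B, C, Hz, Wz) template features
        x = self.backbone(search)     # (B, C, Hx, Wx) search features
        # Feature fusion via cross-correlation: slide the template
        # features over the search features as a convolution kernel.
        b, c, h, w = z.shape
        x = x.reshape(1, b * c, *x.shape[-2:])
        score = F.conv2d(x, z.reshape(b * c, 1, h, w), groups=b * c)
        score = score.reshape(b, c, *score.shape[-2:]).sum(1, keepdim=True)
        # A tracking head would regress the box from this response map.
        return score  # (B, 1, Hr, Wr) similarity response map

tracker = SiameseTracker()
resp = tracker(torch.randn(2, 3, 127, 127), torch.randn(2, 3, 255, 255))
```

The key structural point is that fusion happens once, in a dedicated operation after the backbone; the backbone itself never mixes template and search information.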
The Transformer-based one-stream tracking method first splits the target template and the search frame into patches and flattens them into token sequences. These patch embeddings, combined with learnable position embeddings, are fed into a Transformer backbone network that performs feature extraction and feature fusion simultaneously; because fusion continues throughout the backbone, the network outputs search-region features that are already conditioned on the target (see the sketch below). Compared with two-stream networks, one-stream networks are structurally simple and require no task-specific prior knowledge. This task-agnostic design facilitates the construction of general-purpose neural network architectures for multiple tasks. Moreover, pre-training further improves the performance of one-stream methods: experimental results demonstrate that pre-trained models based on masked image modeling benefit these trackers.
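A minimal sketch of the one-stream idea follows: template and search patches are concatenated into a single token sequence, so every Transformer layer performs feature extraction and template-search fusion jointly. The patch size, embedding width, depth, and input resolutions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class OneStreamTracker(nn.Module):
    def __init__(self, patch=16, dim=256, depth=4, heads=8):
        super().__init__()
        # Split-and-flatten: a strided conv maps each patch to one token.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        n_z, n_x = (64 // patch) ** 2, (128 // patch) ** 2
        self.pos_z = nn.Parameter(torch.zeros(1, n_z, dim))  # learnable
        self.pos_x = nn.Parameter(torch.zeros(1, n_x, dim))  # position embeds
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.n_z = n_z

    def tokens(self, img):
        return self.patch_embed(img).flatten(2).transpose(1, 2)  # (B, N, dim)

    def forward(self, template, search):
        z = self.tokens(template) + self.pos_z
        x = self.tokens(search) + self.pos_x
        # Joint attention over the concatenated sequence fuses the two
        # streams inside every encoder layer, not in a separate module.
        out = self.encoder(torch.cat([z, x], dim=1))
        return out[:, self.n_z:]  # target-conditioned search features

model = OneStreamTracker()
feats = model(torch.randn(2, 3, 64, 64), torch.randn(2, 3, 128, 128))
```

Contrasting this with the two-stream sketch above, there is no standalone fusion operation: removing the concatenation would reduce the model to an ordinary ViT-style feature extractor.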
The one-stream tracking method, with its simple structure and powerful learning and modeling capability, represents the trend of future tracking research. Meanwhile, collaborative multi-task tracking, multi-modal tracking, scenario-specific tracking, and unsupervised tracking methods are in strong demand in practical applications.
In recent years, advancements in computing software and hardware have enabled artificial intelligence (AI) models to approach or surpass human performance on perceptual tasks. However, to develop mature AI systems that comprehensively understand the world, models must be capable of generating visual concepts, not merely recognizing them, because creation and customization require a thorough understanding of both the high-level semantics and the fine details of each generated object.
From an applied perspective, once AI models acquire the capability of visual understanding and generation, they will significantly promote progress across diverse industries. For example, visual generative models can be applied to: colorizing and restoring old black-and-white photos and films; enhancing and remastering old videos in high definition; synthesizing real-time virtual anchors, talking faces, and AI avatars; incorporating special effects into personalized video shooting on short video platforms; stylizing users' portraits and input images; and compositing movie special effects and scene rendering. Therefore, research on the theories and methods of image and video generation models holds great theoretical significance and industrial application value.
In this paper, we first provide a comprehensive overview of existing generative frameworks, including generative adversarial networks (GAN), variational autoencoders (VAE), flow models, and diffusion models, as summarized in Fig. 5. A GAN is trained in an adversarial manner, with a generator and a discriminator competing against each other to obtain an ideal generator. A VAE is composed of an encoder and a decoder and is trained via variational inference so that the decoded distribution approximates the real distribution. The flow model uses a family of invertible mappings and simple priors to construct an invertible transformation between the real data distribution and the prior distribution; unlike GANs and VAEs, flow models are trained by maximum likelihood estimation. Recently, diffusion models have emerged as a class of powerful visual generative models with state-of-the-art synthesis results on visual data. The diffusion model decomposes image generation into a sequence of denoising steps starting from a Gaussian prior. Its training is more stable because it avoids adversarial training strategies, and it can be successfully deployed in large-scale pre-trained generation systems.
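As a concrete illustration of the denoising decomposition, the following is a minimal sketch of a DDPM-style training step: a clean image is noised to a random timestep and the network is trained to predict the added noise. The linear schedule and the `model(x_t, t)` interface are illustrative assumptions, not a specific published configuration.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product of (1 - beta_t)

def diffusion_loss(model, x0):
    """model(x_t, t) is assumed to predict the Gaussian noise added to x0."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)     # random timesteps
    noise = torch.randn_like(x0)                        # sample from the Gaussian prior
    a = alphas_bar.to(x0.device)[t].view(b, 1, 1, 1)
    # Forward (noising) process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise
    # Simple (unweighted) denoising objective from DDPM.
    return torch.mean((model(x_t, t) - noise) ** 2)
```

Sampling reverses this chain: starting from pure Gaussian noise at step T, the trained network removes a little noise at each step until a clean image remains. Note that the loss is an ordinary regression objective, which is why training avoids the instabilities of adversarial setups.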
We then review recent state-of-the-art advances in image and video generation and discuss their merits and limitations. Fig. 6 shows an overview of image and video generation models and their classification. Works on pre-trained text-to-image (T2I) generation models study how to pre-train a T2I foundation model on large-scale datasets. Among these T2I foundation models, Stable Diffusion has become a widely used backbone for image/video customization and editing tasks, owing to its impressive performance and scalability. Prompt-based image editing methods aim to use a pre-trained T2I foundation model, e.g., Stable Diffusion, to edit a generated or natural image according to input text prompts. Due to the difficulty of collecting large-scale, high-quality video datasets and the expensive computational cost, research on video generation still lags behind image generation. Drawing on the success of text-to-image diffusion models, some works, e.g., the video diffusion model, Imagen Video, VIDM, and PVDM, have tried to use enormous amounts of video data to train a video diffusion model from scratch and obtain a video generation foundation model analogous to Stable Diffusion. Another line of work resorts to pre-trained image generators, e.g., Stable Diffusion, to provide a content prior for video generation and learns only the temporal dynamics from video, which significantly improves training efficiency.
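For readers unfamiliar with how such a pre-trained T2I foundation model is invoked in practice, the following is a hedged sketch of prompt-driven generation using the Hugging Face diffusers library; the model identifier, prompt, and parameter values are illustrative assumptions rather than recommendations from any of the surveyed works.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a pre-trained Stable Diffusion checkpoint (identifier is illustrative).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "an old black-and-white photo, restored and colorized",
    num_inference_steps=50,   # length of the iterative denoising chain
    guidance_scale=7.5,       # strength of the text-prompt conditioning
).images[0]
image.save("sample.png")
```

Editing and customization methods typically build on exactly this pipeline, injecting additional conditions (masks, reference images, or fine-tuned weights) rather than training a new model from scratch.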
Finally, we discuss the drawbacks of existing image and video generative modeling methods, such as misalignment between input prompts and generated images/videos; propose feasible strategies to improve these visual generative models; and outline promising future research directions. These contributions are crucial for advancing the field of visual generative modeling and realizing the full potential of AI systems in generating visual concepts.
With the rapid evolution of diffusion models, artificial intelligence has undergone a significant transformation from perception to creation. AI can now generate perceptually realistic and harmonious data, even allowing visual customization and editing based on input conditions. In light of this progress in generative models, we offer a prospect for the potential future form of AI: with both perceptual and cognitive abilities, AI models could establish their own open world, enabling people to realize the concept of "what they think is what they get" without being constrained by real-life conditions. For example, in such an open environment, the training of AI models would no longer be restricted by data collection, leading to a reformation of many existing paradigms in machine learning; techniques like transfer learning (domain adaptation) and active learning may diminish in importance. AI might achieve self-interaction, self-learning, and self-improvement within the open world it creates, ultimately attaining higher levels of intelligence and profoundly transforming human lifestyles.