• Journal of Electronic Science and Technology
  • Vol. 23, Issue 2, 100314 (2025)
Xiao-Gang Dong1, Ke-Xuan Li2,*, Hong-Xia Mao1, Chen Hu2, and Tian Pu2,*
Author Affiliations
  • 1National Key Laboratory of Scattering and Radiation, Beijing, 100854, China
  • 2School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China
    DOI: 10.1016/j.jnlest.2025.100314
    Xiao-Gang Dong, Ke-Xuan Li, Hong-Xia Mao, Chen Hu, Tian Pu. Robust visual tracking using temporal regularization correlation filter with high-confidence strategy[J]. Journal of Electronic Science and Technology, 2025, 23(2): 100314

    Abstract

    Target tracking is an essential task in contemporary computer vision applications. However, its effectiveness is susceptible to model drift, due to the different appearances of targets, which often compromises tracking robustness and precision. In this paper, a universally applicable method based on correlation filters is introduced to mitigate model drift in complex scenarios. It employs temporal-confidence samples as a priori to guide the model update process and ensure its precision and consistency over a long period. An improved update mechanism based on the peak side-lobe to peak correlation energy (PSPCE) criterion is proposed, which selects high-confidence samples along the temporal dimension to update temporal-confidence samples. Extensive experiments on various benchmarks demonstrate that the proposed method achieves a competitive performance compared with the state-of-the-art methods. Especially when the target appearance changes significantly, our method is more robust and can achieve a balance between precision and speed. Specifically, on the object tracking benchmark (OTB-100) dataset, compared to the baseline, the tracking precision of our model improves by 8.8%, 8.8%, 5.1%, 5.6%, and 6.9% for background clutter, deformation, occlusion, rotation, and illumination variation, respectively. The results indicate that this proposed method can significantly enhance the robustness and precision of target tracking in dynamic and challenging environments, offering a reliable solution for applications such as real-time monitoring, autonomous driving, and precision guidance.

    1 Introduction

    Target tracking is an important research area in computer vision [1]. It is widely used in robotics [2], security surveillance [3], visual navigation [4], precision guidance [5], and other areas [6,7]. Among the current tracking methods, those based on correlation filters (CFs) are especially intriguing, due to their high efficiency and broad usability [8,9]. In 2010, Bolme et al. [10] proposed the first CF-based tracking method by utilizing the minimum output sum of squared error (MOSSE) filter. Later, two representative methods were successively reported by Henriques et al. [11,12]. One is based on the circulant structure of tracking-by-detection with kernels (CSK) that exploits large amounts of cyclically shifted samples for learning [11]. The other is based on a kernelized correlation filter (KCF) by adding a kernel mechanism [12]. These studies greatly stimulated the development of CF-based methods. However, early CF-based methods mainly adopt a cyclic sampling process, where unrealistic samples are periodically taken [11,12]. Thus, they are apt to produce boundary effects. To overcome this limitation, Danelljan et al. [13] proposed an algorithm based on the spatially regularized discriminative correlation filters (SRDCF) and utilized the spatial regularization term to constrain the filter, in order to fit the actual target. Galoogahi et al. [14] proposed background-aware correlation filters (BACF) based on the idea that background redundancy information in the correlation filters with limited boundaries (CFLB) [15] can alleviate the boundary effects. However, neither adding spatial regularization nor adding more background redundancy information can prevent model drift when the target appearance changes significantly, eventually leading to tracking failure. This model drift mainly originates from the following two aspects: i) Only the information in the current frame is considered. When the target appearance changes significantly, the update direction of the filter cannot be effectively constrained. ii) There is no validation for filter model updates. For example, when a tracking target is occluded, the occlusion may be misrecognized as the target, and updating the filter parameters will lead to identifying the wrong target.

    To alleviate the model drift problem, introducing the passive-aggressive (PA) idea into SRDCF to constrain the filter update has been widely demonstrated to be viable [16,17]. PA is a method to enhance the robustness of the model by introducing adversarial perturbations. Its core is to optimize the model update process by simulating potential attacks, thereby improving the model’s adaptability to the diverse targets with different appearances. However, this constraint relies solely on the prior information from the previous frame, whose efficacy is unreliable. For example, when the tracking target is occluded, the spatial-temporal regularized correlation filters (STRCF) [16] will continue to use the previous occlusion frame and update CF with the wrong samples. As a result, the updating errors are gradually accumulated during the tracking process, finally resulting in tracking failure. This is even worse than SRDCF without prior information. Based on SRDCF, Danelljan et al. [17] further optimized the spatially regularized discriminative correlation filter with decontamination (SRDCFdecon) by integrating a multi-sample weighted update strategy to effectively mitigate model drift. However, the calculation of SRDCFdecon is complicated due to the additional multiple optimization processes, which diminishes the high-speed advantage of CFs. Although great achievements have been made in recent years, a filter that can achieve acceptable precision and robustness simultaneously is still desirable, especially when the target changes significantly.

    As an alternative, BACF is attractive due to its high efficiency of multi-channel parallel computation. Moreover, it can alleviate boundary effects by sampling with redundant background information. However, it does not take temporal information into consideration, leading to errors in selecting temporal-confidence samples. To address this issue, this paper proposes a target-tracking method based on the temporal regularization CF, named the temporal-regulation background-aware correlation filter (TBACF), to mitigate model drift caused by significant changes in the target appearance. TBACF adopts BACF as a baseline and simultaneously integrates the temporal regularization term and the peak side-lobe to peak correlation energy (PSPCE)-based updating strategy. Due to the convexity of the temporal regularization term, the alternating direction method of multipliers (ADMM) [29] is used to improve the computational efficiency of TBACF. The contributions of this paper are as follows:

    • A new temporal regularization term is proposed for CF-based tracking methods, based on temporal-confidence samples generated by high-confidence samples selected along the time axis. Such samples ensure the precision of filter updates in the tracking process, thus eliminating the negative influence of variations in the target appearance on the model performance.

    • An innovative update strategy derived from the PSPCE criterion is introduced, which is able to effectively select high-confidence samples over time, update the temporal-confidence samples, and maintain the integrity of the tracking process.

    • By comparison with state-of-the-art (SOTA) methods, it has been demonstrated that the proposed method is superior in precision, robustness, and speed and exhibits a strong adaptability to broad applications. This means that the proposed temporal regularization term and update strategy can be integrated into other baseline models to improve their performance.

    2 Related work

    Although various visual tracking methods have been reported, CF-based methods are highly attractive for their high efficiency and precision. MOSSE is the first method that applies CF to the tracking task [10]. Subsequently, CSK simplifies the computation of filter parameters in the frequency domain and significantly improves the computational efficiency by introducing a kernel mechanism and cyclic matrices [11]. Based on CSK, KCF further improves the tracking precision by leveraging the histogram of oriented gradients (HOG) features for filter training [12].

    On the basis of these prior experiences, different advanced tracking techniques have been realized to conquer different aspects of the challenge. For instance, Bertinetto et al. [19] reported a method named the sum of template and pixel-wise learners (Staple), which integrates template learning with pixel-wise learners and incorporates color histograms to improve tracking robustness. It is highly efficient and particularly desirable in environments where color plays a crucial role, such as wildlife monitoring and urban surveillance. Using a scale pool, Li and Zhu [18] introduced scale adaptation to improve the tracking performance across various image scales and named this method scale adaptive with multiple features (SAMF). SAMF is crucial for applications, such as traffic monitoring, where vehicles may appear at different distances from the camera. Danelljan et al. [20] realized a discriminative scale space tracker (DSST), which offers an advanced approach to scale management by using separate filters for position and scale. It is suitable for fine-grained size adjustments in retail and crowd monitoring.

    Currently, the performance of tracking systems used in dynamic environments like public spaces has been significantly improved by innovative methods. For example, the discriminative CF with channel and spatial reliability (CSR-DCF) [21] employs spatially constrained masks to address boundary effects, and SRDCF [13] expands the feature learning area through spatial regularization. Both of them can enhance the system’s robustness. Deep learning technology has also been introduced into CF-based tracking methods. By replacing traditional HOG features with convolutional layer outputs, the deep spatially regularized discriminative correlation filter (DeepSRDCF) [13] largely boosts the tracking performance in complex scenarios, such as security surveillance. By using multi-layer convolutional features, the hierarchical convolutional feature (HCF) [22] is able to improve the efficiency of CF-based methods for diverse applications, such as automated manufacturing and sports analytics.

    Focusing on specific challenges in modern tracking tasks, such as tracking objects in drastically changing or occluded environments, Cai et al. [23] proposed the multi-object tracking with memory (MeMOT) strategy, which employs spatiotemporal memory to improve the tracking performance. Qin et al. [24] proposed MotionTrack for multi-object tracking by learning robust short-term and long-term motions. Van Hoorick et al. [25] proposed an effective strategy named tracking through containers and occluders in the wild (TCOW) to reinforce tracking capabilities. Ren et al. [26] explored fine-grained object representations through a combination of a flow alignment feature pyramid network (FAFPN), a multi-head part mask generator, and a shuffle-group sampling strategy, and demonstrated its superior performance on benchmark datasets.

    However, the above-mentioned methods suffer from model drift when there are significant changes in the target appearance. To conquer this problem, Ma et al. [27] introduced a long-term correlation tracking (LCT) algorithm, which incorporates spatiotemporal context to maintain the precision in background clutter and occlusions. Similarly, SRDCFdecon [17] integrates a strategy of weighting multiple frames based on the calculated loss function, which can effectively mitigate model drift. However, both of them exhibit a high computational load, and they still face certain limitations, especially when the target appearance changes significantly. Although STRCF can stabilize filter changes over time by using previous frame data in current updates [16], cumulative errors will be generated under the condition of long-time occlusions, causing the tracker to aim at incorrect targets. In summary, although existing methods have alleviated the problem of model drift to some extent, how to enhance the robustness and precision of models while maintaining low computational complexity when the target appearance undergoes significant changes remains a crucial issue that urgently needs to be addressed in this field.

    3 Temporal-regulation background-aware correlation filter

    In this section, BACF is first introduced, followed by the principle of the proposed TBACF with the temporal regularization term and the optimization algorithm; finally, the ADMM-based solution of TBACF is stated.

    3.1 Background-aware correlation filter

    BACF is a common method used to mitigate boundary effects encountered in CF approaches. The objective function for BACF is defined as

    $ E \left( {\boldsymbol{h}} \right) = \frac{1}{2}\sum\limits_{j = 1}^M {\left\| {{\boldsymbol{y}} \left( j \right) - \sum\limits_{k = 1}^K {{{\boldsymbol{h}}^{\left( k \right){\text{T}}}}{\boldsymbol{P}}{{\boldsymbol{x}}^{\left( k \right)}} \left( j \right)} } \right\|_2^2 + \frac{\lambda }{2}} \sum\limits_{k = 1}^K {\left\| {{{\boldsymbol{h}}^{\left( k \right)}}} \right\|_2^2} $ (1)

    where $ {\Vert \cdot \Vert }_{2} $ represents the $ {l_2} $-norm of any variable; $ {\boldsymbol{y}} \in {\mathbb{R}^M} $ is the correlation output with a peak centered upon the target of interest, and $ {\boldsymbol{y}} \left( j \right) $ is the j-th element of $ {\boldsymbol{y}} $; $ {\boldsymbol{h}} $ denotes CF, whose parameters can be obtained by minimizing $ E({\boldsymbol{h}}) $; $ {{\boldsymbol{x}}^{\left( k \right)}} \in {\mathbb{R}^M} $ and $ {{\boldsymbol{h}}^{\left( k \right)}} \in {\mathbb{R}^D} $ refer to the k-th channels of the vectorized image and filter on the real number field $\mathbb{R}$, respectively; ${\boldsymbol{P}}$ is a $ D \times M $ binary matrix and extracts a region sized $ D $ from $ {{\boldsymbol{x}}^{\left( k \right)}} $; $ M $ and $ D $ denote the numbers of elements in the vectorized input and filter, respectively; $ K $ is the number of feature channels; $ \lambda $ is a parameter used to adjust the influence of the regularization term $ \dfrac{\lambda }{2}\sum\limits_{k = 1}^K {\left\| {{{\boldsymbol{h}}^{\left( k \right)}}} \right\|_2^2} $, which controls the magnitude of filter weights $ {{\boldsymbol{h}}^{\left( k \right)}} $; $ {\boldsymbol{P}}{{\boldsymbol{x}}^{\left( k \right)}} \left( j \right) $ returns all possible image patches with the size of $ D $; the operator $ {\text{T}} $ denotes the conjugate transpose.

    Obviously, (1) only considers the information of the current frame during the tracking process and does not use prior temporal information to constrain the filter update when the target appearance changes significantly.

    3.2 Principle of temporal-regulation background-aware correlation filter

    To eliminate the negative effects of drastic changes between neighboring frames, TBACF is proposed in this paper. It can ensure the precision of the current tracking target as much as possible and simultaneously update the filter to be similar to the previous one. Inspired by PA [28], a temporal regularization term is proposed based on a temporal-confidence sample, namely $\left\| {{{\boldsymbol{h}}^{\left( k \right)}} - {\boldsymbol{h}}_{{\text{tc}}}^{\left( k \right)}} \right\|_2^2$. Therefore, TBACF can be written as

    $ E \left( {\boldsymbol{h}} \right) = \frac{1}{2}\sum\limits_{j = 1}^M {\left\| {{\boldsymbol{y}} \left( j \right) - \sum\limits_{k = 1}^K {{{\boldsymbol{h}}^{\left( k \right){\text{T}}}}{\boldsymbol{P}}{{\boldsymbol{x}}^{\left( k \right)}} \left( j \right)} } \right\|_2^2 + \frac{\gamma }{2}} \sum\limits_{k = 1}^K {\left\| {{{\boldsymbol{h}}^{\left( k \right)}} - {\boldsymbol{h}}_{{\text{tc}}}^{\left( k \right)}} \right\|_2^2 + \frac{\lambda }{2}\sum\limits_{k = 1}^K {\left\| {{{\boldsymbol{h}}^{\left( k \right)}}} \right\|_2^2} } $ (2)

    where ${{\boldsymbol{h}}_{{\text{tc}}}}$ refers to the temporal-confidence sample and $ \gamma $ refers to the regularization parameter.
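    To make the roles of the three terms in (2) concrete, the following minimal numpy sketch evaluates the objective directly in the spatial domain. It assumes that $ {{\boldsymbol{x}}^{\left( k \right)}}\left( j \right) $ denotes the j-th cyclic shift of the vectorized search region and that the binary cropping matrix ${\boldsymbol{P}}$ is represented by an index vector; all names are illustrative, and the actual solver works in the frequency domain as described in subsection 3.3.

```python
import numpy as np

def tbacf_objective(h, h_tc, x, y, crop, lam=0.01, gamma=1250.0):
    """Spatial-domain evaluation of the TBACF objective in (2), for illustration only.

    h, h_tc : (K, D) arrays, current filter and temporal-confidence sample
    x       : (K, M) array, vectorized search-region features per channel (M >= D)
    y       : (M,) array, desired Gaussian-shaped correlation output
    crop    : (D,) integer indices implementing the binary cropping matrix P
    """
    K, M = x.shape
    data_term = 0.0
    for j in range(M):                        # every cyclic shift of the search region
        resp = 0.0
        for k in range(K):
            patch = np.roll(x[k], -j)[crop]   # P x^(k)(j): crop the j-th shifted sample
            resp += h[k] @ patch              # h^(k)T P x^(k)(j)
        data_term += 0.5 * (y[j] - resp) ** 2
    temporal_term = 0.5 * gamma * np.sum((h - h_tc) ** 2)   # temporal regularization
    ridge_term = 0.5 * lam * np.sum(h ** 2)                  # standard l2 regularization
    return data_term + temporal_term + ridge_term
```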

    The system function in (2) is denoted as $ \boldsymbol{\theta} $, and the loss function in the form of least squares is ${\left\| {{\boldsymbol{X}}\boldsymbol{\theta} - {\boldsymbol{y}}} \right\|^2}$, where ${\boldsymbol{X}}$ is the input feature matrix of $ {\boldsymbol{x}} $. Minimizing this loss yields the closed-form solution $ {\boldsymbol{\theta}} = {\left( {{{\boldsymbol{X}}^{\rm{T}}}{\boldsymbol{X}}} \right)^{ - 1}}{{\boldsymbol{X}}^{\rm{T}}}{\boldsymbol{y}} $. However, when ${\boldsymbol{X}}$ is not full rank, ${{\boldsymbol{X}}^{\text{T}}}{\boldsymbol{X}}$ is close to singular, the optimization problem becomes ill-posed, and its solution lacks stability and reliability.
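    The following toy numpy example, with made-up numbers, illustrates this point: when ${{\boldsymbol{X}}^{\text{T}}}{\boldsymbol{X}}$ is nearly singular, the plain normal-equation solution becomes unstable, while a small regularization term (as the $\lambda$ and $\gamma$ terms in (2) do for the filter) keeps the solution well behaved.

```python
import numpy as np

# Illustrative values: X is nearly rank-deficient, so X^T X is close to singular.
X = np.array([[1.0, 2.0],
              [2.0, 4.0001]])
y = np.array([1.0, 2.0])

theta_plain = np.linalg.solve(X.T @ X, X.T @ y)                      # ill-conditioned, unstable entries
theta_ridge = np.linalg.solve(X.T @ X + 0.01 * np.eye(2), X.T @ y)   # small regularizer keeps it stable
print(theta_plain, theta_ridge)
```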

    It is obvious that the essential difference between TBACF and BACF is the temporal regularization term $\left\| {{{\boldsymbol{h}}^{\left( k \right)}} - {\boldsymbol{h}}_{{\text{tc}}}^{\left( k \right)}} \right\|_2^2$. This term enables the updating direction of CF to be close to the previous high-confidence sample. Therefore, model drift caused by significant changes in target appearance can be mitigated. In addition, the PSPCE-based updating strategy is introduced in TBACF to update the temporal-confidence sample, thus ensuring the precision of the temporal regularization term.

    3.3 Optimization algorithm

    Solving the objective function of CF is usually performed in the frequency domain [12,14]. Thus, (2) is further converted into the frequency domain and expressed as

    $ E\left( {{\boldsymbol{h}}{\mathrm{,}}{\text{ }}\hat{\boldsymbol{g}}} \right) = \frac{1}{2}\left\| {\hat{\boldsymbol{y}} - \hat{\boldsymbol{X}}\hat{\boldsymbol{g}}} \right\|_2^2 + \frac{\lambda }{2}\left\| {\boldsymbol{h}} \right\|_2^2 + \frac{\gamma }{2}\left\| {{\boldsymbol{h}} - {{\boldsymbol{h}}_{{\text{tc}}}}} \right\|_2^2 \;\;\;{\text{s}}{\text{.t}}{\text{.}}\;\;\;\hat{\boldsymbol{g}} = \sqrt M \left( {{\boldsymbol{F}}{{\boldsymbol{P}}^{\text{T}}} \otimes {{\boldsymbol{I}}_K}} \right){\boldsymbol{h}} $ (3)

    where the symbol $ ^ \wedge $ denotes the discrete Fourier transform of a given signal, like $\hat{\boldsymbol{b}} = \sqrt M {\boldsymbol{Fb}}$, where ${\boldsymbol{F}}$ is an orthonormal $ M \times M $ matrix of complex basis vectors utilized to map any M-dimensional vectorized signal into the Fourier domain; $\hat{\boldsymbol{g}}$ is an auxiliary variable; ${\boldsymbol{h}}$, $\hat{\boldsymbol{g}}$, and ${{\boldsymbol{h}}_{{\text{tc}}}}$ are defined as ${\boldsymbol{h}} = {\left[ {{{\boldsymbol{h}}^{\left( 1 \right){\text{T}}}}{\mathrm{,}}{\text{ }}{{\boldsymbol{h}}^{\left( 2 \right){\text{T}}}}{\mathrm{,}}{\text{ }} \cdots {\mathrm{,}}{\text{ }}{{\boldsymbol{h}}^{\left( K \right){\text{T}}}}} \right]^{\text{T}}}$, $ \hat{\boldsymbol{g}}= \left[ \text{conj}(\hat{\boldsymbol{g}}^{\left(1\right)\text{T}})\mathrm{,}\text{ conj}(\hat{\boldsymbol{g}}^{\left(2\right)\text{T}})\mathrm{,}\text{ }\cdots\mathrm{,}\text{ conj}(\hat{\boldsymbol{g}}^{\left(K\right)\text{T}}) \right]^{\text{T}} $ with $ \text{conj(} \cdot \text{)} $ being the conjugate of a vector, and ${{\boldsymbol{h}}_{{\text{tc}}}} = {\left[ {{\boldsymbol{h}}_{{\text{tc}}}^{(1){\text{T}}}{\mathrm{,}}{\text{ }}{\boldsymbol{h}}_{{\text{tc}}}^{(2){\text{T}}}{\mathrm{,}}{\text{ }} \cdots {\mathrm{,}}{\text{ }}{\boldsymbol{h}}_{{\text{tc}}}^{(K){\text{T}}}} \right]^{\text{T}}}$, which are all generated by concatenating their K corresponding vectorized channels and have dimensions of $ KD \times 1 $, $ KM \times 1 $, and $ KD \times 1 $, respectively; ${{\boldsymbol{I}}_K}$ is the $ K \times K $ identity matrix; $ \otimes $ represents the Kronecker product; $\hat{\boldsymbol{X}} = \left[ {{\text{diag}}{{({{\hat{\boldsymbol{x}}}_1})}^{\text{T}}}{\mathrm{,}}{\text{ diag}}{{({{\hat{\boldsymbol{x}}}_2})}^{\text{T}}}{\mathrm{,}}{\text{ }} \cdots {\mathrm{,}}{\text{ diag}}{{({{\hat{\boldsymbol{x}}}_K})}^{\text{T}}}} \right]$. The model in (3) is convex and can be minimized by ADMM to obtain the globally optimal solution. The augmented Lagrangian form of (3) can be formulated as

    $ L\left( {{\boldsymbol{h}}{\mathrm{,}}{\text{ }}\hat{\boldsymbol{g}}{\mathrm{,}}{\text{ }}\hat{\boldsymbol{\xi}}} \right) = \frac{1}{2}\left\| {\hat{\boldsymbol{y}} - \hat{\boldsymbol{X}}\hat{\boldsymbol{g}}} \right\|_2^2 + \frac{\lambda }{2}\left\| {\boldsymbol{h}} \right\|_2^2 + \frac{\gamma }{2}\left\| {{\boldsymbol{h}} - {{\boldsymbol{h}}_{{\text{tc}}}}} \right\|_2^2 + {\hat{\boldsymbol{\xi}}^{\text{T}}}\left( {\hat{\boldsymbol{g}} - \sqrt M \left( {{\boldsymbol{F}}{{\boldsymbol{P}}^{\text{T}}} \otimes {{\boldsymbol{I}}_K}} \right){\boldsymbol{h}}} \right) + \frac{\mu }{2}\left\| {\hat{\boldsymbol{g}} - \sqrt M \left( {{\boldsymbol{F}}{{\boldsymbol{P}}^{\text{T}}} \otimes {{\boldsymbol{I}}_K}} \right){\boldsymbol{h}}} \right\|_2^2 $ (4)

    where $ \mu $ is the penalty factor and $ \hat{\boldsymbol{\xi}} = {\left[ {\hat{\boldsymbol{\xi}}_1^{\text{T}}{\mathrm{,}}{\text{ }}\hat{\boldsymbol{\xi}}_2^{\text{T}}{\mathrm{,}}{\text{ }} \cdots {\mathrm{,}}{\text{ }}\hat{\boldsymbol{\xi}}_K^{\text{T}}} \right]^{\text{T}}} $ is the Lagrangian vector in the Fourier domain. Then ADMM is adopted by alternately solving the following two subproblems (5a) and (5b):

    $ {{\boldsymbol{h}}^*} = {\mathrm{arg}}\, \mathop {\mathrm{min}} \limits_{\boldsymbol{h}} \left\{ {\frac{\lambda }{2}\left\| {\boldsymbol{h}} \right\|_2^2 + \frac{\gamma }{2}\left\| {{\boldsymbol{h}} - {{\boldsymbol{h}}_{{\text{tc}}}}} \right\|_2^2 + {{\hat{\boldsymbol{\xi}}}^{\text{T}}}\left( {\hat{\boldsymbol{g}} - \sqrt M \left( {{\boldsymbol{F}}{{\boldsymbol{P}}^{\text{T}}} \otimes {{\boldsymbol{I}}_K}} \right){\boldsymbol{h}}} \right) + \frac{\mu }{2}\left\| {\hat{\boldsymbol{g}} - \sqrt M \left( {{\boldsymbol{F}}{{\boldsymbol{P}}^{\text{T}}} \otimes {{\boldsymbol{I}}_K}} \right){\boldsymbol{h}}} \right\|_2^2} \right\} = {\left( {M\mu + \lambda + \gamma } \right)^{ - 1}}\left[ {\sqrt M {{\left( {{\boldsymbol{F}}{{\boldsymbol{P}}^{\text{T}}} \otimes {{\boldsymbol{I}}_K}} \right)}^{\text{T}}}\left( {\mu \hat{\boldsymbol{g}} + \hat{\boldsymbol{\xi}}} \right) + \gamma {{\boldsymbol{h}}_{{\text{tc}}}}} \right] $ (5a)

    $ \hat{\boldsymbol{g}}^* = {\mathrm{arg}}\, \mathop {\mathrm{min}} \limits_{\hat{\boldsymbol{g}}} \left\{ \frac{1}{2}\left\| {\hat{\boldsymbol{y}} - \hat{\boldsymbol{X}}\hat{\boldsymbol{g}}} \right\|_2^2 + {{\hat{\boldsymbol{\xi}}}^{\text{T}}}\left( {\hat{\boldsymbol{g}} - \sqrt M \left( {{\boldsymbol{F}}{{\boldsymbol{P}}^{\text{T}}} \otimes {{\boldsymbol{I}}_K}} \right){\boldsymbol{h}}} \right) + \frac{\mu }{2} {\left\| {\hat{\boldsymbol{g}} - \sqrt M \left( {{\boldsymbol{F}}{{\boldsymbol{P}}^{\text{T}}} \otimes {{\boldsymbol{I}}_K}} \right){\boldsymbol{h}}} \right\|_2^2} \right\} $ (5b)

    Equation (5) satisfies two conditions for ADMM convergence [29]:

    • $\dfrac{1}{2}\left\| {\hat {\boldsymbol{y}} - \hat{\boldsymbol{X}}\hat{\boldsymbol{g}}} \right\|_2^2$ and $\dfrac{\lambda }{2}\left\| {\boldsymbol{h}} \right\|_2^2 + \dfrac{\gamma }{2}\left\| {{\boldsymbol{h}} - {{\boldsymbol{h}}_{{\text{tc}}}}} \right\|_2^2$ are convex and closed;

    • The Lagrange function (non-augmented) ${L_0}{\text{ }}\left( {\mu = 0} \right)$ has at least one saddle point.

    For the subproblem (5a), ${\boldsymbol{F}}{{\boldsymbol{P}}^{\text{T}}} \otimes {{\boldsymbol{I}}_K}$ can be decomposed into K independent inverse fast Fourier transform (IFFT) computations of ${{\boldsymbol{g}}_k} = \left( {1/\sqrt M } \right){\boldsymbol{P}}{{\boldsymbol{F}}^{\text{T}}}{\hat{\boldsymbol{g}}_k}$ and ${{\boldsymbol{\xi}}_k} = \left( {1/\sqrt M } \right){\boldsymbol{P}}{{\boldsymbol{F}}^{\text{T}}}{\hat{\boldsymbol{\xi}}_k}$, and the temporal regularization term $\left\| {{{\boldsymbol{h}}^{\left( k \right)}} - {\boldsymbol{h}}_{{\text{tc}}}^{\left( k \right)}} \right\|_2^2$ will not change the original computational complexity. Hence (5a) is bounded by $ O\left( {KM {\mathrm{log}} M} \right) $.
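    A minimal numpy sketch of this closed-form h-update is given below. It follows the form of (5a) above and the per-channel IFFT identities, under the convention that $\hat{\boldsymbol{b}} = \sqrt M {\boldsymbol{Fb}}$ corresponds to numpy's fft; the helper names and the index-based cropping are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def solve_h(g_hat, xi_hat, h_tc, crop, mu, lam=0.01, gamma=1250.0):
    """Closed-form update of subproblem (5a); one IFFT per channel, O(KM log M).

    g_hat, xi_hat : (K, M) complex arrays, Fourier-domain auxiliary variable and Lagrangian
    h_tc          : (K, D) array, temporal-confidence sample
    crop          : (D,) integer indices implementing the binary cropping matrix P
    """
    K, M = g_hat.shape
    # With b_hat = sqrt(M)*F*b (numpy's fft), (1/sqrt(M)) P F^T b_hat reduces to cropping ifft(b_hat);
    # the real part is taken because the spatial-domain signals are real.
    g = np.real(np.fft.ifft(g_hat, axis=1))[:, crop]
    xi = np.real(np.fft.ifft(xi_hat, axis=1))[:, crop]
    return (M * (mu * g + xi) + gamma * h_tc) / (M * mu + lam + gamma)
```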

    For the subproblem (5b), it has the complexity of $ O({M^3}{K^3}) $, making real-time tracking unattainable. For more efficient calculation, note that each element of $\hat{\boldsymbol{y}}\left( {\hat{\boldsymbol{y}}(t){\mathrm{,}}{\text{ }}t = 1{\mathrm{,}}{\text{ }}2{\mathrm{,}}{\text{ }} \cdots {\mathrm{,}}{\text{ }}M} \right)$ depends only on the $K$ values of $ \hat{\boldsymbol{x}}(t) = {\left[ {{{\hat{\boldsymbol{x}}}_1}(t){\mathrm{,}}{\text{ }}{{\hat{\boldsymbol{x}}}_2}(t){\mathrm{,}}{\text{ }} \cdots {\mathrm{,}}{\text{ }}{{\hat{\boldsymbol{x}}}_K}(t)} \right]^{\text{T}}} $ and $ \hat{\boldsymbol{g}}(t)=\left[\text{conj} \left(\hat{\boldsymbol{g}}_1^{\text{T}}(t)\right)\mathrm{,}\text{ conj} \left(\hat{\boldsymbol{g}}_2^{\text{T}}(t)\right)\mathrm{,}\text{ }\cdots\mathrm{,}\text{ conj} \left(\hat{\boldsymbol{g}}_K^{\text{T}}(t)\right)\right]^{\text{T}} $. Therefore, the subproblem (5b) can be divided into $M$ smaller problems, each solving $\hat{\boldsymbol{g}}(t)^*$:

    $ \hat{\boldsymbol{g}}(t)^* = {\mathrm{arg}}\, \mathop {\mathrm{min}} \limits_{\hat{\boldsymbol{g}}(t)} \left\{ {\frac{1}{2}\left\| {\hat{\boldsymbol{y}}(t) - {{\hat{\boldsymbol{x}}}^{\text{T}}}(t)\hat{\boldsymbol{g}}(t)} \right\|_2^2} + {{\hat{\boldsymbol{\xi}}}^{\text{T}}}(t)\left( {\hat{\boldsymbol{g}}(t) - \hat{\boldsymbol{h}}(t)} \right) + \frac{\mu }{2} {\left\| {\hat{\boldsymbol{g}}(t) - \hat{\boldsymbol{h}}(t)} \right\|_2^2} \right\} $ (6)

    where $\hat{\boldsymbol{h}}(t) = \left[ {{{\hat{\boldsymbol{h}}}_1}(t){\mathrm{,}}{\text{ }}{{\hat{\boldsymbol{h}}}_2}(t){\mathrm{,}}{\text{ }} \cdots {\mathrm{,}}{\text{ }}{{\hat{\boldsymbol{h}}}_K}(t)} \right]$ and $ {\hat{\boldsymbol{h}}_k}(t) = \sqrt D {\boldsymbol{F}}{{\boldsymbol{P}}^{\rm T}}{{\boldsymbol{h}}_k}(t) $. By solving (6), $\hat{\boldsymbol{g}}(t)^*$ is obtained as

    $ \hat{\boldsymbol{g}}(t)^* = {\left( {\hat{\boldsymbol{x}}(t){{\hat{\boldsymbol{x}}}^{\text{T}}}(t) + M\mu {{\boldsymbol{I}}_K}} \right)^{ - 1}}\left( {\hat{\boldsymbol{y}}(t)\hat{\boldsymbol{x}}(t) - M\hat{\boldsymbol{\xi}}(t) + M\mu \hat{\boldsymbol{h}}(t)} \right) $ (7)

    Equation (7) has the complexity of $O(M{K^3})$, indicating that it is still intractable for real-time tracking. Here the Sherman-Morrison formula [30] is used to perform a rapid computation of ${(\hat{\boldsymbol{x}}(t){\hat{\boldsymbol{x}}^{\text{T}}}(t) + M\mu {{\boldsymbol{I}}_K})^{ - 1}}$, stating that ${({\boldsymbol{u}}{{\boldsymbol{v}}^{\text{T}}} + {\boldsymbol{A}})^{ - 1}} = {{\boldsymbol{A}}^{ - 1}} - \dfrac{{{{\boldsymbol{A}}^{ - 1}}{\boldsymbol{u}}{{\boldsymbol{v}}^{\text{T}}}{{\boldsymbol{A}}^{ - 1}}}}{{1 + {{\boldsymbol{v}}^{\text{T}}}{{\boldsymbol{A}}^{ - 1}}{\boldsymbol{u}}}}$, where ${\boldsymbol{A}} = M\mu {{\boldsymbol{I}}_K}$ and ${\boldsymbol{u}} = {\boldsymbol{v}} = \hat{\boldsymbol{x}}(t)$. Thus, we have

    $ \hat{\boldsymbol{g}}(t)^* = \frac{1}{\mu }\left( {M\hat{\boldsymbol{y}}(t)\hat{\boldsymbol{x}}(t) - \hat{\boldsymbol{\xi}}(t) + \mu \hat{\boldsymbol{h}}(t)} \right) - \frac{{\hat{\boldsymbol{x}}(t)}}{{\mu \left( {{{\hat{\boldsymbol{x}}}^{\text{T}}}(t)\hat{\boldsymbol{x}}(t) + M\mu } \right)}}\left( {M\hat{\boldsymbol{y}}(t){{\hat{\boldsymbol{x}}}^{\text{T}}}(t)\hat{\boldsymbol{x}}(t) - {{\hat{\boldsymbol{x}}}^{\text{T}}}(t)\hat{\boldsymbol{\xi}}(t) + \mu {{\hat{\boldsymbol{x}}}^{\text{T}}}(t)\hat{\boldsymbol{h}}(t)} \right) $ (8)

    The complexity of (8) is reduced to $O(MK)$, which means that real-time tracking is accessible.
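    The sketch below is a direct, vectorized transcription of (8) in numpy, assuming that the per-pixel inner products ${{\hat{\boldsymbol{x}}}^{\text{T}}}(t)\hat{\boldsymbol{x}}(t)$, ${{\hat{\boldsymbol{x}}}^{\text{T}}}(t)\hat{\boldsymbol{\xi}}(t)$, and ${{\hat{\boldsymbol{x}}}^{\text{T}}}(t)\hat{\boldsymbol{h}}(t)$ are conjugate inner products over the K channels; variable names are illustrative.

```python
import numpy as np

def solve_g_hat(x_hat, y_hat, xi_hat, h_hat, mu):
    """Per-pixel g-update of (8) obtained via the Sherman-Morrison identity, O(MK).

    x_hat, xi_hat, h_hat : (K, M) complex arrays, Fourier-domain features, multiplier, filter
    y_hat                : (M,) complex array, Fourier-domain desired response
    Returns g_hat of shape (K, M); all M pixels are handled at once by broadcasting.
    """
    sx = np.sum(np.conj(x_hat) * x_hat, axis=0).real   # x_hat(t)^T x_hat(t) for every pixel t
    s_xi = np.sum(np.conj(x_hat) * xi_hat, axis=0)     # x_hat(t)^T xi_hat(t)
    s_h = np.sum(np.conj(x_hat) * h_hat, axis=0)       # x_hat(t)^T h_hat(t)
    M = x_hat.shape[1]
    term1 = (M * y_hat * x_hat - xi_hat + mu * h_hat) / mu
    term2 = x_hat / (mu * (sx + M * mu)) * (M * y_hat * sx - s_xi + mu * s_h)
    return term1 - term2
```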

    The Lagrangian multiplier is updated as follows:

    $ {\hat{\boldsymbol{\xi}}_{i + 1}} = {\hat{\boldsymbol{\xi}}_i} + \mu (\hat{\boldsymbol{g}}_{i + 1}^* - \hat{\boldsymbol{h}}_{i + 1}^*) $ (9)

    where $\hat{\boldsymbol{h}}_{i + 1}^* = \sqrt M \left( {{\boldsymbol{F}}{{\boldsymbol{P}}^{\text{T}}} \otimes {{\boldsymbol{I}}_K}} \right){\boldsymbol{h}}_{i + 1}^*$, and ${\boldsymbol{h}}_{i + 1}^*$ and $\hat{\boldsymbol{g}}_{i + 1}^*$ are the current solutions to the above subproblems (5a) and (5b), respectively, at iteration $ i + 1 $ within the iterative ADMM.

    The penalty factor $ \mu $ is updated by $ {\mu _{i + 1}} = {\mathrm{min}} ({\mu _{{\mathrm{max}} }}{\mathrm{,}}\;\rho {\mu _i}) $, where $ \rho $ is the scale factor.
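    For clarity, the pieces above can be assembled into the following skeleton of the two-step ADMM loop; it reuses the solve_h() and solve_g_hat() sketches given earlier, and all names, initializations, and the zero-padding convention are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def tbacf_admm(x_hat, y_hat, h_tc, crop, n_iters=2,
               mu=1.0, mu_max=1000.0, rho=10.0, lam=0.01, gamma=1250.0):
    """Alternate subproblems (5b) and (5a), then apply the Lagrangian update (9)
    and the penalty update of mu. Relies on solve_h() and solve_g_hat() above."""
    K, M = x_hat.shape
    h = np.zeros((K, len(crop)))
    xi_hat = np.zeros((K, M), dtype=complex)
    for _ in range(n_iters):
        # h_hat = sqrt(M) (F P^T kron I_K) h: zero-pad each channel to length M, then FFT
        h_pad = np.zeros((K, M))
        h_pad[:, crop] = h
        h_hat = np.fft.fft(h_pad, axis=1)
        g_hat = solve_g_hat(x_hat, y_hat, xi_hat, h_hat, mu)     # subproblem (5b)
        h = solve_h(g_hat, xi_hat, h_tc, crop, mu, lam, gamma)   # subproblem (5a)
        h_pad = np.zeros((K, M))
        h_pad[:, crop] = h                                       # refresh h_hat with the new h
        h_hat = np.fft.fft(h_pad, axis=1)
        xi_hat = xi_hat + mu * (g_hat - h_hat)                   # Lagrangian update (9)
        mu = min(mu_max, rho * mu)                               # penalty update
    return h
```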

    3.4 PSPCE-based update strategy

    In large-margin object tracking with circulant feature maps (LMCF) [31], the average peak-to-correlation energy (APCE) coefficient is usually utilized to discriminate whether the current target is reliable. However, the APCE coefficient only considers the maximum fluctuation (the difference between the maximum and minimum values) in the response map, which leads to tracking failure when the target is occluded, because the APCE coefficient treats the occlusion as the “real” target, causing CF to focus on the misjudged target, i.e., the occlusion.

    To eliminate the negative effect of such ambiguous targets, we introduce the concept of side lobes from the radar field. There are usually two or more lobes in an antenna pattern. Among them, the one with the highest radiation intensity is called the main lobe, while the remaining ones are called side lobes. The width of the main lobe indicates the extent to which the energy radiation is concentrated, so the width of the side lobes should be as small as possible, that is, the energy should be focused on the correct target. As is well known, the peak side lobe ratio (PSLR) represents the ratio of the side-lobe value to the maximum value of the main lobe in the signal field, so it can be utilized to overcome this limitation of ambiguous targets. As a result, we propose an update strategy for temporal-confidence samples based on PSPCE, which is defined as

    $ {\text{PSPCE}} = \frac{{\left\| {{F_{{\rm{max}} }} - {F_{{\rm{min}} }}} \right\|_2^2}}{{\left\| {{F_{{\text{side}}}} - {F_{{\rm{min}} }}} \right\|_2^2}} \cdot \frac{{\left\| {{F_{{\rm{max}} }} - {F_{{\rm{min}} }}} \right\|_2^2}}{{\displaystyle\sum\limits_{i \in W{\mathrm{,}}\;j \in H} {{{\left( {{F_{i{\mathrm{,}}j}} - {F_{{\rm{min}} }}} \right)}^2}/\left( {W \cdot H} \right)} }} $ (10)

    where $ W $ and $ H $ denote the width and height of the computational region, respectively; $ {F_{{\rm{max}} }} $, $ {F_{{\rm{min}} }} $, ${F_{{\text{side}}}}$, and $ {F_{i{\mathrm{,}}j}} $ refer to the maximum, minimum, side-lobe, and pixel-wise values of the response map, respectively, as shown in Fig. 1. PSPCE indicates the fluctuation degree of the response map and the confidence level of the currently detected target. Under normal circumstances, the response map should show a smooth and standard Gaussian response, and the value of PSPCE should be large. However, when the appearance of the target changes significantly, the response map fluctuates drastically, and PSPCE becomes small. Even though the PSPCE value of each frame must be calculated to determine whether the model and the temporal regularization term should be updated, the overall time complexity is not increased, because the complexity of (10) is lower than that of ADMM. However, another deficiency arises: when other objects have features similar to those of the tracking target, PSPCE may be misjudged. Thus, we propose a PSPCE-based update strategy: The PSPCE value extracted from the current frame response map is defined as PSPCE1; then, once PSPCE1 is larger than the predefined threshold ${T_{{\text{threshold}}}}$, the grayscale values of the 4-pixel area around the corresponding point of $ {F_{{\rm{max}} }} $ are set to zero, and the recalculated value is recorded as PSPCE2; consequently, if PSPCE1/PSPCE2 is larger than another predefined threshold ${T_{{\text{ratio}}}}$, the tracking result in the current frame can be considered high confidence. It should be noted that the recalculation of PSPCE1/PSPCE2 is to prevent mistaken updates induced by bimodal response maps arising from background clutter and other conditions.
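    A minimal numpy sketch of the two-step PSPCE check is given below. The way the side-lobe value is picked and the size of the suppressed neighborhood around the peak are illustrative assumptions (the text only specifies a "4-pixel area"), and the small epsilons guard against division by zero; the threshold values are design parameters.

```python
import numpy as np

def pspce(resp):
    """PSPCE of a 2-D response map, following (10): peak-to-side-lobe ratio
    multiplied by the peak-to-correlation-energy ratio."""
    f_max, f_min = resp.max(), resp.min()
    # Side-lobe value: largest response outside a small window around the main peak
    # (illustrative choice of how F_side is located).
    i, j = np.unravel_index(resp.argmax(), resp.shape)
    masked = resp.copy()
    masked[max(i - 2, 0):i + 3, max(j - 2, 0):j + 3] = f_min
    f_side = masked.max()
    energy = np.mean((resp - f_min) ** 2)
    pslr = (f_max - f_min) ** 2 / ((f_side - f_min) ** 2 + 1e-12)
    pce = (f_max - f_min) ** 2 / (energy + 1e-12)
    return pslr * pce

def is_high_confidence(resp, t_threshold, t_ratio):
    """Two-step PSPCE1/PSPCE2 check described in the text."""
    pspce1 = pspce(resp)
    if pspce1 <= t_threshold:
        return False
    suppressed = resp.copy()
    i, j = np.unravel_index(resp.argmax(), resp.shape)
    suppressed[max(i - 2, 0):i + 3, max(j - 2, 0):j + 3] = 0.0  # zero the area around F_max
    pspce2 = pspce(suppressed)
    return pspce1 / (pspce2 + 1e-12) > t_ratio
```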

    Figure 1.Tracking results of TBACF in different frames: (a) normal frame and (d) corresponding frame with a significant change in the appearance, where the green rectangle is the tracking result of TBACF; (b) and (e) response maps after the green rectangles in (a) and (d) passing through the filter; (c) and (f) response maps after suppressing the maximum value and its surroundings.

    Similarly, when the tracking target is occluded by something with features similar to those of the target, misjudgment in both PSPCE1 and PSPCE2 may arise. To solve this problem, a queue ${{\boldsymbol{h}}_{{\text{que}}}} = \left[ {h_1^{}{\mathrm{,}}{\text{ }}h_2^{}{\mathrm{,}}{\text{ }} \cdots {\mathrm{,}}{\text{ }}h_n^{}} \right]$ is designed to store the last $ n $ temporal-confidence samples with the corresponding weight vector $ {{\boldsymbol{w}}_{{\text{que}}}} = \left[ {w_1^{}{\mathrm{,}}{\text{ }}w_2^{}{\mathrm{,}}{\text{ }} \cdots {\mathrm{,}}{\text{ }}w_n^{}} \right] $ and $ \sum\limits_{i = 1}^n {w_i^{}} = 1$. The temporal samples stored in ${{\boldsymbol{h}}_{{\text{que}}}}$ are temporal-confidence samples. If the current frame is judged to be high confidence by PSPCE, then $ {\boldsymbol{h}} $ in (5a) of the current frame is a temporal-confidence sample, namely, ${\boldsymbol{h}}$ is selected on the time axis, and ${{\boldsymbol{h}}_{{\text{que}}}}$ is updated in a first-in-first-out manner. The temporal-confidence sample is updated by $ {{\boldsymbol{h}}_{{\text{tc}}}} = {\text{sum}}\left( {{{\boldsymbol{w}}_{{\text{que}}}} \odot {{\boldsymbol{h}}_{{\text{que}}}}} \right) $ with $ \odot $ denoting the element-by-element multiplication. The computational flow of TBACF is shown in Algorithm 1. As shown in Fig. 2(a), when the appearance of the tracking target changes significantly, both PSPCE1 and PSPCE2 are at low levels. At this time, the temporal-confidence sample will not be updated. Fig. 2(b) exhibits the influence caused by the easily confused target in the blue bounding box during the tracking process. When the tracking target in the red bounding box is hidden, the tracker can still produce a large response value, and the temporal-confidence sample will not be updated in this case. In these two situations, the tracker easily concentrates on the wrong target, while the proposed update strategy can resist the model drift caused by this challenge.
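    The queue-based update of the temporal-confidence sample can be sketched as follows. The class name, the renormalization of weights when the queue is not yet full, and the alignment of the last weight with the newest sample are illustrative assumptions.

```python
from collections import deque
import numpy as np

class TemporalConfidenceQueue:
    """FIFO queue h_que of the last n high-confidence filters with weights w_que
    (summing to 1); h_tc is their weighted element-wise combination."""

    def __init__(self, weights):
        self.w = np.asarray(weights, dtype=float)
        self.w /= self.w.sum()
        self.que = deque(maxlen=len(weights))   # first-in-first-out storage

    def push(self, h):
        # Called only when the PSPCE check marks the current frame as high confidence.
        self.que.append(np.array(h, copy=True))

    def h_tc(self):
        if not self.que:
            return None                          # nothing stored yet
        samples = list(self.que)
        w = self.w[-len(samples):]               # newest sample gets the last weight (assumption)
        w = w / w.sum()
        return sum(wi * hi for wi, hi in zip(w, samples))
```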

    Figure 2.Temporal-confidence sample updating: (a) PSPCE variation and (b) ratio of PSPCE1 and PSPCE2 during the tracking process.

    Algorithm 1. Computational flow of TBACF.

    When ADMM is completed, every frame passes through the update strategy to obtain the result. The calculation cost introduced by the update strategy is fixed, and its time complexity is $ O(WH) $. Notably, the update strategy lies outside the ADMM iterations. As a result, compared with ADMM, the time complexity of the update strategy is negligible.

    4 Experimental results and discussion

    The simulation is conducted using MATLAB R2018a with an Intel i7-8700 CPU and 16 GB RAM. For object representation, a 31-channel HOG feature is exclusively employed, computed over 4×4 cell grids and modulated by Hann windows. The system is configured with 5 distinct scales by employing a scaling factor of 1.01 between successive scales. The learning rate is set to 0.0013, and a two-step iteration is used for ADMM. The regularization parameters are empirically chosen as $ \lambda = 0.01 $ and $ \gamma = 1250 $, respectively. The penalty factor $ \mu $ of ADMM is initially set to 1 and updated by $ {\mu _{i + 1}} = {\rm{min}} \left( {{\mu _{{\rm{max}} }}{\mathrm{,}}{\text{ }}\rho {\mu _i}} \right) $, where $ \rho = 10 $ and $ {\mu _{{\rm{max}} }} = 1000 $. In this section, to demonstrate the effectiveness, our method is evaluated on several datasets, including temple color 128 (TColor-128) [32], UAV123 [33], visual object tracking challenge 2016 (VOT2016) [34], large-scale single object tracking (LaSOT) [35], and OTB-100 [36].
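    For reference, the hyper-parameters reported above can be collected as follows; the dictionary and its key names are purely illustrative and do not reflect the authors' code.

```python
# Hyper-parameters as reported in this section (key names are illustrative).
TBACF_PARAMS = dict(
    hog_channels=31, cell_size=4,         # 31-channel HOG over 4x4 cells, Hann-windowed
    num_scales=5, scale_step=1.01,        # scale pyramid
    learning_rate=0.0013,                 # model learning rate
    admm_iterations=2,                    # two-step ADMM
    lam=0.01, gamma=1250.0,               # regularization weights in (2)
    mu_init=1.0, mu_max=1000.0, rho=10.0  # ADMM penalty schedule
)
```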

    4.1 OTB-100

    The OTB-100 benchmark is a mainstream public tracking dataset with 11 attributes, including but not limited to deformation, rotation, occlusion, background clutter, and illumination variation. The trackers are evaluated by the one-pass evaluation (OPE) protocol proposed in Ref. [36], where the overlap precision (OP) metric is used to quantify the proportion of bounding-box overlaps in the sequence that exceed a customized threshold. As a comparison, the SOTA methods are considered, including convolutional neural network-support vector machine (CNN-SVM) [33], HCF [22], hedged deep tracking (HDT) [37], DeepSRDCF [13], efficient convolution operator (ECO) [38], deep learning methods (such as fully-convolutional Siamese networks for object tracking (SiamFC) [39] and Siamese region proposal network (SiamRPN) [40]), and non-deep-learning methods (such as multi-expert entropy minimization (MEEM) [41], BACF [14], LCT [27], Staple [19], STRCF [16], KCF [12], SRDCF [13], SAMF [18], SRDCFdecon [17], and DSST [20]).

    The experimental results of success rate and precision under the OPE protocol are shown in Fig. 3. Obviously, TBACF is superior to most of the existing methods in terms of both success rate and precision. Especially when compared with the non-deep-learning methods, TBACF exhibits the best performance. In addition, the success rate of TBACF is also higher than those of the majority of other competitors, being 0.8% higher than ECO, 1.2% higher than SiamRPN, 3.3% higher than STRCF, 4.5% higher than BACF, 6.1% higher than LCT, and 25.0% higher than DSST. It is also worth noting that an obvious advantage in precision has been achieved by TBACF. Although it is slightly inferior to ECO, its precision is higher than those of HCF, HDT, and DSST. Especially for DSST, the enhancement is as high as 19.1%. These results demonstrate that TBACF is highly competitive among the SOTA methods.

    Figure 3.Comparison of TBACF with the SOTA trackers on OTB-100.

    For further comparison, another two metrics (AUC and FPS) are adopted, where AUC denotes the area under the curve of the success plot, and FPS indicates the frames per second, i.e., the number of image frames that can be processed per second, which is an indicator of the video processing speed. As shown in Table 1, a compelling AUC result is realized by our proposed TBACF because this value is larger than those of almost all SOTA methods except ECO. With regard to the computational efficiency, TBACF operates at 30.7 FPS. It is markedly faster than SRDCFdecon and LCT, and also comparable to BACF and DeepSRDCF. Although its speed is slower than that of SiamRPN, the precision of our proposed TBACF is higher. This indicates that TBACF can achieve a balance between precision and speed, making it especially attractive in practical applications.

    Method            AUC (%)   FPS
    ECO [38]          69.0      10.3
    STRCF [16]        67.6      21.1
    SiamRPN [40]      66.6      89.3
    BACF [14]         64.9      35.6
    SRDCFdecon [17]   63.9      2.5
    LCT [27]          63.8      23.6
    DeepSRDCF [13]    63.3      1.2
    TBACF             68.0      30.7

    Table 1. Results of top-8 trackers on OTB-100.

    To demonstrate its robustness, the success rates of TBACF and the above-mentioned SOTA methods are also investigated on OTB-100 with the following attributes: in-plane rotation, out-of-plane rotation, deformation, background clutter, and illumination variation, to mimic the circumstances when the target appearance is significantly changed. The obtained results are shown in Fig. 4. It is obvious that TBACF is able to successfully track the target with a success rate of >78.5% under the illumination variation situation. This rate is even higher under other situations. Especially for out-of-plane rotation, deformation, and background clutter, success rates as high as 88.5%, 88.2%, and 91.1% are realized by our proposed TBACF, respectively. This indicates that TBACF has an excellent capacity to cope with complex environmental changes and to resist substantial background interference. Under the in-plane rotation situation, although its performance is slightly inferior to that of SiamRPN, the success rate is still as high as 84.9%. All the results obtained with OTB-100 demonstrate the potential of TBACF for tracking applications in complex scenes.

    Figure 4.Performance evaluation and comparison on OTB-100 with attributes: (a) and (b) in-plane rotation, (c) and (d) out-of-plane rotation, and (e) and (f) deformation.

    4.2 TColor-128

    TColor-128 is a benchmark for evaluating how color information affects the performance of the tracker. It contains a total of 128 color video sequences. As a comparison, the SOTA methods are considered, including ECO [38], the continuous convolution operator tracker (CCOT) [17], the hand-crafted feature version of ECO (ECO-HC) [38], STRCF [16], BACF [14], and DSST [20]. The results shown in Fig. 5 reveal that TBACF has an enhanced performance on TColor-128 compared with DSST and BACF. Especially for DSST, the enhancement is remarkable, with a 21.1% higher success rate and a 21.3% higher precision. Although the deviation between TBACF and BACF is smaller, the improvements are still as high as 7.1% and 10.4%, respectively. Moreover, TBACF is also comparable to ECO-HC with negligible deviations in terms of both success rate and precision. These experimental results indicate that our proposed TBACF can make good use of color features. However, the performance of TBACF is slightly inferior to those of ECO and CCOT, being 5.1% and 1.8% lower in success rate and 5.0% and 3.3% lower in precision, respectively. This is attributed to the richer feature representations of ECO and CCOT. ECO and CCOT employ more complex feature fusion strategies, which enable them to better capture the multi-scale and multi-directional features of the target. Moreover, more advanced optimization strategies are adopted during the optimization process of ECO and CCOT, which can adapt to changes in the target appearance at a faster speed, thereby enhancing tracking precision. Therefore, it is possible to further enhance TBACF by integrating multiple types of features to strengthen the target representation capability and thus improve its success rate and precision.

    Figure 5.Performance evaluation and comparison on TColor-128.

    4.3 UAV123

    The UAV123 benchmark contains 123 fully-annotated high-definition videos captured by professional-grade drones, featuring viewing-angle changes, smaller targets, etc. Therefore, to evaluate the performance of TBACF in resisting occlusion and adapting to rapidly changing angles, simulations are conducted on UAV123 and compared with ECO [38], the multi-cue correlation filter based tracker (MCCT) [42], CCOT [17], STRCF [16], BACF [14], SRDCF [13], SRDCFdecon [17], the hierarchical convolutional feature tracker (CF2) [22], Staple [19], SAMF [18], DSST [20], and KCF [12]. The calculated results in Fig. 6 show that TBACF performs well on UAV123, although ECO, MCCT, and CCOT exhibit a slight superiority. For example, CCOT is 1.0% and 1.8% higher in success rate and precision, respectively. However, the enhancement of TBACF with respect to KCF and DSST is remarkable. Therefore, it can be concluded that TBACF is highly competitive among the existing methods.

    Figure 6.Performance evaluation and comparison on UAV123.

    4.4 LaSOT

    The LaSOT dataset [35] is a recently proposed large-scale database for tracking. It contains 1400 sequences with a total length of over 3.5M frames (over 2500 frames on average). To validate the feasibility of TBACF in tracking applications, the following methods are evaluated on LaSOT and compared, including mixture density network (MDNel) [43], visual tracking via adversarial learning (VITAL) [44], SiamFC [39], learning temporal-spatial consistency correlation filter (TSCF) [45], structured Siamese network (StructSiam) [46], SiamRPN [40], ECO [38], Siamese instance search tracker (SINT) [47], STRCF [16], ECO-HC [38], tracker based on context-aware deep feature compression with multiple auto-encoders (TRACA) [48], cascade fusion network (CFNet) [49], BACF [14], SRDCF [13], parallel tracking and verifying (PTAV) [50], Staple [19], MEEM [41], hierarchical convolutional features (HCFT) [51], SAMF [18], LCT [27], DSST [20], tracking-learning-detection (TLD) [52], novel attentional feature-based correlation filter (SCT4) [53], KCF [12], color names (CN) [54], CSK [11], etc. It can be observed from Fig. 7 that TBACF outperforms the majority of the above methods, especially the CF-based trackers. For example, in comparison with CFNet [49], the success rate and precision are improved by 8.2% and 4.3%, respectively.

    Figure 7.Performance evaluation and comparison on LaSOT.

    4.5 VOT2016

    The VOT2016 dataset [34] contains 60 challenging sequences. To substantiate the efficacy of TBACF in addressing the challenges posed by multiple targets, intricate scenarios, and varied motion dynamics, TBACF is evaluated on this dataset and compared with several representative trackers: Staple [19], ECO [38], the tree-structure convolutional neural network (TCNN) [55], BACF [14], SRDCF [13], SRDCFdecon [17], and STRCF [16]. The obtained results are shown in Table 2. Obviously, TBACF achieves the highest precision of 57% and relatively high results of the expected average overlap (EAO) at 30% and robustness at 36%. Its overall performance is significantly better than that of BACF because the temporal regularization adopted in our TBACF is especially advantageous under adverse conditions. With respect to ECO and TCNN, both EAO and robustness of our TBACF are slightly inferior; however, its precision is superior. This means that TBACF has the potential to eliminate the compromise among EAO, precision, and robustness seen in existing trackers and to achieve a balanced performance.

    Method            EAO (%)   Precision (%)   Robustness (%)
    ECO [38]          34        54              27
    Staple [19]       29        54              38
    TCNN [55]         32        55              30
    SRDCF [13]        25        52              44
    BACF [14]         22        56              48
    SRDCFdecon [17]   26        53              41
    STRCF [16]        27        53              38
    TBACF             30        57              36

    Table 2. Results of top-8 trackers on VOT2016.

    Based on the above results obtained on various datasets, it can be concluded that our TBACF proposed in this paper still occupies a leading position among the current representative trackers, even if some methods show a better performance on certain datasets.

    5 Ablation study

    5.1 Effectiveness of different components

    The ablation experiments are conducted on OTB-100 to verify the contribution of critical components in our TBACF. The results in Fig. 8(a) show that the “baseline”, which refers to the original BACF with neither the update strategy (UP) nor the temporal regularization term (TR), achieves an AUC value of 83.6%. As desired, both “baseline+TR”, which is optimized with TR based on the temporal-confidence sample, and “baseline+TR+UP” (TBACF proposed in this paper) exhibit an enhanced performance with increased AUC values of 86.7% and 88.1%, respectively. These results further demonstrate that the improvement of the update strategy and the introduction of the temporal regularization term in our method play a critical role in this performance enhancement.

    Figure 8.Results of ablation analysis on OTB-100: (a) baseline and (b) modified methods.

    5.2 Effectiveness of different baselines

    The temporal regularization term and update strategy are also embedded into SRDCF [13] as plug-ins, giving TSRDCF, to verify the generalizability of our proposed method. As can be seen from Fig. 8(b), TSRDCF and TBACF, modified with the proposed method in this paper, outperform their respective baselines, SRDCF and BACF. Compared with the 83.6% AUC of BACF and the 78.2% AUC of SRDCF, the values of TBACF and TSRDCF increase by 4.5% and 2.1% and reach 88.1% and 80.3%, respectively. This means that the introduction of the temporal regularization term and the improvement of the update strategy proposed in this paper are effective and meanwhile universally applicable to improving the tracking performance of existing methods.

    6 Conclusions

    This paper proposes a temporal regularization term to address the model drift problem caused by significant changes in target appearance. By introducing a temporal-confidence sample into the temporal regularization term, the filter can avoid focusing on erroneous targets, such as occlusions, during the update process, ensuring the correct update of the filter parameters. An improved update strategy is further proposed to ensure the high confidence of temporal-confidence samples. As a result, the temporal regularization term and update strategy are incorporated into the objective function of the baseline BACF, thus contributing to a more robust TBACF model which is especially desirable when the target appearance changes significantly. Its effectiveness is verified on OTB-100, TColor-128, UAV123, VOT2016, and LaSOT. The experimental results on the OTB-100 dataset demonstrate that our model is robust even under the situations of background clutter, deformation, occlusion, rotation, and illumination variation. The superiority and versatility of the proposed method are verified with ablation studies by applying the temporal regularization term and the update strategy as plug-ins in different baselines.

    Despite these advances, our method also suffers from some limitations, such as complex computations mainly originating from the ADMM optimization, reliance on manual adjustment of parameters, and the difficulty that the model can hardly learn parameter changes autonomously. In the future, further evaluations on various other datasets will be performed to verify our method’s superior performance in handling occlusions in complex backgrounds as well as its robustness. We also plan to integrate our method with deep learning by leveraging its excellent feature extraction capabilities and generalization performance to further improve the method’s stability and robustness.

    Disclosures

    The authors declare no conflicts of interest.

    References

    [1] Li P.-X., Wang D., Wang L.-J., Lu H.-C.. Deep visual tracking: review and experimental comparison. Pattern Recogn., 76, 323-338(2018).

    [2] Ding S.-H., Zhai Q., Li Y., Zhu J.-D., Zheng Y.-F., Xuan D.. Simultaneous body part and motion identification for human-following robots. Pattern Recogn., 50, 118-130(2016).

    [3] Wu B.-Y., Hu B.-G., Ji Q.. A coupled hidden Markov random field model for simultaneous face clustering and tracking in videos. Pattern Recogn., 64, 361-373(2017).

    [4] Li M.-H., Peng L.-B., Chen Y.-P., Huang S.-Q., Qin F.-Y., Peng Z.-M.. Mask sparse representation based on semantic features for thermal infrared target tracking. Remote Sens., 11, 1967(2019).

    [5] Li M.-H., Peng L.-B., Wu T.-F., Peng Z.-M.. A bottom-up and top-down integration framework for online object tracking. IEEE T. Multimedia, 23, 105-119(2020).

    [6] Dai M.-N., Xiao G., Cheng S.-Y., Wang D.-D., He X.-J.. Structural correlation filters combined with a Gaussian particle filter for hierarchical visual tracking. Neurocomputing, 398, 235-246(2020).

    [7] Arulalan V., Premanand V., Kumar D.. Object detection and tracking using TSM-EFFICIENTDET and JS-KM in adverse weather conditions. J. Intell. Fuzzy Syst., 46, 2399-2413(2024).

    [8] Zhang J.-M., Jin X.-K., Sun J., Wang J., Li K.-Q.. Dual model learning combined with multiple feature selection for accurate visual tracking. IEEE Access, 7, 43956-43969(2019).

    [9] Z.Y. Huang, C.H. Fu, Y.M. Li, F.L. Lin, P. Lu, Learning aberrance repressed correlation filters for real-time UAV tracking, in: Proc. of the IEEE/CVF Intl. Conf. on Computer Vision, Seoul, Republic of Korea, 2019, pp. 2891–2900.

    [10] D.S. Bolme, J.R. Beveridge, B.A. Draper, Y.M. Lui, Visual object tracking using adaptive correlation filters, in: Proc. of IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, San Francisco, USA, 2010, pp. 2544–2550.

    [11] J.F. Henriques, R. Caseiro, P. Martins, J. Batista, Exploiting the circulant structure of tracking-by-detection with kernels, in: Proc. of the 12th European Conf. on Computer Vision, Florence, Italy, 2012, pp. 702–715.

    [12] Henriques J.F., Caseiro R., Martins P., Batista J.. High-speed tracking with kernelized correlation filters. IEEE T. Pattern Anal., 37, 583-596(2015).

    [13] M. Danelljan, G. Häger, F.S. Khan, M. Felsberg, Learning spatially regularized correlation filters for visual tracking, in: Proc. of the IEEE Intl. Conf. on Computer Vision, Santiago, Chile, 2015, pp. 4310–4318.

    [14] H.K. Galoogahi, A. Fagg, S. Lucey, Learning background-aware correlation filters for visual tracking, in: Proc. of the IEEE Intl. Conf. on Computer Vision, Venice, Italy, 2017, pp. 1144–1152.

    [15] H.K. Galoogahi, T. Sim, S. Lucey, Correlation filters with limited boundaries, in: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Boston, USA, 2015, pp. 4603–4638.

    [16] F. Li, C. Tian, W.M. Zuo, L. Zhang, M.H. Yang, Learning spatial-temporal regularized correlation filters for visual tracking, in: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018, pp. 4904–4913.

    [17] M. Danelljan, A. Robinson, F.S. Khan, M. Felsberg, Beyond correlation filters: learning continuous convolution operators for visual tracking, in: Proc. of the 14th European Conf. on Computer Vision, Amsterdam, The Netherlands, 2016, pp. 472–488.

    [18] Y. Li, J.K. Zhu, A scale adaptive kernel correlation filter tracker with feature integration, in: Proc. of the European Conf. on Computer Vision, Zurich, Switzerland, 2015, pp. 254–265.

    [19] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, P.H.S. Torr, Staple: complementary learners for real-time tracking, in: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016, pp. 1401–1409.

    [20] M. Danelljan, G. Häger, F.S. Khan, M. Felsberg, Accurate scale estimation for robust visual tracking, in: Proc. of the British Machine Vision Conf., Nottingham, UK, 2014, pp. 1–11.

    [21] A. Lukezic, T. Vojir, L.C. Zajc, J. Matas, M. Kristan, Discriminative correlation filter with channel and spatial reliability, in: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Honolulu, USA, 2017, pp. 4847–4856.

    [22] C. Ma, J.B. Huang, X.K. Yang, M.H. Yang, Hierarchical convolutional features for visual tracking, in: Proc. of the IEEE Intl. Conf. on Computer Vision, Santiago, Chile, 2015, pp. 3074–3082.

    [23] J.R. Cai, M.Z. Xu, W. Li, et al., MeMOT: multi-object tracking with memory, in: Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, New Orleans, USA, 2022, pp. 8090–8100.

    [24] Z. Qin, S.P. Zhou, L. Wang, J.H. Duan, G. Hua, W. Tang, MotionTrack: learning robust short-term and long-term motions for multi-object tracking, in: Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023, pp. 17939–17948.

    [25] B. Van Hoorick, P. Tokmakov, S. Stent, J. Li, C. Vondrick, Tracking through containers and occluders in the wild, in: Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023, pp. 13802–13812.

    [26] H. Ren, S.D. Han, H.L. Ding, Z.W. Zhang, H.W. Wang, F.Q. Wang, Focus on details: online multi-object tracking with diverse fine-grained representation, in: Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023, pp. 11289–11298.

    [27] C. Ma, X.K. Yang, C.Y. Zhang, M.H. Yang, Long-term correlation tracking, in: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Boston, USA, 2015, pp. 5388–5396.

    [28] Crammer K., Dekel O., Keshet J., Shalev-Shwartz S., Singer Y., Warmuth M.K.. Online passive-aggressive algorithms. J. Mach. Learn. Res., 7, 551-585(2006).

    [29] Boyd S., Parikh N., Chu E., Peleato B., Eckstein J.. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Le., 3, 1-122(2011).

    [30] Sherman J., Morrison W.J.. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. Ann. Math. Stat., 21, 124-127(1950).

    [31] M.M. Wang, Y. Liu, Z.Y. Huang, Large margin object tracking with circulant feature maps, in: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Honolulu, USA, 2017, pp. 4800–4808.

    [32] Liang P., Blasch E., Ling H.. Encoding color information for visual tracking: algorithms and benchmark. IEEE T. Image Process., 24, 5630-5644(2015).

    [33] M. Mueller, N. Smith, B. Ghanem, A benchmark and simulator for UAV tracking, in: Proc. of the European Conf. on Computer Vision, Amsterdam, The Netherlands, 2016, pp. 1–14.

    [34] Kristan M., Matas J., Leonardis A. et al. A novel performance evaluation methodology for single-target trackers. IEEE T. Pattern Anal., 38, 2137-2155(2016).

    [35] H. Fan, L.T. Lin, F. Yang, et al., LaSOT: a high-quality benchmark for large-scale single object tracking, in: Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, Long Beach, USA, 2019, pp. 5369–5378.

    [36] Y. Wu, J. Lim, M.H. Yang, Online object tracking: a benchmark, in: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Portland, USA, 2013, pp. 2411–2418.

    [37] Y.K. Qi, S.P. Zhang, L. Qin, et al., Hedged deep tracking, in: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016, pp. 4303–4311.

    [38] M. Danelljan, G. Bhat, F.S. Khan, M. Felsberg, ECO: efficient convolution operators for tracking, in: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Honolulu, USA, 2017, pp. 6931–6939.

    [39] L. Bertinetto, J. Valmadre, J.F. Henriques, A. Vedaldi, P.H.S. Torr, Fully-convolutional Siamese networks for object tracking, in: Proc. of the European Conf. on Computer Vision, Amsterdam, The Netherlands, 2016, pp. 850–965.

    [40] B. Li, J.J. Yan, W. Wu, Z. Zhu, X.L. Hu, High performance visual tracking with Siamese region proposal network, in: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018, pp. 8971–8980.

    [41] J.M. Zhang, S.G. Ma, S. Sclaroff, MEEM: robust tracking via multiple experts using entropy minimization, in: Proc. of the 13th European Conf. on Computer Vision, Zurich, Switzerland, 2014, pp. 188–203.

    [42] N. Wang, W. Zhou, Q. Tian, et al., Multi-cue correlation filters for robust visual tracking, in: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018, pp. 4844–4853.

    [43] J. Karoliny, B. Etzlinger, A. Springer, Mixture density networks for WSN localization, in: Proc. of the IEEE Intl. Conf. on Communications Workshops, Dublin, Ireland, 2020, pp. 1–5.

    [44] Y. Song, C. Ma, X. Wu, et al., VITAL: visual tracking via adversarial learning, in: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018, pp. 8990–8999.

    [45] Zhu J., Wang D., Lu H.. Visual tracking by learning spatiotemporal consistency in correlation filters. Sci. China Inf. Sci., 50, 128-150(2020).

    [46] Y. Zhang, L. Wang, J. Qi, et al., Structured Siamese network for real-time visual tracking, in: Proc. of the European Conf. on Computer Vision, Munich, Germany, 2018, pp. 351–366.

    [47] D. Held, S. Thrun, S. Savarese, Learning to track at 100 fps with deep regression networks, in: Proc. of the 14th European Conf. on Computer Vision, Amsterdam, The Netherlands, 2016, pp. 749–765.

    [48] J. Choi, H.J. Chang, T. Fischer, et al., Context-aware deep feature compression for high-speed visual tracking, in: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018, pp. 479–488.

    [49] G. Zhang, Z. Li, J. Li, et al., CFNet: cascade fusion network for dense prediction, [Online]. Available: https://arxiv.org/abs/2302.06052, February 2023.

    [50] H. Fan, H. Ling, Parallel tracking and verifying: a framework for real-time and high accuracy visual tracking, in: Proc. of the IEEE Intl. Conf. on Computer Vision, Venice, Italy, 2017, pp. 5486–5494.

    [51] C. Ma, J.B. Huang, X. Yang, et al., Hierarchical convolutional features for visual tracking, in: Proc. of the IEEE Intl. Conf. on Computer Vision, Santiago, Chile, 2015, pp. 3074–3082.

    [52] Kalal Z., Mikolajczyk K., Matas J.. Tracking-learning-detection. IEEE T. Pattern Anal. Mach. Intell., 34, 1409-1422(2012).

    [53] J. Choi, H.J. Chang, J. Jeong, et al., Visual tracking using attention-modulated disintegration and integration, in: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016, pp. 4321–4330.

    [54] M. Danelljan, F.S. Khan, M. Felsberg, et al., Adaptive color attributes for real-time visual tracking, in: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Columbus, USA, 2014, pp. 1090–1097.

    [55] H. Nam, M. Baek, B. Han, Modeling and propagating CNNs in a tree structure for visual tracking, [Online]. Available: https://arxiv.org/abs/1608.07242, August 2016.
