FAANet: feature-aligned attention network for real-time multiple object tracking in UAV videos

Zhenqi Liang; Jingshi Wang; Gang Xiao; Liu Zeng

doi:10.3788/COL202220.081101


Chinese Optics Letters
Vol. 20, Issue 8, 081101 (2022)


Zhenqi Liang¹, Jingshi Wang¹˒², Gang Xiao¹˒*, and Liu Zeng¹

Author Affiliations

¹School of Aeronautics and Astronautics, Shanghai Jiao Tong University, Shanghai 200240, China

²Jiangsu Automation Research Institute, Lianyungang 222061, China



Zhenqi Liang, Jingshi Wang, Gang Xiao, Liu Zeng. FAANet: feature-aligned attention network for real-time multiple object tracking in UAV videos[J]. Chinese Optics Letters, 2022, 20(8): 081101


Abstract

Multiple object tracking (MOT) in unmanned aerial vehicle (UAV) videos has attracted attention. Because of the UAV's observation perspective, objects are relatively small and their scale changes dramatically. Moreover, most MOT algorithms for UAV videos cannot run in real time because they follow the tracking-by-detection paradigm. We propose a feature-aligned attention network (FAANet), which mainly consists of a channel and spatial attention module and a feature-aligned aggregation module. We also improve real-time performance by adopting the joint-detection-embedding paradigm and a structural re-parameterization technique. Extensive experiments on the UAV detection and tracking benchmark validate the approach, which achieves a new state of the art of 44.0 MOTA and 64.6 IDF1 at 38.24 frames per second on a single 1080Ti graphics processing unit.

Keywords

deep learning; feature alignment; multiple object tracking; unmanned aerial vehicle

$$w_{\mathrm{channel}} = \sigma(\mathrm{Conv1d}(g(F))), \tag{1}$$

$$F_{\mathrm{channel}} = w_{\mathrm{channel}} \odot \mathrm{Conv2d}(F) + \mathrm{Conv2d}(F), \tag{2}$$
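Equations (1) and (2) can be illustrated with a minimal pure-Python sketch. The page does not define the operators, so the following are assumptions: g is taken to be global average pooling, σ the sigmoid, Conv1d a zero-padded 1D convolution across the channel axis, and Conv2d is left as a pluggable callable that defaults to the identity. The function names are hypothetical and are not taken from the authors' code.

```python
import math

def global_avg_pool(feat):
    """Assumed g: feat is C x H x W (nested lists); returns C per-channel means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feat]

def conv1d_same(vec, kernel):
    """Assumed Conv1d: cross-correlation along channels, zero padding, odd kernel."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(vec) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(vec))]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(feat, kernel, conv2d=lambda f: f):
    """Eq. (1): w = sigmoid(Conv1d(g(F))).
    Eq. (2): F_channel = w * Conv2d(F) + Conv2d(F).
    conv2d is a placeholder (identity by default), not the paper's layer."""
    weights = [sigmoid(v) for v in conv1d_same(global_avg_pool(feat), kernel)]
    projected = conv2d(feat)
    f_channel = [[[w * x + x for x in row] for row in ch]
                 for w, ch in zip(weights, projected)]
    return weights, f_channel
```

Because each weight lies in (0, 1), the residual form of Eq. (2) scales every channel by a factor between 1 and 2 rather than suppressing it outright.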

$$F_{1} = \mathrm{softmax}(f_{1}(g(\mathrm{Conv2d}(F_{\mathrm{channel}})))), \tag{3}$$

$$F_{2} = f_{2}(\mathrm{Conv2d}(F_{\mathrm{channel}})), \tag{4}$$

$$w_{\mathrm{spatial}} = \sigma(f_{3}(F_{1} \otimes F_{2})), \tag{5}$$

$$F_{\mathrm{spatial}} = w_{\mathrm{spatial}} \odot F_{\mathrm{channel}}, \tag{6}$$

$$F_{\mathrm{output}} = F_{\mathrm{channel}} + F_{\mathrm{spatial}}. \tag{7}$$
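One possible reading of Eqs. (3) to (7) is sketched below in pure Python. Several details are assumptions, since the page does not define them: both Conv2d layers are modeled as 1×1 convolutions given by hypothetical weight matrices `w_q` and `w_v`, g is global average pooling, f₁, f₂, and f₃ are treated as reshapes, the softmax runs across channels, and ⊗ is read as a matrix product of a 1×C′ vector with a C′×HW matrix.

```python
import math

def conv1x1(feat, weight):
    """Pointwise convolution: feat is C x H x W, weight is C_out x C."""
    h, w = len(feat[0]), len(feat[0][0])
    return [[[sum(wk[c] * feat[c][i][j] for c in range(len(feat)))
              for j in range(w)] for i in range(h)] for wk in weight]

def softmax(vec):
    m = max(vec)
    exps = [math.exp(v - m) for v in vec]
    total = sum(exps)
    return [e / total for e in exps]

def spatial_attention(f_channel, w_q, w_v):
    h, w = len(f_channel[0]), len(f_channel[0][0])
    q = conv1x1(f_channel, w_q)
    # Eq. (3): pool each channel to a scalar, then softmax across channels.
    f1 = softmax([sum(map(sum, ch)) / (h * w) for ch in q])
    # Eq. (4): C' x H x W tensor, read as a C' x (H*W) matrix.
    f2 = conv1x1(f_channel, w_v)
    # Eq. (5): (1 x C') times (C' x HW), reshaped to H x W, then sigmoid.
    w_spatial = [[1.0 / (1.0 + math.exp(-sum(a * ch[i][j] for a, ch in zip(f1, f2))))
                  for j in range(w)] for i in range(h)]
    # Eqs. (6)-(7): F_output = F_channel + w_spatial * F_channel.
    f_out = [[[x + w_spatial[i][j] * x for j, x in enumerate(row)]
              for i, row in enumerate(ch)] for ch in f_channel]
    return w_spatial, f_out
```

Under this reading, the spatial weight map has one value per pixel and is shared by all channels when it is applied in Eq. (6).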

$$F_{\mathrm{concat}} = \mathrm{Concat}(\mathrm{upsample}(F_{\mathrm{low}}), F_{\mathrm{high}}), \tag{8}$$

$$\Delta_{\mathrm{low}} = \mathrm{Conv2d}_{3 \times 3}(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv2d}_{1 \times 1}(F_{\mathrm{concat}})))), \tag{9}$$

$$\Delta_{\mathrm{high}} = \mathrm{Conv2d}_{3 \times 3}(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv2d}_{1 \times 1}(F_{\mathrm{concat}})))), \tag{10}$$

$$F_{\mathrm{output}} = f(\mathrm{upsample}(F_{\mathrm{low}}), \Delta_{\mathrm{low}}) + f(F_{\mathrm{high}}, \Delta_{\mathrm{high}}), \tag{11}$$

$$A_{h,w} = \sum_{h'=1}^{H} \sum_{w'=1}^{W} F_{h',w'} \cdot \max(0, 1 - |h + \Delta_{1hw} - h'|) \cdot \max(0, 1 - |w + \Delta_{2hw} - w'|), \tag{12}$$
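Equation (12) is a bilinear sampling kernel: each output position reads the feature map at a location shifted by a learned offset, with weights that fall off linearly to zero at a distance of one pixel. The sketch below evaluates the sum directly for a single-channel map. It assumes that Δ₁ is the vertical offset and Δ₂ the horizontal offset, uses 0-based indices instead of the 1-based indices in the equation, and the name `warp_bilinear` is hypothetical.

```python
def warp_bilinear(feat, dy, dx):
    """Direct evaluation of Eq. (12) for one channel.
    feat, dy, dx are H x W nested lists; dy and dx hold the assumed
    vertical and horizontal offsets for each output position."""
    H, W = len(feat), len(feat[0])
    out = [[0.0] * W for _ in range(H)]
    for h in range(H):
        for w in range(W):
            y, x = h + dy[h][w], w + dx[h][w]  # shifted sampling location
            acc = 0.0
            for hp in range(H):
                ky = max(0.0, 1.0 - abs(y - hp))
                if ky == 0.0:
                    continue  # rows farther than one pixel contribute nothing
                for wp in range(W):
                    kx = max(0.0, 1.0 - abs(x - wp))
                    acc += feat[hp][wp] * ky * kx
            out[h][w] = acc
    return out
```

With zero offsets the map is returned unchanged. Because the sum only covers positions inside the map, a sampling location that falls partly outside it receives no contribution from the missing neighbors, which behaves like zero padding.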
