• Chinese Journal of Lasers
  • Vol. 48, Issue 21, 2110001 (2021)
Xiaoyu Zhou1, Ling Wang1,*, Yanxin Ma2, and Peibo Chen1
Author Affiliations
  • 1College of Electronic Science, National University of Defense Technology, Changsha, Hunan 410073, China
  • 2College of Meteorology and Oceanography, National University of Defense Technology, Changsha, Hunan 410073, China
    DOI: 10.3788/CJL202148.2110001
    Xiaoyu Zhou, Ling Wang, Yanxin Ma, Peibo Chen. Single Object Tracking of LiDAR Point Cloud Combined with Auxiliary Deep Neural Network[J]. Chinese Journal of Lasers, 2021, 48(21): 2110001

    Abstract

    Objective Point clouds are an ideal data format for three-dimensional (3D) scene tasks such as object classification, detection, segmentation, reconstruction, and tracking. For single object tracking in particular, point clouds outperform two-dimensional (2D) images or video sequences for two reasons. First, a point cloud better describes the 3D geometric information of an object in real scenes, such as its position, scale, and pose. Second, unlike the passive optical imaging principle of cameras, light detection and ranging (LiDAR) collects information through active imaging and is therefore largely unaffected by natural light conditions. A point cloud thus adapts to conditions involving visual degradation or varying illumination and is robust to glare, reflections, and shadows. Accordingly, single object tracking of 3D point clouds is a topic worth investigating. In general, a single object tracking task uses the information in a given initial frame to determine the tracked object and then predicts the object's bounding box in each subsequent frame. However, existing single object trackers for LiDAR point clouds perform poorly on sparsely distributed and small-scale point cloud objects. This is mainly attributed to the downscaling operation applied to the features extracted from the point cloud, which underuses the object's structural information and prevents the tracker from predicting accurate bounding boxes for such objects.

    Methods To address this problem, a single object tracking network combined with an auxiliary deep neural network is proposed herein. During the training stage, we attach a modified auxiliary network to the backbone network to accomplish two auxiliary tasks: 1) foreground point cloud segmentation, which guides the backbone network to focus on pointwise semantic information, and 2) pointwise center coordinate offset regression, which makes the features aware of the internal structure of the object. The two tasks are jointly supervised with the backbone network, so the semantic and structural cues are naturally embedded in the object features extracted by the backbone network. During the inference stage, however, the auxiliary network is bypassed: the trained backbone network is already structure aware, and detaching the auxiliary network avoids extra computational cost, which is essential for retaining the real-time performance of the tracker. Moreover, we notice that the latest works share the same dataset organization: the number of input points in the search area point cloud and the template point cloud is fixed, irrespective of the class of the point cloud data. However, as the KITTI dataset shows, the point clouds of some classes are dense and comprise a large number of points, while those of other classes suffer from scarce points, providing insufficient and limited object information. A fixed number of input points may therefore be unsuitable for all data classes. Hence, we propose setting a different input quantity for each class during both the training and inference stages, which is accomplished without changing the network structure. The structures of the different network modules are shown in Figs. 1--6.
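As an illustration of how the two auxiliary objectives could be supervised jointly during training, the following sketch computes a per-point foreground segmentation loss plus a center-offset regression loss restricted to foreground points. This is a simplified sketch, not the authors' implementation: the function name, the smooth-L1 choice for the regression term, and the weights `w_seg` and `w_off` are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def auxiliary_loss(fg_logits, fg_labels, offset_pred, center_gt, points,
                   w_seg=1.0, w_off=1.0):
    """Joint auxiliary loss: foreground segmentation (binary cross-entropy)
    plus pointwise center-offset regression (smooth L1, foreground only)."""
    # 1) Foreground segmentation: per-point binary cross-entropy, which
    #    pushes the backbone toward pointwise semantic awareness.
    p = sigmoid(fg_logits)
    eps = 1e-7
    seg_loss = -np.mean(fg_labels * np.log(p + eps)
                        + (1.0 - fg_labels) * np.log(1.0 - p + eps))
    # 2) Center-offset regression: each foreground point predicts the offset
    #    from itself to the object center (structural supervision).
    offset_gt = center_gt[None, :] - points          # (N, 3) target offsets
    diff = np.abs(offset_pred - offset_gt)
    smooth_l1 = np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)
    fg = fg_labels.astype(bool)
    off_loss = smooth_l1[fg].mean() if fg.any() else 0.0
    return w_seg * seg_loss + w_off * off_loss
```

In a training loop, this term would be added to the tracking loss; at inference both heads are simply not evaluated, so the sketch also makes clear why detaching the auxiliary network costs nothing at test time.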

    Results and Discussions Both qualitative and quantitative experiments are conducted to demonstrate the superiority of our propositions. Table 1 compares our proposed tracker with two previous trackers; our tracker achieves better results in three of the four data classes and a higher mean performance than both. Test results on the KITTI dataset show that our network increases the average tracking success by 0.89 percent and the average tracking accuracy by 2.51 percent under the same parameter settings as the existing work. Representative tracking results for the four data classes are depicted in Figs. 7--10, which show that our tracker predicts bounding boxes close to the ground truth. Furthermore, we present results of the tasks processed by the auxiliary network: Figs. 11--14 show the foreground segmentation results, in which the auxiliary network accurately separates the surface of the object from the background points. Table 2 details the tracking performance for different numbers of points in the search area and template point clouds, demonstrating the effectiveness of choosing a suitable input quantity for each data class. Compared with the proposed algorithm under fixed input quantities, the average tracking success increases by 4.54 percent and the average tracking accuracy by 7.83 percent. In particular, for the Cyclist class, the average tracking success increases by 2.38 percent and the average tracking accuracy by 2.86 percent; for the Pedestrian class, the average tracking success increases by 8.47 percent and the average tracking accuracy by 13.71 percent. These findings imply that our tracker improves the tracking performance on sparse and small-scale objects.
Finally, Table 3 and Table 4 show that we succeed in maintaining a balance between performance and computational costs.
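For context on how such success and accuracy numbers are typically obtained in single object tracking evaluation, the per-frame quantities are an overlap score (3D IoU between the predicted and ground-truth bounding boxes) and a center-distance error. The sketch below is a simplified, axis-aligned version; the abstract does not spell out the evaluation protocol, the oriented-box handling used on KITTI is omitted, and the `(x, y, z, w, h, l)` box layout is an assumption.

```python
import numpy as np

def axis_aligned_iou(box_a, box_b):
    """3D IoU of two axis-aligned boxes given as (center x, y, z, size w, h, l).
    Simplification: real KITTI boxes are oriented; heading is ignored here."""
    ca, sa = box_a[:3], box_a[3:]
    cb, sb = box_b[:3], box_b[3:]
    lo = np.maximum(ca - sa / 2.0, cb - sb / 2.0)   # intersection min corner
    hi = np.minimum(ca + sa / 2.0, cb + sb / 2.0)   # intersection max corner
    inter = np.prod(np.clip(hi - lo, 0.0, None))    # zero if boxes are disjoint
    union = np.prod(sa) + np.prod(sb) - inter
    return inter / union

def center_error(box_a, box_b):
    """Euclidean distance between the two box centers."""
    return float(np.linalg.norm(box_a[:3] - box_b[:3]))
```

Averaging these per-frame values over a sequence (or their area-under-curve over thresholds) yields summary success and accuracy scores of the kind reported above.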

    Conclusions In summary, the proposed method achieves the expected improvements in tracking sparsely distributed and small-scale point cloud objects and can be applied to other tasks. Inspired by our experimental results, we will seek further improvement through new data augmentation approaches and by extracting more useful cues from the background information for search area updates.
