Chinese Journal of Lasers, Vol. 50, Issue 10, 1010003 (2023)
Jie Hu1,2,3,*, Yongpeng An1,2,3, Wencai Xu1,2,3, Zongquan Xiong1,2,3, and Han Liu1,2,3
Author Affiliations
  • 1Hubei Key Laboratory of Advanced Technology for Automotive Components, Wuhan University of Technology, Wuhan 430070, Hubei, China
  • 2Hubei Collaborative Innovation Center for Automotive Components Technology, Wuhan University of Technology, Wuhan 430070, Hubei, China
  • 3Hubei Research Center for New Energy & Intelligent Connected Vehicle, Wuhan University of Technology, Wuhan 430070, Hubei, China
    DOI: 10.3788/CJL220811
    Jie Hu, Yongpeng An, Wencai Xu, Zongquan Xiong, Han Liu. 3D Object Detection Based on Deep Semantics and Position Information Fusion of Laser Point Cloud[J]. Chinese Journal of Lasers, 2023, 50(10): 1010003

    Abstract

    Objective

    Precise perception of the surrounding environment is the basis for realizing the various functions of autonomous driving, and accurate identification of the 3D locations of targets in real scenes is key to improving its overall performance. Lidar has become pivotal in this field because it senses richer 3D spatial information and is less affected by weather and other environmental factors. Current 3D target detection methods are mainly based on deep learning, which achieves higher detection accuracy than traditional clustering and segmentation algorithms. The key to deep-learning-based target detection is the in-depth extraction and utilization of point-cloud feature information. If this feature information is not fully exploited, targets may be misdetected or missed (Fig. 1), which significantly compromises the safety of autonomous driving functions. Therefore, deep extraction and utilization of point-cloud information are key to improving the accuracy of 3D target detection.

    Methods

    This study proposes a two-stage 3D target detection network (DSPF-RCNN, Fig. 1). In the first stage, the unordered raw point cloud is divided into a regular voxel space, and point-wise features are converted into voxel-wise features using a convolutional neural network. The down-sampled output of the last layer is flattened into a 2D bird's-eye view (BEV), which is fed into the deep feature extraction-region proposal network (DFE-RPN, Fig. 2) for deep extraction of 2D features. Through the fusion of shallow-layer texture features and deep-layer semantic features, the network's ability to capture 2D image features is enhanced. In the second stage, a subset of points is selected as center points from the last two 3D down-sampled voxel spaces via farthest point sampling, and these center points are input into the aware-point semantics and position feature fusion (ASPF) module (Fig. 3), which integrates the 3D semantic features and position information of the surrounding points. With this fused representation, the center points have a stronger feature aggregation capability, allowing the network to adaptively extract more diverse target features when aggregating neighboring points. The center points are then used to aggregate the features of the surrounding points in the 3D voxel space (Fig. 4), after which region-of-interest pooling is performed on the aggregated features and the target candidate boxes generated in the first stage. Finally, refined classification and bounding-box regression are carried out through fully connected layers.
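
    The sketch below illustrates the general idea behind the second-stage fusion step described above: center points selected by farthest point sampling aggregate neighboring per-point semantic features together with relative position information through learned, adaptive weights. This is a minimal illustration under assumed dimensions; it is not the authors' released implementation, and the names and parameters (farthest_point_sampling, SemanticPositionFusion, feat_dim, k, the softmax weighting) are hypothetical choices made for clarity.

```python
# Illustrative sketch (not the authors' code): adaptive fusion of per-point
# semantic features and relative position information around sampled center
# points, in the spirit of the ASPF module described above.
import torch
import torch.nn as nn


def farthest_point_sampling(xyz: torch.Tensor, n_samples: int) -> torch.Tensor:
    """Select n_samples indices from xyz (N, 3) that are mutually far apart."""
    n = xyz.shape[0]
    selected = torch.zeros(n_samples, dtype=torch.long)
    dist = torch.full((n,), float("inf"))
    farthest = torch.randint(0, n, (1,)).item()
    for i in range(n_samples):
        selected[i] = farthest
        d = ((xyz - xyz[farthest]) ** 2).sum(dim=1)
        dist = torch.minimum(dist, d)          # distance to nearest selected point
        farthest = int(torch.argmax(dist))     # next center: farthest remaining point
    return selected


class SemanticPositionFusion(nn.Module):
    """Fuse neighbor semantic features with relative positions at each center point."""

    def __init__(self, feat_dim: int = 64, out_dim: int = 128, k: int = 16):
        super().__init__()
        self.k = k
        # MLP over [semantic feature, relative offset, distance] per neighbor.
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 4, out_dim), nn.ReLU(),
            nn.Linear(out_dim, out_dim), nn.ReLU(),
        )
        # Per-neighbor weights let each center attend to its most informative neighbors.
        self.weight_net = nn.Linear(out_dim, 1)

    def forward(self, xyz, feats, center_idx):
        centers = xyz[center_idx]                        # (M, 3)
        d2 = torch.cdist(centers, xyz)                   # (M, N) pairwise distances
        knn = d2.topk(self.k, largest=False).indices     # (M, k) nearest neighbors
        nbr_xyz, nbr_feat = xyz[knn], feats[knn]         # (M, k, 3), (M, k, feat_dim)
        rel = nbr_xyz - centers.unsqueeze(1)             # relative offsets to center
        dist = rel.norm(dim=-1, keepdim=True)            # neighbor distances
        h = self.mlp(torch.cat([nbr_feat, rel, dist], dim=-1))   # (M, k, out_dim)
        w = torch.softmax(self.weight_net(h), dim=1)     # adaptive per-neighbor weights
        return (w * h).sum(dim=1)                        # (M, out_dim) fused features


if __name__ == "__main__":
    pts = torch.randn(2048, 3)             # toy point cloud
    sem = torch.randn(2048, 64)            # toy per-point semantic features
    idx = farthest_point_sampling(pts, 256)
    fused = SemanticPositionFusion()(pts, sem, idx)
    print(fused.shape)                     # torch.Size([256, 128])
```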

    Discussions

    DSPF-RCNN is tested and evaluated on the official KITTI test and validation sets. On the test set (Table 1), the detection results for the Car category surpass those of existing mainstream algorithms, with detection accuracies of 89.90%, 81.04%, and 76.45% at the three difficulty levels. On the KITTI validation set (Table 2), at 11 recall positions, the detection accuracy at the moderate level for Car and Cyclist is improved by approximately 4% compared with the SVGA-Net and Part-A2 networks. DSPF-RCNN accurately detects all three target categories (Fig. 5). The effectiveness of the proposed modules is further compared and analyzed (Table 5). The results show that, after the 3D semantic and position features of the surrounding points are integrated, the center points aggregate the feature information of their neighborhoods more effectively during the feature aggregation stage. When the DFE-RPN module is added, the network's feature-capturing ability increases further, and the extraction of small-target features, such as those of cyclists and pedestrians, improves significantly. The network's runtime is then analyzed, including the time consumed by each module when inferring on one frame of point-cloud data (Table 6). A comparison between DSPF-RCNN and other two-stage algorithms (Table 7) shows that the total inference time of DSPF-RCNN is 64 ms, which is advantageous among two-stage algorithms. Finally, the algorithm is deployed on a real vehicle platform to realize online detection (Fig. 7).

    Conclusions

    In this study, a two-stage target detection algorithm based on laser point clouds, DSPF-RCNN, is proposed. In the first stage, the proposed DFE-RPN module extracts rich target feature information from 2D images. In the second stage, the proposed ASPF module enables the center points to aggregate the salient features of different targets. Testing on the KITTI test and validation sets and comparison with mainstream methods show that DSPF-RCNN is more effective at accurately detecting targets of different sizes, including small targets. At the moderate level of the KITTI validation set, the detection accuracies for Car and Cyclist are improved by approximately 4%, and the total network inference time is 64 ms. Finally, DSPF-RCNN is applied to a local dataset to verify its engineering value.
