• Chinese Journal of Lasers
  • Vol. 48, Issue 17, 1710004 (2021)
Liang Zhao1,2,3,4, Jie Hu1,2,3,4,*, Han Liu1,2,3,4, Yongpeng An1,2,3,4, Zongquan Xiong1,2,3,4, and Yu Wang1,2,3,4
Author Affiliations
  • 1School of Automotive Engineering, Wuhan University of Technology, Wuhan, Hubei 430070, China
  • 2Hubei Key Laboratory of Advanced Technology for Automotive Components, Wuhan University of Technology, Wuhan, Hubei 430070, China
  • 3Hubei Collaborative Innovation Center for Automotive Components Technology, Wuhan University of Technology, Wuhan, Hubei 430070, China
  • 4Hubei Research Center for New Energy & Intelligent Connected Vehicle, Wuhan University of Technology, Wuhan, Hubei 430070, China
    DOI: 10.3788/CJL202148.1710004
    Liang Zhao, Jie Hu, Han Liu, Yongpeng An, Zongquan Xiong, Yu Wang. Deep Learning Based on Semantic Segmentation for Three-Dimensional Object Detection from Point Clouds[J]. Chinese Journal of Lasers, 2021, 48(17): 1710004.

    Abstract

    Objective Low detection accuracy of the perception system in an autonomous vehicle seriously affects the reliability of the vehicle and the safety of its passengers. Traditional LiDAR-based three-dimensional (3D) object detection algorithms, such as rule-based clustering methods, rely heavily on hand-designed features that are likely sub-optimal. Following the great success of deep learning in the image field, a large body of literature has explored applying this technology to 3D LiDAR point clouds. Among these approaches, point-based methods directly use raw point clouds as the input of the detection model: the farthest point sampling (FPS) algorithm is applied to sample a set of keypoints from the raw point cloud, and each keypoint groups neighboring raw points to extract features for object detection. However, the proportion of foreground points (points inside a 3D bounding box) among keypoints collected through the FPS algorithm is relatively low; for remote objects in particular, foreground points are almost entirely lost during FPS (Fig. 1). Foreground points carry the important 3D spatial location information of objects, so a low proportion of foreground points among keypoints hurts detection accuracy. To this end, we propose a semantic-segmentation-based two-stage 3D object detection algorithm named Seg-RCNN (segmentation based region-convolutional neural network), in which we propose a novel farthest point sampling strategy (SegFPS) for sampling keypoints and a segmentation network (SegNet) for semantic segmentation of foreground points and background points (Fig. 4).
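    The FPS procedure described above, and the foreground-loss effect it causes, can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the toy scene (950 scattered background points, 50 points packed into one compact "remote" object) is our own construction.

    ```python
    import numpy as np

    def farthest_point_sampling(points, k):
        """Classic FPS: repeatedly pick the point whose minimum Euclidean
        distance to the already-selected set is largest."""
        n = points.shape[0]
        selected = np.zeros(k, dtype=int)   # indices of sampled keypoints
        min_dist = np.full(n, np.inf)       # distance to nearest selected point
        selected[0] = 0                     # seed with an arbitrary point
        for i in range(1, k):
            d = np.linalg.norm(points - points[selected[i - 1]], axis=1)
            min_dist = np.minimum(min_dist, d)
            selected[i] = int(np.argmax(min_dist))
        return selected

    # Toy scene: FPS favors spatial coverage, so very few of the 64
    # keypoints land on the compact far-away object.
    rng = np.random.default_rng(0)
    background = rng.uniform(0, 70, size=(950, 3))
    foreground = rng.uniform(60, 61, size=(50, 3))  # compact remote object
    cloud = np.vstack([background, foreground])
    keypoints = farthest_point_sampling(cloud, 64)
    fg_fraction = np.mean(keypoints >= 950)  # share of keypoints on the object
    ```

    Because FPS maximizes spatial spread, once one point of the compact object is selected, its neighbors are heavily penalized, which is exactly the foreground-loss problem shown in Fig. 1.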

    Methods Seg-RCNN is a two-stage 3D object detector (Fig. 2). In the first stage, the raw points are voxelized into voxel-wise features, a sparse 3D CNN extracts voxel features, the output of the sparse 3D CNN is squeezed into a 2D CNN for further feature propagation, and box proposals are then generated on the 2D bird's-eye-view feature map through an anchor-based strategy. Meanwhile, SegNet outputs the foreground/background segmentation results of the point cloud. In the second stage, SegFPS samples the keypoints according to the segmentation results obtained from SegNet. Unlike conventional FPS, SegFPS uses both the segmentation classes (foreground and background points) and the Euclidean distance as sampling criteria, which improves the proportion of foreground points among the keypoints (Fig. 1) and in turn improves the detection accuracy by 2.90 percentage points on the KITTI dataset. Using keypoints to represent the whole point cloud not only reduces time and space complexity but also retains a certain proportion of foreground and background points. Multi-scale 3D voxel features from different layers are aggregated onto the set of keypoints through a PointNet backbone to obtain the keypoint-aggregated features (key-features), thereby achieving feature compression (grouping, as shown in Fig. 3, computes the relative distance between each keypoint and its neighboring raw points). A 2D CNN then further propagates the key-features. The box proposals are projected onto the key-feature map to extract the regions of interest, and finally the detection heads output the final perception results.
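    The abstract states only that SegFPS combines the segmentation classes with Euclidean distance as sampling criteria. One plausible reading, sketched below in NumPy, is to reserve a quota of keypoints for predicted foreground points and run distance-based FPS within each class; the `fg_ratio` parameter and the quota scheme are our illustrative assumptions, not the paper's exact rule.

    ```python
    import numpy as np

    def fps(points, k):
        """Plain distance-based farthest point sampling (helper)."""
        selected = np.zeros(k, dtype=int)
        min_dist = np.full(points.shape[0], np.inf)
        for i in range(1, k):
            d = np.linalg.norm(points - points[selected[i - 1]], axis=1)
            min_dist = np.minimum(min_dist, d)
            selected[i] = int(np.argmax(min_dist))
        return selected

    def seg_fps(points, seg_labels, k, fg_ratio=0.5):
        """Hypothetical SegFPS sketch: fill up to k*fg_ratio keypoints from
        points SegNet labels as foreground (1), the rest from background (0),
        applying FPS inside each class."""
        fg_idx = np.flatnonzero(seg_labels == 1)
        bg_idx = np.flatnonzero(seg_labels == 0)
        k_fg = min(len(fg_idx), int(round(k * fg_ratio)))
        k_bg = k - k_fg
        sel_fg = fg_idx[fps(points[fg_idx], k_fg)] if k_fg > 0 else np.empty(0, dtype=int)
        sel_bg = bg_idx[fps(points[bg_idx], k_bg)] if k_bg > 0 else np.empty(0, dtype=int)
        return np.concatenate([sel_fg, sel_bg])
    ```

    With, say, 4% foreground points in the raw cloud, such a quota lifts their share among the keypoints to roughly `fg_ratio`, mirroring the improvement illustrated in Fig. 1, while the within-class FPS still keeps the sampled points spatially spread.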

    Results and Discussions Extensive experiments on the KITTI dataset demonstrate the superior performance of our framework compared with previous mainstream methods; the detection accuracies of the car class at the moderate and easy levels are 79.73% and 89.16%, respectively (Table 2). The mean average precision (mAP) of Seg-RCNN on car objects at the easy, moderate, and hard levels increased by at least 3.22, 3.97, and 3.29 percentage points, respectively. There are two output strategies in SegNet; the experimental results suggest that SegNet 1 is better than SegNet 2 (Table 4). Adopting SegFPS to sample the keypoints indeed improves the detection accuracy, by 2.90 percentage points compared with FPS (Table 5). The annotation accuracy of the dataset affects the detection performance of the algorithm (Fig. 6), since a correct detection box will be judged as a false positive when the corresponding annotation is missing. Shape similarity between different objects also degrades performance: for example, tree trunks and telephone poles look very similar to pedestrians in point clouds, which decreases the classification accuracy and in turn the detection accuracy (Fig. 7). The runtime of the proposed method is 80 ms (Table 8). To further facilitate engineering applications, we achieve online real-time detection through the robot operating system (ROS), which has great value for engineering projects.
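    The false-positive effect of missing annotations follows directly from how detections are matched to ground truth: a predicted box counts as a true positive only if its intersection-over-union (IoU) with some annotated box exceeds a threshold (0.7 for KITTI cars). The sketch below uses simplified axis-aligned boxes for illustration; the actual KITTI evaluation uses oriented boxes, and the example boxes are invented.

    ```python
    import numpy as np

    def iou_3d(box_a, box_b):
        """IoU of two axis-aligned 3D boxes given as
        (x_min, y_min, z_min, x_max, y_max, z_max)."""
        lo = np.maximum(box_a[:3], box_b[:3])
        hi = np.minimum(box_a[3:], box_b[3:])
        inter = np.prod(np.clip(hi - lo, 0.0, None))  # overlap volume
        vol_a = np.prod(box_a[3:] - box_a[:3])
        vol_b = np.prod(box_b[3:] - box_b[:3])
        return float(inter / (vol_a + vol_b - inter))

    # A detection on an unannotated object overlaps no ground-truth box
    # with IoU >= 0.7, so it is scored as a false positive even if correct.
    detection = np.array([10.0, 2.0, 0.0, 14.0, 3.6, 1.5])
    ground_truth = [np.array([30.0, -1.0, 0.0, 34.0, 0.6, 1.5])]  # nearby car unannotated
    is_true_positive = any(iou_3d(detection, g) >= 0.7 for g in ground_truth)
    ```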

    Conclusions In this paper, we address the problem of the low proportion of foreground points among keypoints after FPS sampling and introduce Seg-RCNN, a novel 3D object detection algorithm with potential application value for autonomous-vehicle projects. Extensive experiments on the KITTI dataset suggest that our algorithm achieves higher detection accuracy than previous mainstream methods; specifically, the mAP of the car class at the easy, moderate, and hard levels increases by at least 3.22, 3.97, and 3.29 percentage points, respectively. The runtime of our algorithm is only 80 ms. Our results suggest that Seg-RCNN is an effective architecture for 3D object detection on point clouds.
