Fig. 1. Network architecture
Fig. 2. Initial module
Fig. 3. Feature fusion module
Fig. 4. Mix down-sampling module
Fig. 5. Prediction modules. (a) Plain prediction module; (b) dense prediction module
Fig. 6. Qualitative detection results

| Initial block | MAP /% | Speed /(frame·s⁻¹) |
|---|---|---|
| 7×7-s2 | 79.9 | 85 |
| 3×3 conv-s1, 3×3 conv-s1, 3×3 conv-s2 | 80.7 | 85 |
| 3×3 conv-s1, 3×3 conv-s2, 3×3 conv-s1 | 80.6 | 84 |
| 3×3 conv-s2, 3×3 conv-s1, 3×3 conv-s1 | 81.0 | 85 |

Table 1. Comparison of different initial modules

| Feature fusion method | MAP /% | Speed /(frame·s⁻¹) |
|---|---|---|
| Without fusion | 80.4 | 91 |
| Sum | 79.8 | 96 |
| Concatenation + 1×1 conv | 81.0 | 85 |

Table 2. Comparison of different feature fusion methods

| Down-sampling module | MAP /% | Speed /(frame·s⁻¹) |
|---|---|---|
| 3×3 conv-s2 | 79.8 | 81 |
| 2×2 max pool-s2 | 79.8 | 84 |
| Mix down-sampling | 81.0 | 85 |

Table 3. Comparison of different down-sampling modules

| Prediction module | MAP /% | Speed /(frame·s⁻¹) |
|---|---|---|
| Plain prediction module | 80.1 | 89 |
| Dense prediction module | 81.0 | 85 |

Table 4. Comparison of different prediction modules

| Backbone network | Depth | Pre-train | SSD MAP /% | SSD speed /(frame·s⁻¹) | DSOD MAP /% | DSOD speed /(frame·s⁻¹) | RFBNet MAP /% | RFBNet speed /(frame·s⁻¹) |
|---|---|---|---|---|---|---|---|---|
| VGG | 16 | √ | 77.5 | 130 | 78.1 | 79 | 78.9 | 81 |
| VGGBN | 16 | × | 79.5 | 95 | 79.5 | 89 | 79.9 | 71 |
| ResNet | 101 | × | 76.0 | 42 | 75.5 | 42 | 77.1 | 38 |
| DenseNet | 121 | × | 74.6 | 37 | 75.1 | 32 | 75.3 | 29 |
| DS/64-192-48-1 | 67 | × | 78.5 | 51 | 78.8 | 47 | 79.4 | 42 |
| Root-ResNet-34 | 34 | × | 80.2 | 79 | 80.6 | 75 | 81.3 | 61 |
| DNet | 25 | × | 80.1 | 89 | 81.0 | 85 | 80.5 | 65 |

Table 5. Detection results of different backbone networks in SSD, DSOD, and RFBNet models

| Method | Pre-train | Backbone network | Input size /(pixel×pixel) | MAP /% | Speed /(frame·s⁻¹) |
|---|---|---|---|---|---|
| SSD[11] | √ | VGG-16 | 300×300 | 77.2 | 46 |
| SSD* | √ | VGG-16 | 300×300 | 77.7 | 130 |
| YOLOv2[26] | √ | DarkNet-19 | 544×544 | 78.6 | 81 |
| RFBNet[25] | √ | VGG-16 | 300×300 | 80.5 | 83 |
| DSSD[27] | √ | ResNet-101 | 300×300 | 78.6 | 8 |
| Faster R-CNN[8] | √ | ResNet-101 | ~1000×600 | 76.4 | 2.4 |
| RFCN[28] | √ | ResNet-101 | ~1000×600 | 80.5 | 9 |
| DSOD[19] | × | DS/64-192-48-1 | 300×300 | 77.7 | 17.4 |
| ScratchDet[20] | × | Root-ResNet-34 | 300×300 | 80.4 | 17.8 |
| Proposed | × | DNet | 300×300 | 81.0 | 85 |

Table 6. Detection results of different detectors on the PASCAL VOC dataset