Fig. 1. Overall algorithm architecture
Fig. 2. Dual-mode encoder structure
Fig. 3. Gated fusion network structure
Fig. 4. P-R curves of the two models with different input modalities
Fig. 5. Detection results on the KAIST dataset
Fig. 6. Detection results on the GIR dataset
Algorithm | Resolution | AP0.5:0.95 | AP0.5 |
---|---|---|---|
Ours-n | 416×416 | 30.5 | 70 |
Ours-n | 512×512 | 32.5 | 73.1 |
Ours-n | 608×608 | 32.9 | 73.3 |
Ours-n | 640×640 | 33.3 | 73.8 |
Table 1. Detection performance of the n-model for different input image-pair sizes
Algorithm | Resolution | AP0.5:0.95 | AP0.5 |
---|---|---|---|
Ours-s | 416×416 | 31.1 | 71 |
Ours-s | 512×512 | 31.9 | 72.7 |
Ours-s | 608×608 | 34.3 | 73.9 |
Ours-s | 640×640 | 35.2 | 74.5 |
Table 2. Detection performance of the s-model for different input image-pair sizes
Method | Encoder-VS | Encoder-IR | Gated Fusion | Input | AP0.5:0.95 | AP0.5 | FPS |
---|---|---|---|---|---|---|---|
YOLOv5-n | | | | VS | 24.8 | 58.7 | 158.7 |
YOLOv5-n | | | | IR | 31.6 | 71 | 158.7 |
YOLOv5-n-EVS | √ | | | VS | 25 | 59.1 | 125 |
YOLOv5-n-EIR | | √ | | IR | 31.8 | 71.3 | 125 |
Ours-n | √ | √ | √ | VS+IR | 33.3 | 73.8 | 117.6 |
YOLOv5-s | | | | VS | 26.7 | 59.8 | 112.4 |
YOLOv5-s | | | | IR | 32 | 71.5 | 112.4 |
YOLOv5-s-EVS | √ | | | VS | 26.9 | 60.2 | 107.5 |
YOLOv5-s-EIR | | √ | | IR | 32.2 | 71.9 | 107.5 |
Ours-s | √ | √ | √ | VS+IR | 35.2 | 74.5 | 102 |
Table 3. Ablation results of different models on the KAIST dataset
Method | Encoder-VS | Encoder-IR | Gated Fusion | Input | AP0.5:0.95 | AP0.5 | FPS |
---|---|---|---|---|---|---|---|
YOLOv5-n | | | | VS | 48.4 | 88.8 | 158.7 |
YOLOv5-n | | | | IR | 36.3 | 75.5 | 158.7 |
YOLOv5-n-EVS | √ | | | VS | 49.4 | 89.1 | 105.3 |
YOLOv5-n-EIR | | √ | | IR | 36.4 | 76.3 | 105.3 |
Ours-n | √ | √ | √ | VS+IR | 49.7 | 89.8 | 101 |
YOLOv5-s | | | | VS | 51.4 | 89.9 | 111.1 |
YOLOv5-s | | | | IR | 36.6 | 76.8 | 111.1 |
YOLOv5-s-EVS | √ | | | VS | 51.9 | 90.1 | 91.7 |
YOLOv5-s-EIR | | √ | | IR | 36.7 | 77 | 91.7 |
Ours-s | √ | √ | √ | VS+IR | 52.2 | 90.5 | 85.5 |
Table 4. Ablation results of different models on the GIR dataset
Class | Ours-n | YOLOv5-n-VS | YOLOv5-n-IR | Ours-s | YOLOv5-s-VS | YOLOv5-s-IR |
---|---|---|---|---|---|---|
Person | 90.7 | 91.2 | 84.0 | 91.7 | 91.7 | 85.4 |
Dog | 99.5 | 99.5 | 99.5 | 99.5 | 99.5 | 91.6 |
Car | 95.4 | 95.2 | 94.3 | 95.8 | 95.1 | 94.7 |
Bicycle | 80.4 | 83.7 | 70.7 | 80.8 | 84.7 | 72.8 |
Plant | 85.6 | 84.5 | 79.1 | 86.4 | 87.0 | 76.0 |
Motorcycle | 82.8 | 82.0 | 76.1 | 83.9 | 82.4 | 77.7 |
Umbrella | 86.0 | 87.8 | 70.5 | 85.7 | 86.6 | 76.1 |
Kite | 93.6 | 82.9 | 64.6 | 94.4 | 89.2 | 67.6 |
Toy | 95.6 | 96.3 | 86.7 | 96.4 | 97.0 | 83.7 |
Ball | 88.5 | 84.7 | 29.5 | 90.7 | 85.5 | 42.1 |
Table 5. Per-class detection accuracy of the proposed algorithm and the baseline algorithms (AP0.5, %)
Input | Algorithm | Backbone | Resolution | AP0.5:0.95 | AP0.5 | FPS |
---|---|---|---|---|---|---|
VS | Faster R-CNN (2015) | ResNet-50 | 1000×600 | 24.2 | 58.3 | 15.2 |
VS | SSD (2016) | VGG-16 | 512×512 | 18.1 | 48.2 | 38.1 |
VS | RetinaNet (2017) | ResNet-50 | 1333×800 | 22.5 | 57.7 | 16.6 |
VS | YOLOv3 (2018) | DarkNet-53 | 416×416 | 18.3 | 46.7 | 56.2 |
VS | FCOS (2019) | ResNet-50 | 1333×800 | 22.7 | 56.7 | 18.3 |
VS | ATSS (2020) | ResNet-50 | 1333×800 | 24.3 | 57.8 | 17 |
VS | YOLOv4 (2020) | CSPDarkNet-53 | 416×416 | 23.7 | 57.4 | 55 |
VS | YOLOX-s (2021) | Modified CSP v5 | 416×416 | 27 | 61.1 | 48.4 |
VS | YOLOX-m (2021) | Modified CSP v5 | 416×416 | 27.7 | 61.8 | 40.3 |
VS | YOLOF (2021) | ResNet-50 | 1333×800 | 22.2 | 54.1 | 25.7 |
VS | YOLOv5-n (2020) | Modified CSP v5 | 640×640 | 24.8 | 58.7 | 158.7 |
VS | YOLOv5-s (2020) | Modified CSP v5 | 640×640 | 26.4 | 59.8 | 112.4 |
VS | YOLOv5-n-EVS | Modified CSP v5 | 640×640 | 25 | 59.1 | 125 |
VS | YOLOv5-s-EVS | Modified CSP v5 | 640×640 | 26.9 | 60.2 | 107.5 |
IR | Faster R-CNN (2015) | ResNet-50 | 1000×600 | 28.8 | 68.6 | 12 |
IR | SSD (2016) | VGG-16 | 512×512 | 23.2 | 60.9 | 34 |
IR | RetinaNet (2017) | ResNet-50 | 1333×800 | 27.8 | 68.2 | 14.1 |
IR | YOLOv3 (2018) | DarkNet-53 | 416×416 | 25.3 | 63.6 | 37 |
IR | FCOS (2019) | ResNet-50 | 1333×800 | 29.6 | 69.4 | 14 |
IR | ATSS (2020) | ResNet-50 | 1333×800 | 29 | 69 | 13.8 |
IR | YOLOv4 (2020) | CSPDarkNet-53 | 416×416 | 27.4 | 68.5 | 52.6 |
IR | YOLOX-s (2021) | Modified CSP v5 | 416×416 | 32.8 | 72.1 | 45 |
IR | YOLOX-m (2021) | Modified CSP v5 | 416×416 | 33.5 | 73.1 | 40 |
IR | YOLOF (2021) | ResNet-50 | 1333×800 | 27.3 | 65.6 | 25 |
IR | YOLOv5-n (2020) | Modified CSP v5 | 640×640 | 31.6 | 71 | 158.7 |
IR | YOLOv5-s (2020) | Modified CSP v5 | 640×640 | 32 | 71.5 | 112.4 |
IR | YOLOv5-n-EIR | Modified CSP v5 | 640×640 | 31.8 | 71.3 | 125 |
IR | YOLOv5-s-EIR | Modified CSP v5 | 640×640 | 32.2 | 71.9 | 107.5 |
VS+IR | MMTOD (2019)[18] | ResNet-101 | 1000×600 | 31.1 | 70.7 | 13.2 |
VS+IR | CMDet (2021)[37] | ResNet-101 | 640×512 | 28.3 | 68.4 | 25.3 |
VS+IR | RISNet (2022)[38] | DarkNet-53 | 416×416 | 33.1 | 72.7 | 23 |
VS+IR | Ours-n | Modified CSP v5 | 640×640 | 33.3 | 73.8 | 117.6 |
VS+IR | Ours-s | Modified CSP v5 | 640×640 | 35.2 | 74.5 | 102 |
Table 6. Comparative experimental results on the KAIST dataset
Input | Algorithm | Backbone | Resolution | AP0.5:0.95 | AP0.5 | FPS |
---|---|---|---|---|---|---|
VS | YOLOv3 (2018) | DarkNet-53 | 416×416 | 41.2 | 85.7 | 50 |
VS | FCOS (2019) | ResNet-50 | 1333×800 | 40.4 | 84 | 16 |
VS | ATSS (2020) | ResNet-50 | 1333×800 | 47.1 | 87.1 | 14 |
VS | YOLOv4 (2020) | CSPDarkNet-53 | 416×416 | 44.5 | 87.9 | 53 |
VS | YOLOX-s (2021) | Modified CSP v5 | 416×416 | 51.7 | 90.3 | 52 |
VS | YOLOv5-n (2020) | Modified CSP v5 | 640×640 | 48.4 | 88.8 | 158.7 |
VS | YOLOv5-s (2020) | Modified CSP v5 | 640×640 | 51.4 | 89.8 | 111.1 |
VS | YOLOv5-n-EVS | Modified CSP v5 | 640×640 | 49.4 | 89.1 | 105.3 |
VS | YOLOv5-s-EVS | Modified CSP v5 | 640×640 | 51.9 | 90.1 | 91.7 |
IR | YOLOv3 (2018) | DarkNet-53 | 416×416 | 35.6 | 74.2 | 48.4 |
IR | FCOS (2019) | ResNet-50 | 1333×800 | 34.5 | 72.3 | 12 |
IR | ATSS (2020) | ResNet-50 | 1333×800 | 35.2 | 73.4 | 11.7 |
IR | YOLOv4 (2020) | CSPDarkNet-53 | 416×416 | 35.8 | 74.7 | 49 |
IR | YOLOX-s (2021) | Modified CSP v5 | 416×416 | 36.9 | 76.3 | 53 |
IR | YOLOv5-n (2020) | Modified CSP v5 | 640×640 | 36.3 | 75.5 | 158.7 |
IR | YOLOv5-s (2020) | Modified CSP v5 | 640×640 | 36.6 | 76.8 | 111.1 |
IR | YOLOv5-n-EIR | Modified CSP v5 | 640×640 | 36.4 | 76.3 | 105.3 |
IR | YOLOv5-s-EIR | Modified CSP v5 | 640×640 | 36.7 | 77 | 91.7 |
VS+IR | MMTOD (2019)[18] | ResNet-101 | 1000×600 | 40.7 | 84.3 | 11.2 |
VS+IR | CMDet (2021)[37] | ResNet-101 | 640×512 | 48.6 | 88.9 | 22.7 |
VS+IR | RISNet (2022)[38] | DarkNet-53 | 416×416 | 49.3 | 89.2 | 23.3 |
VS+IR | Ours-n | Modified CSP v5 | 640×640 | 49.7 | 89.8 | 101 |
VS+IR | Ours-s | Modified CSP v5 | 640×640 | 52.2 | 90.5 | 85.5 |
Table 7. Comparative experimental results on the GIR dataset