• Optics and Precision Engineering
  • Vol. 32, Issue 2, 237 (2024)
Shibei LIU and Ying CHEN*
Author Affiliations
  • Key Laboratory of Advanced Process Control for Light Industry (Ministry of Education), Jiangnan University, Wuxi 214122, China
    DOI: 10.37188/OPE.20243202.0237
    Shibei LIU, Ying CHEN. Audio object detection network with multimodal cross level feature knowledge transfer[J]. Optics and Precision Engineering, 2024, 32(2): 237
    Fig. 1. Schematic of RGB, depth and audio information
    Fig. 2. Multimodal knowledge distillation target detection network
    Fig. 3. Feature heatmaps with and without cross-level fusion
    Fig. 4. Cross-level feature knowledge transfer loss based on attentional fusion
    Fig. 5. Attention fusion module (AFM) and the KL divergence calculation module (KLD)
    Fig. 6. Selection diagram of image and audio
    Fig. 7. Example images of MAVD dataset
    Fig. 8. Comparison of object detection capability under different network architectures
    Fig. 9. Schematic diagram of different fusion modes
    Fig. 10. Qualitative comparison of vehicle detection capability with or without LDLoss
    Fig. 11. Loss curves for MTALoss and MCFTLoss
    Fig. 12. Qualitative comparison of the vehicle detection capabilities of the baseline network and the proposed method
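    mAP: higher is better. Center distance (CDx, CDy): lower is better.

    | Model | Teacher modalities (RGB, depth) | mAP@Avg | mAP@0.5 | mAP@0.75 | CDx | CDy |
    |---|---|---|---|---|---|---|
    | StereoSoundNet [6] | single | 44.05 | 62.38 | 41.46 | 3.00 | 2.24 |
    | Baseline [7] | single | 51.45 | 69.22 | 49.07 | 2.97 | 1.72 |
    | Baseline [7] | single | 40.28 | 54.09 | 38.45 | 6.08 | 3.28 |
    | Baseline [7] | both | 51.91 | 75.92 | 47.13 | 2.07 | 1.11 |
    | Ours | single | 57.57 | 77.02 | 55.85 | 2.29 | 1.31 |
    | Ours | single | 48.04 | 63.53 | 46.40 | 4.80 | 2.67 |
    | Ours | both | 62.23 | 82.63 | 61.49 | 1.95 | 1.05 |

    Table 1. Comparison of results for the proposed method and the baseline network under different teacher modalities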
    | Model | Speed (FPS) |
    |---|---|
    | Faster R-CNN VGG16 | 18.41 |
    | Faster R-CNN ResNet | 13.15 |
    | Yolov5-x (EfficientNet-B2) | 43.82 |
    | SSD300 (EfficientNet-B2) | 44.41 |
    | SSD300 | 121.39 |
    | SSD500 | 84.16 |
    | Yolov3-s | 96.90 |
    | Yolov3-m | 94.81 |
    | Yolov3-l | 66.89 |
    | Yolov5-s | 118.17 |
    | Yolov5-m | 93.20 |
    | Yolov5-l | 67.04 |
    | Yolov5-x | 48.33 |
    | Ours | 49.91 |

    Table 2. Comparison of the proposed method with classical object detection networks
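    | Model | Losses enabled (MCFT Loss, LD Loss) | mAP@Avg | mAP@0.5 | mAP@0.75 | CDx | CDy |
    |---|---|---|---|---|---|---|
    | M1 | neither | 52.68 | 72.05 | 50.04 | 2.69 | 1.51 |
    | M2 | single | 55.96 | 76.68 | 54.87 | 2.51 | 1.41 |
    | M3 | single | 62.39 | 82.23 | 61.38 | 1.98 | 1.08 |
    | M4 | both | 62.23 | 82.63 | 61.49 | 1.95 | 1.05 |

    Table 3. Ablation studies for both losses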
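    | δ | β | mAP@Avg | mAP@0.5 | mAP@0.75 | CDx | CDy |
    |---|---|---|---|---|---|---|
    | 1.0 | 0.003 | 52.88 | 72.85 | 50.77 | 2.65 | 1.57 |
    | 1.0 | 0.005 | 62.39 | 82.23 | 61.38 | 1.98 | 1.08 |
    | 1.0 | 0.008 | 53.86 | 72.49 | 51.74 | 2.81 | 1.61 |
    | 1.0 | 0.01 | 50.43 | 69.12 | 48.44 | 3.11 | 1.80 |
    | 1.0 | 0.03 | 51.55 | 69.56 | 49.82 | 3.06 | 1.75 |
    | 1.0 | 0.05 | 59.29 | 78.97 | 57.52 | 2.25 | 1.25 |
    | 1.0 | 1.0 | 49.97 | 67.24 | 47.87 | 3.22 | 1.82 |

    Table 4. Ablation study of the hyperparameters δ and β in the loss function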
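    | δ | β | λ | mAP@Avg | mAP@0.5 | mAP@0.75 | CDx | CDy |
    |---|---|---|---|---|---|---|---|
    | 1.0 | 0.005 | 0.005 | 50.13 | 66.40 | 48.27 | 3.52 | 2.06 |
    | 1.0 | 0.005 | 0.06 | 51.17 | 70.89 | 48.76 | 2.87 | 1.65 |
    | 1.0 | 0.005 | 0.01 | 52.22 | 71.67 | 49.80 | 2.86 | 1.64 |
    | 1.0 | 0.005 | 0.25 | 62.23 | 82.63 | 61.49 | 1.95 | 1.05 |
    | 1.0 | 0.005 | 0.3 | 50.95 | 70.24 | 48.97 | 2.92 | 1.71 |
    | 1.0 | 0.005 | 1.0 | 55.79 | 78.13 | 53.40 | 2.28 | 1.29 |

    Table 5. Ablation study of the hyperparameters δ, β, and λ in the loss function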
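    | Cross-level | Fusion mode | Loss calculation | mAP@Avg | mAP@0.5 | mAP@0.75 | CDx | CDy |
    |---|---|---|---|---|---|---|---|
    | - | - | KL | 51.91 | 75.92 | 47.13 | 2.07 | 1.11 |
    | - | - | L2 | 52.68 | 72.05 | 50.04 | 2.69 | 1.51 |
    | ✓ | - | KL | 58.36 | 78.02 | 56.97 | 2.31 | 1.27 |
    | ✓ | - | L2 | 56.13 | 75.33 | 54.74 | 2.67 | 1.48 |
    | ✓ | Pairwise fusion | KL | 62.15 | 81.84 | 61.13 | 2.04 | 1.13 |
    | ✓ | Pairwise fusion | L2 | 58.68 | 77.97 | 56.44 | 2.28 | 1.28 |
    | ✓ | Stacked fusion | KL | 62.39 | 82.23 | 61.38 | 1.98 | 1.08 |
    | ✓ | Stacked fusion | L2 | 61.74 | 80.54 | 60.45 | 2.05 | 1.11 |

    Table 6. Ablation studies with different fusion methods and loss calculation methods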