Audio object detection network with multimodal cross level feature knowledge transfer

Shibei LIU; Ying CHEN

doi:10.37188/OPE.20243202.0237

Optics and Precision Engineering
Vol. 32, Issue 2, 237 (2024)

Shibei LIU and Ying CHEN^*

Author Affiliations

Key Laboratory of Advanced Process Control for Light Industry (Ministry of Education), Jiangnan University, Wuxi 214122, China

Shibei LIU, Ying CHEN. Audio object detection network with multimodal cross level feature knowledge transfer[J]. Optics and Precision Engineering, 2024, 32(2): 237

Fig. 1. Schematic of RGB, depth, and audio information

Fig. 2. Multimodal knowledge distillation object detection network

Fig. 3. Feature heatmaps with and without cross-level fusion

Fig. 4. Cross-level feature knowledge transfer loss based on attentional fusion

Fig. 5. Attention fusion module (AFM) and the KL divergence calculation module (KLD)

Fig. 6. Selection diagram of image and audio

Fig. 7. Example images of MAVD dataset

Fig. 8. Comparison of object detection capability under different network architectures

Fig. 9. Schematic diagram of different fusion modes

Fig. 10. Qualitative comparison of vehicle detection capability with or without LDLoss

Fig. 11. Loss curves for MTALoss and MCFTLoss

Fig. 12. Qualitative comparison of the vehicle detection capabilities of the baseline network and the proposed method

Model	Teacher modality		mAP (higher is better)			Center distance (lower is better)
Model	RGB	Depth	mAP@Avg	mAP@0.5	mAP@0.75	CDx	CDy
StereoSoundNet^［6］	√	-	44.05	62.38	41.46	3.00	2.24
Baseline^［7］	√	-	51.45	69.22	49.07	2.97	1.72
	-	√	40.28	54.09	38.45	6.08	3.28
	√	√	51.91	75.92	47.13	2.07	1.11
Ours	√	-	57.57	77.02	55.85	2.29	1.31
	-	√	48.04	63.53	46.40	4.80	2.67
	√	√	62.23	82.63	61.49	1.95	1.05

Table 1. Comparison of the proposed method and the baseline network under different teacher modalities

Model	Speed (FPS)	Model	Speed (FPS)
Faster R-CNN VGG16	18.41	Yolov3-m	94.81
Faster R-CNN ResNet	13.15	Yolov3-l	66.89
Yolov5-x（EfficientNet-B2）	43.82	Yolov5-s	118.17
SSD300（EfficientNet-B2）	44.41	Yolov5-m	93.20
SSD300	121.39	Yolov5-l	67.04
SSD500	84.16	Yolov5-x	48.33
Yolov3-s	96.90	Ours	49.91

Table 2. Speed (FPS) comparison of the proposed method with classical object detection networks

Model	Loss		mAP			Center distance
Model	MCFT Loss	LD Loss	mAP@Avg	mAP@0.5	mAP@0.75	CDx	CDy
M1	-	-	52.68	72.05	50.04	2.69	1.51
M2	-	√	55.96	76.68	54.87	2.51	1.41
M3	√	-	62.39	82.23	61.38	1.98	1.08
M4	√	√	62.23	82.63	61.49	1.95	1.05

Table 3. Ablation studies for both losses

Hyperparameters		mAP			Center distance
$δ$	$β$	mAP@Avg	mAP@0.5	mAP@0.75	CDx	CDy
1.0	0.003	52.88	72.85	50.77	2.65	1.57
1.0	0.005	62.39	82.23	61.38	1.98	1.08
1.0	0.008	53.86	72.49	51.74	2.81	1.61
1.0	0.01	50.43	69.12	48.44	3.11	1.80
1.0	0.03	51.55	69.56	49.82	3.06	1.75
1.0	0.05	59.29	78.97	57.52	2.25	1.25
1.0	1.0	49.97	67.24	47.87	3.22	1.82

Table 4. Ablation study of the hyperparameters $δ$ and $β$ in the loss function

Hyperparameters			mAP			Center distance
$δ$	$β$	$λ$	mAP@Avg	mAP@0.5	mAP@0.75	CDx	CDy
1.0	0.005	0.005	50.13	66.40	48.27	3.52	2.06
1.0	0.005	0.06	51.17	70.89	48.76	2.87	1.65
1.0	0.005	0.01	52.22	71.67	49.80	2.86	1.64
1.0	0.005	0.25	62.23	82.63	61.49	1.95	1.05
1.0	0.005	0.3	50.95	70.24	48.97	2.92	1.71
1.0	0.005	1.0	55.79	78.13	53.40	2.28	1.29

Table 5. Ablation study of the hyperparameters $δ$, $β$, and $λ$ in the loss function

Method			mAP			Center distance
Cross-level	Fusion mode	Loss calculation	mAP@Avg	mAP@0.5	mAP@0.75	CDx	CDy
-	-	KL	51.91	75.92	47.13	2.07	1.11
-	-	L2	52.68	72.05	50.04	2.69	1.51
√	-	KL	58.36	78.02	56.97	2.31	1.27
√	-	L2	56.13	75.33	54.74	2.67	1.48
√	Pairwise fusion	KL	62.15	81.84	61.13	2.04	1.13
√	Pairwise fusion	L2	58.68	77.97	56.44	2.28	1.28
√	Stacked fusion	KL	62.39	82.23	61.38	1.98	1.08
√	Stacked fusion	L2	61.74	80.54	60.45	2.05	1.11

Table 6. Ablation studies with different fusion methods and loss calculation methods
