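[AIGC One Sentence Reading]: This paper proposes an audio object detection network based on multimodal cross-level feature knowledge transfer. Using self-supervision and knowledge distillation, it improves the accuracy and robustness of object detection. The network incorporates deep learning techniques and optimizes how audio and visual information are fused, achieving efficient object localization.
[AIGC Short Abstract]: This paper addresses audio object detection by proposing a network based on multimodal cross-level feature knowledge transfer. Through self-supervision and knowledge distillation, the method fuses multimodal information effectively and improves the robustness and accuracy of object detection. The study offers a new approach to audio-based object localization and has practical application value.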
Abstract
Sound is an inherent property of objects and can provide valuable information for target detection. However, current methods that localize targets solely by monitoring environmental sound lack robustness. To address this, a multimodal self-supervised target detection network based on cross-level feature knowledge transfer is proposed. First, because a student network that learns only from teacher features at the same level has limited learning capacity, a fusion-based cross-level knowledge transfer loss is designed: the student's deep and shallow features are fused through attention so that the student learns the corresponding teacher intermediate-layer features more efficiently and extracts more knowledge. Combined with KL divergence, this aligns the teacher and student networks. Second, to compensate for missing localization information, a localization distillation loss is added, which recovers more localization information by fitting the teacher's distribution. Trained on the Multimodal Audio-Visual Detection (MAVD) dataset, the network improves mAP over the baseline network by 6.71%, 14.36%, and 10.32% at IoU thresholds of 0.5 and 0.75 and on average, respectively. These results demonstrate the superiority of the proposed detection network.
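The two loss ingredients named in the abstract, attention-based fusion of the student's shallow and deep features and a KL-divergence term between teacher and student distributions, can be illustrated with a minimal pure-Python sketch. This is not the paper's implementation. All function names, the use of mean squared error for the feature term, the two-way softmax attention, and the temperature with its T^2 scaling are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_kl(teacher_logits, student_logits, temperature=2.0):
    """KL between temperature-softened teacher and student distributions.

    The T^2 factor follows a common distillation convention; whether the
    paper uses it is not stated in the abstract (assumption).
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, p_student)


def attention_fuse(shallow, deep, scores):
    """Fuse two same-length student feature vectors.

    `scores` holds two raw attention scores (shallow, deep); a softmax
    turns them into weights that sum to 1 (hypothetical fusion rule).
    """
    w_shallow, w_deep = softmax(scores)
    return [w_shallow * s + w_deep * d for s, d in zip(shallow, deep)]


def feature_transfer_loss(teacher_feat, shallow, deep, scores):
    """MSE between a teacher feature and the fused student feature."""
    fused = attention_fuse(shallow, deep, scores)
    return sum((t - f) ** 2 for t, f in zip(teacher_feat, fused)) / len(fused)


if __name__ == "__main__":
    teacher_feat = [2.0, 4.0]
    shallow, deep = [1.0, 3.0], [3.0, 5.0]
    # Equal scores give equal weights, so the fused feature is the mean.
    print(feature_transfer_loss(teacher_feat, shallow, deep, [0.0, 0.0]))
    print(distillation_kl([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

Under the same assumptions, a localization distillation term could reuse `distillation_kl` on discretized box-edge distributions from the teacher and student. The abstract does not specify how those distributions are parameterized.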