Multi-agent ad-hoc speech recognition

CHEN Junqi; ZHANG Xiaolei

doi:10.11805/tkyda2021247

Abstract

Speech perception is an important part of unmanned systems. Most existing work focuses on speech perception by a single agent, whose performance is limited by factors such as noise and reverberation. It is therefore necessary to study multi-agent speech perception, which improves perception performance through self-organization and cooperation among agents. A multi-agent ad-hoc speech system is proposed under the assumption that each agent outputs one channel of speech stream; the system aims to exploit all channels jointly to improve perception performance. Taking speech recognition as an example, a channel selection method that can handle large-scale multi-agent speech recognition is proposed. Specifically, a stream attention mechanism for end-to-end speech recognition based on the Sparsemax operator is proposed, which forces the weights of noisy channels to zero so that the stream attention also performs channel selection. However, Sparsemax is too harsh and drives the weights of many channels to zero. Scaling Sparsemax is therefore proposed, which penalizes channels more mildly by setting only the weights of strongly noisy channels to zero. In addition, a multilayer stream attention structure is proposed to reduce computational complexity. Experimental results in an unmanned system environment with up to 30 agents, using the Conformer speech recognition architecture, show that in test scenarios with mismatched channel numbers the Word Error Rate (WER) of the proposed Scaling Sparsemax is lower than that of Softmax by over 30% on simulated data sets and by over 20% on semi-real data sets.
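
To illustrate how a sparse operator can turn attention weights into a channel selector, the following is a minimal sketch in plain Python. It compares Softmax with the standard Sparsemax projection onto the probability simplex. The abstract does not give the formula of Scaling Sparsemax, so `scaled_sparsemax` and its `scale` parameter are hypothetical stand-ins: they simply divide the scores by a constant before applying Sparsemax, which is one way to obtain a milder penalty and is not claimed to be the authors' definition. The channel scores are made-up values.

```python
import math


def softmax(scores):
    """Dense attention weights: every channel gets a strictly positive weight."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(scores):
    """Standard Sparsemax: Euclidean projection of the scores onto the simplex.

    Channels whose score falls below the threshold tau get exactly zero weight.
    """
    z_sorted = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, z in enumerate(z_sorted, start=1):
        cumsum += z
        # The k largest scores stay in the support while this condition holds.
        if 1.0 + k * z > cumsum:
            tau = (cumsum - 1.0) / k
    return [max(s - tau, 0.0) for s in scores]


def scaled_sparsemax(scores, scale):
    """Hypothetical milder variant (an assumption, not the paper's formula).

    Dividing the scores by scale >= 1 shrinks the gaps between channels,
    so fewer channels are pushed to zero weight.
    """
    return sparsemax([s / scale for s in scores])


# Made-up attention scores for four channels, from cleanest to noisiest.
scores = [2.0, 1.5, 0.2, -1.0]

dense = softmax(scores)
harsh = sparsemax(scores)
mild = scaled_sparsemax(scores, scale=4.0)

print("softmax:         ", [round(w, 3) for w in dense])
print("sparsemax:       ", [round(w, 3) for w in harsh])
print("scaled sparsemax:", [round(w, 3) for w in mild])
```

With these example scores, Softmax keeps all four channels, Sparsemax zeroes the two lowest-scoring channels, and the scaled variant zeroes only the lowest-scoring one, which mirrors the harsh versus mild behavior described in the abstract.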