Speaker Identification Based on Multimodal Long Short-Term Memory with Depth-Gate

Huangkang Chen; Ying Chen

doi:10.3788/LOP56.031007

[1] Kanagasundaram A, Vogt R, Dean D et al. I-vector based speaker recognition on short utterances. [C]∥Proceedings of the 12th Annual Conference of the International Speech Communication Association (ISCA), 2341-2344(2011).

[2] Matějka P, Glembek O, Castaldo F et al. Full-covariance UBM and heavy-tailed PLDA in I-vector speaker verification. [C]∥2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4828-4831(2011).

[3] Alam M R, Bennamoun M, Togneri R et al. A confidence-based late fusion framework for audio-visual biometric identification[J]. Pattern Recognition Letters, 52, 65-71(2015). http://dl.acm.org/citation.cfm?id=2776607

[4] Wu Z Y, Cai L H. Audio-visual bimodal speaker identification using dynamic Bayesian networks[J]. Journal of Computer Research and Development, 43, 470-475(2006).

[5] Hu Y T, Ren J S, Dai J W et al. Deep multimodal speaker naming. [C]∥Proceedings of the 23rd ACM International Conference on Multimedia-MM'15, 1107-1110(2015).

[6] Geng J J, Liu X, Cheung Y M. Audio-visual speaker recognition via multi-modal correlated neural networks. [C]∥2016 IEEE/WIC/ACM International Conference on Web Intelligence Workshops (WIW), 123-128(2016).

[7] Wen M F, Hu C, Liu W R. Heterogeneous multimodal object recognition method based on deep learning[J]. Journal of Central South University (Science and Technology), 47, 1580-1587(2016).

[8] Ren J, Hu Y, Tai Y W et al. Look, listen and learn-a multimodal LSTM for speaker identification[C]. AAAI, 3581-3587(2016).

[9] Yao K, Cohn T, Vylomova K et al. Depth-gated recurrent neural networks[J]. arXiv:1508.03790, 2015.

[10] Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural Computation, 9, 1735-1780(1997).

[11] Mikolov T, Karafiát M, Burget L et al. Recurrent neural network based language model. [C]∥Proceedings of the 11th Annual Conference of the International Speech Communication Association (ISCA), 1045-1048(2010).

[12] Sutskever I, Vinyals O, Le Q V. Sequence to sequence learning with neural networks[J]. arXiv:1409.3215v3, 2014.

[13] Kalchbrenner N, Danihelka I, Graves A. Grid long short-term memory[J]. arXiv:1507.01526, 2015.

[14] Hinton G E, Srivastava N, Krizhevsky A et al. Improving neural networks by preventing co-adaptation of feature detectors[J]. Computer Science, 3, 212-223(2012).

[15] Li Y J, Huang J J, Wang H Y et al. Study of emotion recognition based on fusion multi-modal bio-signal with SAE and LSTM recurrent neural network[J]. Journal on Communications, 38, 109-120(2017).

[16] Liu Y H, Liu X, Fan W T et al. Efficient audio-visual speaker recognition via deep heterogeneous feature fusion. [C]∥Chinese Conference on Biometric Recognition. Springer, Cham, 575-583(2017).

[17] Azab M, Wang M Z, Smith M et al. Speaker naming in movies. [C]∥Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, 2206-2216(2018).

[18] Yang H X, Chen Y, Zhang F et al. Face recognition based on improved gradient local binary pattern[J]. Laser & Optoelectronics Progress, 55, 061004(2018).