[1] Kanagasundaram A, Vogt R, Dean D et al. I-vector based speaker recognition on short utterances. [C]∥Proceedings of the 12th Annual Conference of the International Speech Communication Association (ISCA), 2341-2344(2011).
[2] Matějka P, Glembek O, Castaldo F et al. Full-covariance UBM and heavy-tailed PLDA in I-vector speaker verification. [C]∥2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4828-4831(2011).
[5] Hu Y T, Ren J S, Dai J W et al. Deep multimodal speaker naming. [C]∥Proceedings of the 23rd ACM International Conference on Multimedia-MM'15, 1107-1110(2015).
[6] Geng J J, Liu X, Cheung Y M. Audio-visual speaker recognition via multi-modal correlated neural networks. [C]∥2016 IEEE/WIC/ACM International Conference on Web Intelligence Workshops (WIW), 123-128(2016).
[8] Ren J, Hu Y, Tai Y W et al. Look, listen and learn - a multimodal LSTM for speaker identification. [C]∥AAAI, 3581-3587(2016).
[9] Yao K, Cohn T, Vylomova K et al. Depth-gated recurrent neural networks[J]. arXiv:1508.03790, 2015.
[10] Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural Computation, 9, 1735-1780(1997).
[11] Mikolov T, Karafiát M, Burget L et al. Recurrent neural network based language model. [C]∥Proceedings of the 11th Annual Conference of the International Speech Communication Association (ISCA), 1045-1048(2010).
[12] Sutskever I, Vinyals O, Le Q V. Sequence to sequence learning with neural networks[J]. arXiv:1409.3215v3, 2014.
[13] Kalchbrenner N, Danihelka I, Graves A. Grid long short-term memory[J]. arXiv:1507.01526, 2015.
[14] Hinton G E, Srivastava N, Krizhevsky A et al. Improving neural networks by preventing co-adaptation of feature detectors[J]. Computer Science, 3, 212-223(2012).
[16] Liu Y H, Liu X, Fan W T et al. Efficient audio-visual speaker recognition via deep heterogeneous feature fusion. [C]∥Chinese Conference on Biometric Recognition. Springer, Cham, 575-583(2017).
[17] Azab M, Wang M Z, Smith M et al. Speaker naming in movies. [C]∥Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, 2206-2216(2018).