• Laser & Optoelectronics Progress
  • Vol. 56, Issue 3, 031007 (2019)
Huangkang Chen* and Ying Chen**
Author Affiliations
  • Key Laboratory of Advanced Process Control for Light Industry of the Education Ministry of China, Jiangnan University, Wuxi, Jiangsu 214122, China
    DOI: 10.3788/LOP56.031007
    Huangkang Chen, Ying Chen. Speaker Identification Based on Multimodal Long Short-Term Memory with Depth-Gate[J]. Laser & Optoelectronics Progress, 2019, 56(3): 031007
    References

    [1] Kanagasundaram A, Vogt R, Dean D et al. I-vector based speaker recognition on short utterances. [C]∥Proceedings of the 12th Annual Conference of the International Speech Communication Association (ISCA), 2341-2344(2011).

    [2] Matějka P, Glembek O, Castaldo F et al. Full-covariance UBM and heavy-tailed PLDA in I-vector speaker verification. [C]∥2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4828-4831(2011).

    [3] Alam M R, Bennamoun M, Togneri R et al. A confidence-based late fusion framework for audio-visual biometric identification[J]. Pattern Recognition Letters, 52, 65-71(2015). http://dl.acm.org/citation.cfm?id=2776607

    [4] Wu Z Y, Cai L H. Audio-visual bimodal speaker identification using dynamic Bayesian networks[J]. Journal of Computer Research and Development, 43, 470-475(2006).

    [5] Hu Y T, Ren J S, Dai J W et al. Deep multimodal speaker naming. [C]∥Proceedings of the 23rd ACM International Conference on Multimedia-MM'15, 1107-1110(2015).

    [6] Geng J J, Liu X, Cheung Y M. Audio-visual speaker recognition via multi-modal correlated neural networks. [C]∥2016 IEEE/WIC/ACM International Conference on Web Intelligence Workshops (WIW), 123-128(2016).

    [7] Wen M F, Hu C, Liu W R. Heterogeneous multimodal object recognition method based on deep learning[J]. Journal of Central South University (Science and Technology), 47, 1580-1587(2016).

    [8] Ren J, Hu Y, Tai Y W et al. Look, listen and learn-a multimodal LSTM for speaker identification[C]. AAAI, 3581-3587(2016).

    [9] Yao K, Cohn T, Vylomova K et al. Depth-gated recurrent neural networks. arXiv:1508.03790(2015).

    [10] Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural Computation, 9, 1735-1780(1997).

    [11] Mikolov T, Karafiát M, Burget L et al. Recurrent neural network based language model. [C]∥Proceedings of the 11th Annual Conference of the International Speech Communication Association (ISCA), 1045-1048(2010).

    [12] Sutskever I, Vinyals O, Le Q V. Sequence to sequence learning with neural networks. arXiv:1409.3215v3(2014).

    [13] Kalchbrenner N, Danihelka I, Graves A. Grid long short-term memory. arXiv:1507.01526(2015).

    [14] Hinton G E, Srivastava N, Krizhevsky A et al. Improving neural networks by preventing co-adaptation of feature detectors[J]. Computer Science, 3, 212-223(2012).

    [15] Li Y J, Huang J J, Wang H Y et al. Study of emotion recognition based on fusion multi-modal bio-signal with SAE and LSTM recurrent neural network[J]. Journal on Communications, 38, 109-120(2017).

    [16] Liu Y H, Liu X, Fan W T et al. Efficient audio-visual speaker recognition via deep heterogeneous feature fusion. [C]∥Chinese Conference on Biometric Recognition. Springer, Cham, 575-583(2017).

    [17] Azab M, Wang M Z, Smith M et al. Speaker naming in movies. [C]∥Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, 2206-2216(2018).

    [18] Yang H X, Chen Y, Zhang F et al. Face recognition based on improved gradient local binary pattern[J]. Laser & Optoelectronics Progress, 55, 061004(2018).
