[4] Nwe T L. Foo S W, de Silva L C. Speech emotion recognition using hidden Markov models[J]. Speech Communication, 41, 603-623(2003).
[5] Satt A, Rozenberg S, Hoory R. Efficient emotion recognition from speech using deep learning on spectrograms[J]. Proceedings of Interspeech, 2017, 1089-1093(2017).
[6] Sutskever I, Vinyals O, Le Q V. Sequence to sequence learning with neural networks. [C]∥Proceedings of the 27th International Conference on Neural Information Processing Systems, December 8-13, 2014, Montreal, Quebec, Canada. New York: Curran Associates, 2, 3104-3112(2014).
[7] Irsoy O, Cardie C. Deep recursive neural networks for compositionality in language. [C]∥ Proceedings of the 27th International Conference on Neural Information Processing Systems, December 8-13, 2014, Montreal, Quebec, Canada. New York: Curran Associates, 2, 2096-2104(2014).
[8] Lin Z H, Feng M W, dos Santos C N et al. -03-09)[2020-07-05]. https:∥arxiv., org/abs/1703, 03130(2017).
[9] Guo Z H, Zhang L, Zhang D. A completed modeling of local binary pattern operator for texture classification[J]. IEEE Transactions on Image Processing, 19, 1657-1663(2010).
[10] Zhang S Q, Li L M, Zhao Z J. Speech emotion recognition based on an improved supervised manifold learning algorithm[J]. Journal of Electronics & Information Technology, 32, 2724-2729(2010).
[11] Wang S, Wang W X, Zhao J M et al. Emotion recognition with multimodal features and temporal models[C]∥Proceedings of the 19th ACM International Conference on Multimodal Interaction-ICMI 2017, November 3-17, 2017, Glasgow, UK., 598-602(2017).
[12] Wu C H, Lin J C, Wei W L. Survey on audiovisual emotion recognition: databases, features, and data fusion strategies[J]. APSIPA Transactions on Signal and Information Processing, 3, e12(2014).
[13] Abdel-Hamid O, Mohamed A R, Jiang H et al. Convolutional neural networks for speech recognition[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22, 1533-1545(2014).
[14] Shelhamer E, Long J, Darrell T. Fully convolutional networks for semantic segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39, 640-651(2017).
[15] Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural Computation, 9, 1735-1780(1997).
[16] Bahdanau D, Cho K. -05-19)[2020-07-05]. https:∥arxiv., org/abs/1409, 0473(2016).
[17] Fan Y, Lu X J, Li D et al. Video-based emotion recognition using CNN-RNN and C3D hybrid networks[C]∥Proceedings of the 18th ACM International Conference on Multimodal Interaction-ICMI 2016, October 31-November 16, 2016, Tokyo, , 445-450(2016).
[18] Nguyen D, Nguyen K, Sridharan S et al. Deep spatio-temporal features for multimodal emotion recognition[C]∥2017 IEEE Winter Conference on Applications of Computer Vision (WACV), March 24-31, 2017, Santa Rosa, CA, USA., 1215-1223(2017).
[19] Knyazev B, Shvetsov R, Efremova N et al. Leveraging large face recognition data for emotion classification[C]∥2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), May 15-19, 2018, Xi'an, China., 692-696(2018).
[20] Wang Y J, Guan L. Recognizing human emotional state from audiovisual signals[J]. IEEE Transactions on Multimedia, 10, 659-668(2008).
[21] DhallA, GoeckeR, JoshiJ, et al.EmotiW 2016: video and group-level emotion recognition challenges[C]∥Proceedings of the 18th ACM International Conference on Multimodal Interaction-ICMI 2016, October 31-November 16, 2016, Tokyo, Japan. New York: ACM Press, 2016: 427- 432.
[22] Martin O, Kotsia I, Macq B et al. The eNTERFACE'05 audio-visual emotion database[C]∥22nd International Conference on Data Engineering Workshops (ICDEW'06), April 3-7, 2006, Atlanta, GA, USA.(2006).
[23] Avots E, Sapiński T, Bachmann M et al. Audiovisual emotion recognition in wild[J]. Machine Vision and Applications, 30, 975-985(2019).
[24] Noroozi F, Marjanovic M, Njegus A et al. Audio-visual emotion recognition in video clips[J]. IEEE Transactions on Affective Computing, 10, 60-75(2017).
[25] Wang X S, Chen X, Cao C J. Human emotion recognition by optimally fusing facial expression and speech feature[J]. Signal Processing: Image Communication, 84, 115831(2020).
[26] Zhang YY, Wang ZR, DuJ. Deep fusion: an attention guided factorized bilinear pooling for audio-video emotion recognition[C]∥2019 International Joint Conference on Neural Networks (IJCNN), July 14-19, 2019, Budapest, Hungary. New York: IEEE Press, 2019.
[27] Dangol R, Alsadoon A. Prasad P W C, et al. Speech emotion recognition using convolutional neural network and long-short term memory[J]. Multimedia Tools and Applications, 79, 32917-32934(2020).