Hierarchical LSTM-Based Audio and Video Emotion Recognition With Embedded Attention Mechanism

Tianbao Liu; Lingtao Zhang; Wentao Yu; Dongchuan Wei; Yijun Fan

doi:10.3788/LOP202158.0210017

[1] Yuan P P, Zhang L. Pedestrian attribute recognition based on deep learning[J]. Laser & Optoelectronics Progress, 57, 061001(2020).

[2] Liu F, Li M J, Hu J W et al. Expression recognition based on low pixel face images[J]. Laser & Optoelectronics Progress, 57, 101008(2020).

[3] Zhang Y C, Sun Z W. Identity authentication for smart phones based on an optimized convolutional deep belief network[J]. Laser & Optoelectronics Progress, 57, 081009(2020).

[4] Nwe T L, Foo S W, de Silva L C. Speech emotion recognition using hidden Markov models[J]. Speech Communication, 41, 603-623(2003).

[5] Satt A, Rozenberg S, Hoory R. Efficient emotion recognition from speech using deep learning on spectrograms[J]. Proceedings of Interspeech, 2017, 1089-1093(2017).

[6] Sutskever I, Vinyals O, Le Q V. Sequence to sequence learning with neural networks[C]∥Proceedings of the 27th International Conference on Neural Information Processing Systems, December 8-13, 2014, Montreal, Quebec, Canada. New York: Curran Associates, 2, 3104-3112(2014).

[7] Irsoy O, Cardie C. Deep recursive neural networks for compositionality in language[C]∥Proceedings of the 27th International Conference on Neural Information Processing Systems, December 8-13, 2014, Montreal, Quebec, Canada. New York: Curran Associates, 2, 2096-2104(2014).

[8] Lin Z H, Feng M W, dos Santos C N et al. A structured self-attentive sentence embedding[EB/OL]. (2017-03-09)[2020-07-05]. https://arxiv.org/abs/1703.03130.

[9] Guo Z H, Zhang L, Zhang D. A completed modeling of local binary pattern operator for texture classification[J]. IEEE Transactions on Image Processing, 19, 1657-1663(2010).

[10] Zhang S Q, Li L M, Zhao Z J. Speech emotion recognition based on an improved supervised manifold learning algorithm[J]. Journal of Electronics & Information Technology, 32, 2724-2729(2010).

[11] Wang S, Wang W X, Zhao J M et al. Emotion recognition with multimodal features and temporal models[C]∥Proceedings of the 19th ACM International Conference on Multimodal Interaction-ICMI 2017, November 3-17, 2017, Glasgow, UK, 598-602(2017).

[12] Wu C H, Lin J C, Wei W L. Survey on audiovisual emotion recognition: databases, features, and data fusion strategies[J]. APSIPA Transactions on Signal and Information Processing, 3, e12(2014).

[13] Abdel-Hamid O, Mohamed A R, Jiang H et al. Convolutional neural networks for speech recognition[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22, 1533-1545(2014).

[14] Shelhamer E, Long J, Darrell T. Fully convolutional networks for semantic segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39, 640-651(2017).

[15] Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural Computation, 9, 1735-1780(1997).

[16] Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate[EB/OL]. (2016-05-19)[2020-07-05]. https://arxiv.org/abs/1409.0473.

[17] Fan Y, Lu X J, Li D et al. Video-based emotion recognition using CNN-RNN and C3D hybrid networks[C]∥Proceedings of the 18th ACM International Conference on Multimodal Interaction-ICMI 2016, October 31-November 16, 2016, Tokyo, Japan, 445-450(2016).

[18] Nguyen D, Nguyen K, Sridharan S et al. Deep spatio-temporal features for multimodal emotion recognition[C]∥2017 IEEE Winter Conference on Applications of Computer Vision (WACV), March 24-31, 2017, Santa Rosa, CA, USA, 1215-1223(2017).

[19] Knyazev B, Shvetsov R, Efremova N et al. Leveraging large face recognition data for emotion classification[C]∥2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), May 15-19, 2018, Xi'an, China, 692-696(2018).

[20] Wang Y J, Guan L. Recognizing human emotional state from audiovisual signals[J]. IEEE Transactions on Multimedia, 10, 659-668(2008).

[21] Dhall A, Goecke R, Joshi J et al. EmotiW 2016: video and group-level emotion recognition challenges[C]∥Proceedings of the 18th ACM International Conference on Multimodal Interaction-ICMI 2016, October 31-November 16, 2016, Tokyo, Japan. New York: ACM Press, 427-432(2016).

[22] Martin O, Kotsia I, Macq B et al. The eNTERFACE'05 audio-visual emotion database[C]∥22nd International Conference on Data Engineering Workshops (ICDEW'06), April 3-7, 2006, Atlanta, GA, USA(2006).

[23] Avots E, Sapiński T, Bachmann M et al. Audiovisual emotion recognition in wild[J]. Machine Vision and Applications, 30, 975-985(2019).

[24] Noroozi F, Marjanovic M, Njegus A et al. Audio-visual emotion recognition in video clips[J]. IEEE Transactions on Affective Computing, 10, 60-75(2017).

[25] Wang X S, Chen X, Cao C J. Human emotion recognition by optimally fusing facial expression and speech feature[J]. Signal Processing: Image Communication, 84, 115831(2020).

[26] Zhang Y Y, Wang Z R, Du J. Deep fusion: an attention guided factorized bilinear pooling for audio-video emotion recognition[C]∥2019 International Joint Conference on Neural Networks (IJCNN), July 14-19, 2019, Budapest, Hungary. New York: IEEE Press(2019).

[27] Dangol R, Alsadoon A, Prasad P W C et al. Speech emotion recognition using convolutional neural network and long-short term memory[J]. Multimedia Tools and Applications, 79, 32917-32934(2020).