Hierarchical LSTM-Based Audio and Video Emotion Recognition With Embedded Attention Mechanism

Tianbao Liu; Lingtao Zhang; Wentao Yu; Dongchuan Wei; Yijun Fan

doi:10.3788/LOP202158.0210017

Laser & Optoelectronics Progress
Vol. 58, Issue 2, 0210017 (2021)

Tianbao Liu, Lingtao Zhang^*, Wentao Yu, Dongchuan Wei, and Yijun Fan

Author Affiliations

College of Computer and Information Engineering, Central South University of Forestry and Technology, Changsha, Hunan 410004, China

Tianbao Liu, Lingtao Zhang, Wentao Yu, Dongchuan Wei, Yijun Fan. Hierarchical LSTM-Based Audio and Video Emotion Recognition With Embedded Attention Mechanism[J]. Laser & Optoelectronics Progress, 2021, 58(2): 0210017

Abstract

A single-layer long short-term memory (LSTM) network does not generalize well to complex speech emotion recognition problems. Therefore, a hierarchical LSTM model with a self-attention mechanism is proposed, and penalty terms are introduced to improve network performance. For emotion recognition on video sequences, an attention mechanism assigns a weight to each video frame according to its emotional information, and the weighted frames are then classified. A weighted decision fusion method fuses the facial expression and speech modalities to produce the final emotion recognition result. Experimental results demonstrate that, compared with single-modal emotion recognition, the proposed method improves recognition accuracy on the selected data by approximately 4%.
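
The two mechanisms named in the abstract, per-frame attention weighting and weighted decision fusion, can be illustrated with a minimal sketch. The abstract does not give the scoring function, the fusion weights, or the class set, so everything below is an assumption made for illustration: softmax-normalized frame scores, a convex combination of per-class probabilities, and the hypothetical names `attention_pool`, `weighted_decision_fusion`, and `w_audio`. It is not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frame_features, frame_scores):
    """Weight each frame's feature vector by softmax(frame_scores) and sum.

    frame_features: list of equal-length feature vectors, one per frame.
    frame_scores:   one unnormalized relevance score per frame (assumed to
                    come from some learned scorer; not specified in the text).
    """
    if len(frame_features) != len(frame_scores):
        raise ValueError("one score is required per frame")
    weights = softmax(frame_scores)
    dim = len(frame_features[0])
    pooled = [
        sum(w * feat[i] for w, feat in zip(weights, frame_features))
        for i in range(dim)
    ]
    return pooled, weights


def weighted_decision_fusion(p_audio, p_video, w_audio=0.5):
    """Fuse per-class probabilities from two modalities.

    Assumed form: fused = w_audio * p_audio + (1 - w_audio) * p_video.
    Returns the fused distribution and the index of the winning class.
    """
    if not 0.0 <= w_audio <= 1.0:
        raise ValueError("w_audio must lie in [0, 1]")
    if len(p_audio) != len(p_video):
        raise ValueError("both modalities must score the same classes")
    fused = [w_audio * a + (1.0 - w_audio) * v for a, v in zip(p_audio, p_video)]
    label = max(range(len(fused)), key=fused.__getitem__)
    return fused, label
```

In this sketch, frames with higher scores contribute more to the pooled video representation, and `w_audio` controls how much the speech classifier's output counts relative to the video classifier's. In practice such a weight would typically be tuned on validation data.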

Keywords

attention mechanism; emotion recognition; fully convolutional neural network; image processing; long short-term memory network; multimodal fusion