Journal of Terahertz Science and Electronic Information Technology, Vol. 19, Issue 1, 156 (2021)
XIAN Rong*, HE Xiaohai, WU Xiaohong, and QING Linbo
Author Affiliations: [in Chinese]
DOI: 10.11805/tkyda2020172
XIAN Rong, HE Xiaohai, WU Xiaohong, QING Linbo. Visual Question Answering based on multimodal bidirectional guided attention[J]. Journal of Terahertz Science and Electronic Information Technology, 2021, 19(1): 156.

    Abstract

Existing deep collaborative attention models for the Visual Question Answering (VQA) task consider only unidirectional, question-guided attention over the image, which limits the interactivity of multimodal learning. To address this, a multimodal bidirectional guided attention network is proposed. The network consists of a multimodal feature extraction module, a bidirectional guided attention module, a feature fusion module, and a classifier. The extracted image and question features each pass through stacked attention layers to produce attention-weighted outputs, which are then linearly fused and fed into a softmax classifier to predict the answer. Finally, a counting module is incorporated to improve the model's counting ability. Experimental results show that the model performs well on the public VQA v2.0 dataset, achieving overall accuracies of 70.77% and 71.28% on the test-dev and test-std splits, respectively, which is competitive with most state-of-the-art models.
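The pipeline the abstract describes (bidirectional guided attention, linear fusion, softmax classification) can be illustrated with a minimal PyTorch sketch. The class names, hidden dimension, layer and head counts, answer-vocabulary size, and the use of standard multi-head cross-attention are illustrative assumptions, not the paper's implementation; the counting module is omitted.

```python
import torch
import torch.nn as nn

class GuidedAttention(nn.Module):
    """One modality's features guide attention over the other (cross-attention)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats: torch.Tensor, guide_feats: torch.Tensor) -> torch.Tensor:
        # query_feats attend over guide_feats (keys and values); residual + layer norm
        attended, _ = self.attn(query_feats, guide_feats, guide_feats)
        return self.norm(query_feats + attended)

class BidirectionalGuidedAttention(nn.Module):
    """Question-guided image attention and image-guided question attention in one layer."""
    def __init__(self, dim: int):
        super().__init__()
        self.img_from_q = GuidedAttention(dim)  # question guides image features
        self.q_from_img = GuidedAttention(dim)  # image guides question features

    def forward(self, img_feats, q_feats):
        return self.img_from_q(img_feats, q_feats), self.q_from_img(q_feats, img_feats)

class VQAModel(nn.Module):
    """Stacked bidirectional attention layers, linear fusion, then a classifier."""
    def __init__(self, dim: int = 512, num_layers: int = 6, num_answers: int = 3129):
        super().__init__()
        self.layers = nn.ModuleList(BidirectionalGuidedAttention(dim) for _ in range(num_layers))
        self.img_proj = nn.Linear(dim, dim)  # linear merge of the two modalities
        self.q_proj = nn.Linear(dim, dim)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, img_feats, q_feats):
        for layer in self.layers:
            img_feats, q_feats = layer(img_feats, q_feats)
        # pool each modality, linearly fuse, classify (softmax is applied in the loss)
        fused = self.img_proj(img_feats.mean(dim=1)) + self.q_proj(q_feats.mean(dim=1))
        return self.classifier(fused)

# Example: 36 image region features and 14 question tokens, both projected to 512-d
model = VQAModel()
logits = model(torch.randn(2, 36, 512), torch.randn(2, 14, 512))
print(logits.shape)  # torch.Size([2, 3129])
```

The answer count of 3129 follows the common VQA v2.0 convention of keeping only frequent candidate answers; the actual vocabulary used in the paper may differ.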