Journal of Terahertz Science and Electronic Information Technology, Vol. 19, Issue 1, 156 (2021)
XIAN Rong*, HE Xiaohai, WU Xiaohong, and QING Linbo
Author Affiliations: [in Chinese]
DOI: 10.11805/tkyda2020172
XIAN Rong, HE Xiaohai, WU Xiaohong, QING Linbo. Visual Question Answering based on multimodal bidirectional guided attention[J]. Journal of Terahertz Science and Electronic Information Technology, 2021, 19(1): 156.

    Abstract

Existing deep collaborative attention models for the Visual Question Answering (VQA) task consider only unidirectional, question-guided attention over the image, which limits the interactivity of multimodal learning. To address this, a multimodal bidirectional guided attention network is proposed. The network consists of a multimodal feature extraction module, a bidirectional guided attention module, a feature fusion module, and a classifier. The extracted image and question features each pass through stacked attention layers to produce attention-weighted outputs, which are then linearly fused and fed into a softmax classifier to predict the answer. Finally, a counting module is incorporated to improve the model's counting ability. Experimental results show that the model performs well on the public VQA v2.0 dataset, achieving overall accuracies of 70.77% and 71.28% on the test-dev and test-std splits, respectively, which is competitive with most state-of-the-art models.
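The pipeline the abstract describes (bidirectional guided attention, linear fusion, softmax classification) can be illustrated with a minimal PyTorch sketch. The class names, hidden dimension, layer and head counts, answer-vocabulary size, and the use of standard multi-head cross-attention are illustrative assumptions, not the paper's implementation; the counting module is omitted.

```python
import torch
import torch.nn as nn

class GuidedAttention(nn.Module):
    """One modality's features guide attention over the other (cross-attention)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats: torch.Tensor, guide_feats: torch.Tensor) -> torch.Tensor:
        # query_feats attend over guide_feats (keys and values); residual + layer norm
        attended, _ = self.attn(query_feats, guide_feats, guide_feats)
        return self.norm(query_feats + attended)

class BidirectionalGuidedAttention(nn.Module):
    """Question-guided image attention and image-guided question attention in one layer."""
    def __init__(self, dim: int):
        super().__init__()
        self.img_from_q = GuidedAttention(dim)  # question guides image features
        self.q_from_img = GuidedAttention(dim)  # image guides question features

    def forward(self, img_feats, q_feats):
        return self.img_from_q(img_feats, q_feats), self.q_from_img(q_feats, img_feats)

class VQAModel(nn.Module):
    """Stacked bidirectional attention layers, linear fusion, then a classifier."""
    def __init__(self, dim: int = 512, num_layers: int = 6, num_answers: int = 3129):
        super().__init__()
        self.layers = nn.ModuleList(BidirectionalGuidedAttention(dim) for _ in range(num_layers))
        self.img_proj = nn.Linear(dim, dim)  # linear merge of the two modalities
        self.q_proj = nn.Linear(dim, dim)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, img_feats, q_feats):
        for layer in self.layers:
            img_feats, q_feats = layer(img_feats, q_feats)
        # pool each modality, linearly fuse, classify (softmax is applied in the loss)
        fused = self.img_proj(img_feats.mean(dim=1)) + self.q_proj(q_feats.mean(dim=1))
        return self.classifier(fused)

# Example: 36 image region features and 14 question tokens, both projected to 512-d
model = VQAModel()
logits = model(torch.randn(2, 36, 512), torch.randn(2, 14, 512))
print(logits.shape)  # torch.Size([2, 3129])
```

The answer count of 3129 follows the common VQA v2.0 convention of keeping only frequent candidate answers; the actual vocabulary used in the paper may differ.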