Visual Question Answering based on multimodal bidirectional guided attention

XIAN Rong; HE Xiaohai; WU Xiaohong; QING Linbo

doi:10.11805/tkyda2020172

[1] ANTOL S,AGRAWAL A,LU J,et al. VQA:Visual Question Answering[J]. International Journal of Computer Vision, 2017,123(1):4-31.

[2] WANG L,LI Y,HUANG J,et al. Learning two-branch neural networks for image-text matching tasks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019,41(2):394-407.

[3] ANDERSON P,HE X,BUEHLER C,et al. Bottom-up and top-down attention for image captioning and visual question answering[C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City,USA:IEEE, 2018:6077-6086.

[4] WU Q,SHEN C,WANG P,et al. Image captioning and visual question answering based on attributes and external knowledge[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018,40(6):1367-1381.

[5] SHIH K J,SINGH S,HOIEM D,et al. Where to look:focus regions for visual question answering[C]// Computer Vision and Pattern Recognition. Las Vegas,USA:IEEE, 2016:4613-4621.

[6] PENG L,YANG Y,BIN Y,et al. Word-to-region attention network for visual question answering[J]. Multimedia Tools and Applications, 2019,78(3):3843-3858.

[7] LU J,YANG J,BATRA D,et al. Hierarchical question-image co-attention for visual question answering[C]// Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona,Spain:Curran Associates, 2016:289-297.

[8] YANG C,JIANG M,JIANG B,et al. Co-attention network with question type for visual question answering[J]. IEEE Access, 2019,7:40771-40781.

[9] KIM J,JUN J,ZHANG B,et al. Bilinear attention networks[C]// Neural Information Processing Systems. Montreal,Canada:Curran Associates, 2018:1564-1574.

[10] YU Z,YU J,CUI Y,et al. Deep modular co-attention networks for visual question answering[C]// Computer Vision and Pattern Recognition. Long Beach,CA,USA:IEEE, 2019:6281-6290.

[11] ZHANG Y,HARE J,PRUGEL-BENNETT A,et al. Learning to count objects in natural images for visual question answering[C]// International Conference on Learning Representations. Vancouver,Canada:[s.n.], 2018.

[12] VASWANI A,SHAZEER N,PARMAR N,et al. Attention is all you need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach,USA:Curran Associates, 2017:5998-6008.

[14] TENEY D,ANDERSON P,HE X,et al. Tips and tricks for visual question answering:learnings from the 2017 challenge[C]// IEEE/CVF Computer Vision and Pattern Recognition. Salt Lake City,USA:IEEE, 2018:4223-4232.

[15] GAO P,YOU H,ZHANG Z,et al. Multi-modality latent interaction network for visual question answering[C]// International Conference on Computer Vision. Seoul,Korea(South):IEEE, 2019:5825-5835.

[16] GAO P,JIANG Z,YOU H,et al. Dynamic fusion with intra-and inter-modality attention flow for visual question answering[C]// Computer Vision and Pattern Recognition. Long Beach,CA,USA:IEEE, 2019:6639-6648.