Image Semantic Description Algorithm with Integrated Spatial Attention Mechanism

Lie Guo; Tuanshan Zhang; Weizhen Sun; Jielong Guo

doi:10.3788/LOP202158.1210030

[1] Dalal N, Triggs B. Histograms of oriented gradients for human detection[C]. //2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), June 20-25, 2005, San Diego, CA, USA, 886-893(2005).

[2] Fang H, Gupta S, Iandola F et al. From captions to visual concepts and back[C]. //2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 7-12, 2015, Boston, MA, USA, 1473-1482(2015).

[3] Xu K, Ba J, Kiros R et al. Show, attend and tell: neural image caption generation with visual attention[EB/OL]. (2016-04-19)[2020-09-04]. https://arxiv.org/abs/1502.03044

[4] He K M, Zhang X Y, Ren S Q et al. Deep residual learning for image recognition[C]. //2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 27-30, 2016, Las Vegas, NV, USA, 770-778(2016).

[5] Tao Z Y, Li J, Tang X L. Texture images classification algorithm combining wavelet transform and capsule network[J]. Laser & Optoelectronics Progress, 57, 241002(2020).

[6] Szegedy C, Ioffe S, Vanhoucke V et al. Inception-v4, inception-resnet and the impact of residual connections on learning[C]. //Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, 4278-4284(2017).

[7] Huang G, Liu Z, van der Maaten L et al. Densely connected convolutional networks[C]. //2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 21-26, 2017, Honolulu, HI, USA, 2261-2269(2017).

[8] You Q Z, Jin H L, Wang Z W et al. Image captioning with semantic attention[C]. //2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 27-30, 2016, Las Vegas, NV, USA, 4651-4659(2016).

[9] Huang L, Wang W M, Xia Y X et al. Adaptively aligned image captioning via adaptive attention time[EB/OL]. (2020-01-06)[2020-09-04]. https://arxiv.org/abs/1909.09060v3

[10] Huang L, Wang W M, Chen J et al. Attention on attention for image captioning[C]. //2019 IEEE/CVF International Conference on Computer Vision (ICCV), October 27-November 2, 2019, Seoul, Korea (South), 4633-4642(2019).

[11] Liu F L, Liu Y X, Ren X C et al. Aligning visual regions and textual concepts for semantic-grounded image representations[EB/OL]. (2019-11-04)[2020-09-04]. https://arxiv.org/abs/1905.06139v3

[12] Yang X, Tang K H, Zhang H W et al. Auto-encoding scene graphs for image captioning[C]. //2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 15-20, 2019, Long Beach, CA, USA, 10677-10686(2019).

[13] Zhao X H, Yin L F, Zhao C L. Image captioning based on global-local feature and adaptive-attention[J]. Journal of Zhejiang University (Engineering Science), 54, 126-134(2020).

[14] Lu J S, Xiong C M, Parikh D et al. Knowing when to look: adaptive attention via a visual sentinel for image captioning[C]. //2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 21-26, 2017, Honolulu, HI, USA, 3242-3250(2017).

[15] Yang Z C, He X D, Gao J F et al. Stacked attention networks for image question answering[C]. //2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 27-30, 2016, Las Vegas, NV, USA, 21-29(2016).

[16] Xu H J, Saenko K. Ask, attend and answer: exploring question-guided spatial attention for visual question answering[EB/OL]. (2015-11-17)[2020-09-04]. https://arxiv.org/abs/1511.05234v1

[17] Zhu Y K, Groth O, Bernstein M et al. Visual7W: grounded question answering in images[C]. //2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 27-30, 2016, Las Vegas, NV, USA, 4995-5004(2016).

[18] Dong Y F, Yang Y X, Wang L Q. Image semantic segmentation based on multi-scale feature extraction and fully connected conditional random fields[J]. Laser & Optoelectronics Progress, 56, 131007(2019).

[19] Yue S Y. Image semantic segmentation based on hierarchical context information[J]. Laser & Optoelectronics Progress, 56, 241005(2019).

[20] Wang Y H. Image caption based on multi-fusion model[J]. Henan Science and Technology, 34-36(2019).

[21] Vinyals O, Toshev A, Bengio S et al. Show and tell: a neural image caption generator[C]. //2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 7-12, 2015, Boston, MA, USA, 3156-3164(2015).

[22] Vinyals O, Toshev A, Bengio S et al. Show and tell: lessons learned from the 2015 MSCOCO image captioning challenge[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39, 652-663(2017). http://doi.ieeecomputersociety.org/10.1109/TPAMI.2016.2587640

[23] Li R F, Liang H Y, Feng F X et al. Paragraph image captioning with deep fully convolutional neural networks[J]. Journal of Beijing University of Posts and Telecommunications, 42, 155-161(2019).

[24] Karpathy A, Li F F. Deep visual-semantic alignments for generating image descriptions[C]. //2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 7-12, 2015, Boston, MA, USA, 3128-3137(2015).

[25] Chen L, Zhang H W, Xiao J et al. SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning[C]. //2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 21-26, 2017, Honolulu, HI, USA, 6298-6306(2017).

[26] Ren Z, Wang X Y, Zhang N et al. Deep reinforcement learning-based image captioning with embedding reward[C]. //2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 21-26, 2017, Honolulu, HI, USA, 1151-1159(2017).

[27] Bengio S, Vinyals O, Jaitly N et al. Scheduled sampling for sequence prediction with recurrent neural networks[EB/OL]. (2015-09-23)[2020-09-04]. https://arxiv.org/abs/1506.03099

[28] Chung J Y, Gulcehre C, Cho K H et al. Empirical evaluation of gated recurrent neural networks on sequence modeling[EB/OL]. (2014-12-11)[2020-09-04]. https://arxiv.org/abs/1412.3555v1

[29] Luong M T, Pham H, Manning C D. Effective approaches to attention-based neural machine translation[EB/OL]. (2015-09-20)[2020-09-04]. https://arxiv.org/abs/1508.04025

[30] Corbetta M, Shulman G L. Control of goal-directed and stimulus-driven attention in the brain[J]. Nature Reviews Neuroscience, 3, 201-215(2002). http://www.nature.com/nrn/journal/v3/n3/full/nrn755.html

[31] Lin T Y, Maire M, Belongie S et al. Microsoft COCO: common objects in context[M]. //Fleet D, Pajdla T, Schiele B, et al. Computer Vision-ECCV 2014, 8693, 740-755(2014).

[32] Young P, Lai A, Hodosh M et al. From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions[J]. Transactions of the Association for Computational Linguistics, 2, 67-78(2014). http://www.researchgate.net/publication/303721259_From_image_descriptions_to_visual_denotations_New_similarity_metrics_for_semantic_inference_over_event_descriptions

[33] Papineni K, Roukos S, Ward T et al. BLEU: a method for automatic evaluation of machine translation[C]. //Proceedings of the 40th Annual Meeting on Association for Computational Linguistics-ACL’02, July 7-12, 2002, Philadelphia, Pennsylvania(2002).

[34] Kingma D P, Ba J. Adam: a method for stochastic optimization[EB/OL]. (2017-01-30)[2020-09-04]. https://arxiv.org/abs/1412.6980v9

[35] Mao J H, Xu W, Yang Y et al. Deep captioning with multimodal recurrent neural networks (m-RNN)[EB/OL]. (2015-06-11)[2020-09-04]. https://arxiv.org/abs/1412.6632v2