• Optics and Precision Engineering
  • Vol. 30, Issue 24, 3198 (2022)
Lin MAO, Hang GAO*, Dawei YANG, and Rubo ZHANG
Author Affiliations
  • School of Electromechanical Engineering, Dalian Minzu University, Dalian 116600, China
    DOI: 10.37188/OPE.20223024.3198
    Lin MAO, Hang GAO, Dawei YANG, Rubo ZHANG. Chained semantic generation network for video captioning[J]. Optics and Precision Engineering, 2022, 30(24): 3198
    References

    [1] P J TANG, H L WANG. From video to language: survey of video captioning and description. Acta Automatica Sinica, 48, 375-397(2022). (in Chinese)

    [2] J X GU, Z H WANG, J KUEN et al. Recent advances in convolutional neural networks. Pattern Recognition, 77, 354-377(2018).

    [3] J L ELMAN. Finding structure in time. Cognitive Science, 14, 179-211(1990).

    [4] S VENUGOPALAN, M ROHRBACH, J DONAHUE et al. Sequence to sequence: video to text, 4534-4542(2015).

    [5] S HOCHREITER, J SCHMIDHUBER. Long short-term memory. Neural Computation, 9, 1735-1780(1997).

    [6] K J CHEN, Y ZHANG. Recurrent neural network multi-label aerial images classification. Opt. Precision Eng., 28, 1404-1413(2020). (in Chinese)

    [7] S HOCHREITER, J SCHMIDHUBER. Long short-term memory. Neural Computation, 9, 1735-1780(1997).

    [8] C G YAN, Y B TU, X Z WANG et al. STAT: spatial-temporal attention mechanism for video captioning. IEEE Transactions on Multimedia, 22, 229-241(2020).

    [9] H R CHEN, K LIN, A MAYE et al. A semantics-assisted video captioning model trained with scheduled sampling. Frontiers in Robotics and AI, 7, 475767(2020).

    [10] Z Q ZHANG, Y Y SHI, C F YUAN et al. Object relational graph with teacher-recommended learning for video captioning, 13275-13285(2020).

    [11] W J PEI, J Y ZHANG, X R WANG et al. Memory-attended recurrent network for video captioning, 8339-8348(2019).

    [12] H Y ZHAO, W ZHOU, X G HOU et al. Multi-label classification of traditional national costume pattern image semantic understanding. Opt. Precision Eng., 28, 695-703(2020). (in Chinese). doi: 10.3788/OPE.20202803.0695

    [13] R SHETTY, J T LAAKSONEN. Video captioning with recurrent networks based on frame- and video-level features and visual content classification. arXiv preprint arXiv: 1512.02949(2015).

    [14] Y B TU, X S ZHANG, B T LIU et al. Video description with spatial-temporal attention, 1014-1022(2017).

    [15] Z GAN, C GAN, X D HE et al. Semantic compositional networks for visual captioning, 1141-1150(2017).

    [16] Q L YANG, B H ZHOU, W ZHENG et al. Trajectory detection of small targets based on convolutional long short-term memory with attention mechanisms. Opt. Precision Eng., 28, 2535-2548(2020). (in Chinese). doi: 10.37188/OPE.20202811.2535

    [17] J XU, T MEI, T YAO et al. MSR-VTT: a large video description dataset for bridging video and language, 5288-5296(2016).

    [18] S GUADARRAMA, N KRISHNAMOORTHY, G MALKARNENKAR et al. YouTube2Text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition, 2712-2719(2013).

    [19] M ZOLFAGHARI, K SINGH, T BROX. ECO: efficient convolutional network for online video understanding, 713-730(2018).

    [20] S N XIE, R GIRSHICK, P DOLLÁR et al. Aggregated residual transformations for deep neural networks, 5987-5995(2017).

    [21] R PASUNURU, M BANSAL. Multi-task video captioning with video and entailment generation, 1273-1283(2017).

    [22] R PASUNURU, M BANSAL. Reinforced video captioning with entailment rewards, 979-985(2017).

    [23] S LIU, Z REN, J S YUAN. SibNet: sibling convolutional encoder for video captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43, 3259-3272(2021).

    [24] X WANG, Y F WANG, W Y WANG. Watch, listen, and describe: globally and locally aligned cross-modal attentions for video captioning, 795-801(2018).

    [25] X WANG, J W WU, D ZHANG et al. Learning to compose topic-aware mixture of experts for zero-shot video captioning. Proceedings of the AAAI Conference on Artificial Intelligence, 33, 8965-8972(2019).

    [26] B R WANG, L MA, W ZHANG et al. Controllable video captioning with POS sequence guidance based on gated fusion network, 2641-2650(2019).

    [27] J Y HOU, X X WU, W T ZHAO et al. Joint syntax representation learning and visual cue translation for video captioning, 8917-8926(2019).

    [28] N AAFAQ, N AKHTAR, W LIU et al. Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning, 12479-12488(2019).

    [29] B X PAN, H Y CAI, D A HUANG et al. Spatio-temporal graph for video captioning with knowledge distillation, 10867-10876(2020).

    [30] Q ZHENG, C Y WANG, D C TAO. Syntax-aware action targeting for video captioning, 13093-13102(2020).

    [31] Y W PAN, T MEI, T YAO et al. Jointly modeling embedding and translation to bridge video and language, 4594-4602(2016).

    [32] H N YU, J WANG, Z H HUANG et al. Video paragraph captioning using hierarchical recurrent neural networks, 4584-4593(2016).

    [33] L L GAO, Z GUO, H W ZHANG et al. Video captioning with attention-based LSTM and semantic consistency. IEEE Transactions on Multimedia, 19, 2045-2055(2017).
