• Optics and Precision Engineering
  • Vol. 30, Issue 24, 3198 (2022)
Lin MAO, Hang GAO*, Dawei YANG, and Rubo ZHANG
Author Affiliations
  • School of Electromechanical Engineering, Dalian Minzu University, Dalian 116600, China
    DOI: 10.37188/OPE.20223024.3198
    Lin MAO, Hang GAO, Dawei YANG, Rubo ZHANG. Chained semantic generation network for video captioning[J]. Optics and Precision Engineering, 2022, 30(24): 3198
    References

    [1] P J TANG, H L WANG. From video to language: survey of video captioning and description. Acta Automatica Sinica, 48, 375-397(2022). (in Chinese)

    [2] J X GU, Z H WANG, J KUEN et al. Recent advances in convolutional neural networks. Pattern Recognition, 77, 354-377(2018).

    [3] J L ELMAN. Finding structure in time. Cognitive Science, 14, 179-211(1990).

    [4] S VENUGOPALAN, M ROHRBACH, J DONAHUE et al. Sequence to sequence: video to text, 4534-4542(2015).

    [5] S HOCHREITER, J SCHMIDHUBER. Long short-term memory. Neural Computation, 9, 1735-1780(1997).

    [6] K J CHEN, Y ZHANG. Recurrent neural network multi-label aerial images classification. Opt. Precision Eng., 28, 1404-1413(2020). (in Chinese)

    [7] S HOCHREITER, J SCHMIDHUBER. Long short-term memory. Neural Computation, 9, 1735-1780(1997).

    [8] C G YAN, Y B TU, X Z WANG et al. STAT: spatial-temporal attention mechanism for video captioning. IEEE Transactions on Multimedia, 22, 229-241(2020).

    [9] H R CHEN, K LIN, A MAYE et al. A semantics-assisted video captioning model trained with scheduled sampling. Frontiers in Robotics and AI, 7, 475767(2020).

    [10] Z Q ZHANG, Y Y SHI, C F YUAN et al. Object relational graph with teacher-recommended learning for video captioning, 13275-13285(2020).

    [11] W J PEI, J Y ZHANG, X R WANG et al. Memory-attended recurrent network for video captioning, 8339-8348(2019).

    [12] H Y ZHAO, W ZHOU, X G HOU et al. Multi-label classification of traditional national costume pattern image semantic understanding. Opt. Precision Eng., 28, 695-703(2020). (in Chinese). doi: 10.3788/OPE.20202803.0695

    [13] R SHETTY, J T LAAKSONEN. Video captioning with recurrent networks based on frame- and video-level features and visual content classification. arXiv preprint arXiv: 1512.02949(2015).

    [14] Y B TU, X S ZHANG, B T LIU et al. Video description with spatial-temporal attention, 1014-1022(2017).

    [15] Z GAN, C GAN, X D HE et al. Semantic compositional networks for visual captioning, 1141-1150(2017).

    [16] Q L YANG, B H ZHOU, W ZHENG et al. Trajectory detection of small targets based on convolutional long short-term memory with attention mechanisms. Opt. Precision Eng., 28, 2535-2548(2020). (in Chinese). doi: 10.37188/OPE.20202811.2535

    [17] J XU, T MEI, T YAO et al. MSR-VTT: a large video description dataset for bridging video and language, 5288-5296(2016).

    [18] S GUADARRAMA, N KRISHNAMOORTHY, G MALKARNENKAR et al. YouTube2Text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition, 2712-2719(2013).

    [19] M ZOLFAGHARI, K SINGH, T BROX. ECO: efficient convolutional network for online video understanding, 713-730(2018).

    [20] S N XIE, R GIRSHICK, P DOLLÁR et al. Aggregated residual transformations for deep neural networks, 5987-5995(2017).

    [21] R PASUNURU, M BANSAL. Multi-task video captioning with video and entailment generation, 1273-1283(2017).

    [22] R PASUNURU, M BANSAL. Reinforced video captioning with entailment rewards, 979-985(2017).

    [23] S LIU, Z REN, J S YUAN. SibNet: sibling convolutional encoder for video captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43, 3259-3272(2021).

    [24] X WANG, Y F WANG, W Y WANG. Watch, listen, and describe: globally and locally aligned cross-modal attentions for video captioning, 795-801(2018).

    [25] X WANG, J W WU, D ZHANG et al. Learning to compose topic-aware mixture of experts for zero-shot video captioning. Proceedings of the AAAI Conference on Artificial Intelligence, 33, 8965-8972(2019).

    [26] B R WANG, L MA, W ZHANG et al. Controllable video captioning with POS sequence guidance based on gated fusion network, 2641-2650(2019).

    [27] J Y HOU, X X WU, W T ZHAO et al. Joint syntax representation learning and visual cue translation for video captioning, 8917-8926(2019).

    [28] N AAFAQ, N AKHTAR, W LIU et al. Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning, 12479-12488(2019).

    [29] B X PAN, H Y CAI, D A HUANG et al. Spatio-temporal graph for video captioning with knowledge distillation, 10867-10876(2020).

    [30] Q ZHENG, C Y WANG, D C TAO. Syntax-aware action targeting for video captioning, 13093-13102(2020).

    [31] Y W PAN, T MEI, T YAO et al. Jointly modeling embedding and translation to bridge video and language, 4594-4602(2016).

    [32] H N YU, J WANG, Z H HUANG et al. Video paragraph captioning using hierarchical recurrent neural networks, 4584-4593(2016).

    [33] L L GAO, Z GUO, H W ZHANG et al. Video captioning with attention-based LSTM and semantic consistency. IEEE Transactions on Multimedia, 19, 2045-2055(2017).
