[1] V. Ramanishka et al. Top-down visual saliency guided by captions, 7206-7215(2017).
[3] B. Yang et al. Non-autoregressive coarse-to-fine video captioning, 3119-3127(2021).
[4] M. Tang et al. CLIP4Caption: CLIP for video caption, 4858-4862(2021).
[5] Z. Cheng et al. BIRNAT: Bidirectional recurrent neural networks with adversarial training for video snapshot compressive imaging, 258-275(2020).
[7] L. Wang, M. Cao, X. Yuan. EfficientSCI: Densely connected network with space-time factorization for large-scale video snapshot compressive imaging, 18477-18486(2023).
[8] H. Ryu et al. Semantic grouping network for video captioning, 2514-2522(2021).
[9] X. Gu et al. Text with knowledge graph augmented transformer for video captioning, 18941-18951(2023).
[10] S. Chen, Y.-G. Jiang. Motion guided spatial attention for video captioning, 8191-8198(2019).
[13] M. Qiao et al. Deep learning for video compressive sensing. APL Photonics, 5, 030801(2020).
[14] X. Yuan et al. Low-cost compressive sensing for color video and depth, 3318-3325(2014).
[16] X. Yuan. Generalized alternating projection based total variation minimization for compressive sensing, 2539-2543(2016).
[19] X. Yuan et al. Plug-and-play algorithms for large-scale snapshot compressive imaging, 1447-1457(2020).
[20] Z. Wu, J. Zhang, C. Mou. Dense deep unfolding network with 3D-CNN prior for snapshot compressive imaging, 4872-4881(2021).
[23] B. Wang et al. Reconstruction network for video captioning, 7622-7631(2018).
[24] S. Chen, Y.-G. Jiang. Motion guided region message passing for video captioning, 1543-1552(2021).
[25] K. He et al. Deep residual learning for image recognition, 770-778(2016).
[26] C. Szegedy et al. Inception-v4, Inception-ResNet and the impact of residual connections on learning(2017).
[27] D. Tran et al. Learning spatiotemporal features with 3D convolutional networks, 4489-4497(2015).
[29] B. Pan et al. Spatio-temporal graph for video captioning with knowledge distillation, 10870-10879(2020).
[30] H. Ye et al. Hierarchical modular network for video captioning, 17939-17948(2022).
[31] X. Zhong et al. Refined semantic enhancement towards frequency diffusion for video captioning, 3724-3732(2023).
[34] P. H. Seo et al. End-to-end generative pretraining for multimodal video captioning, 17959-17968(2022).
[35] J. Wang et al. GIT: A generative image-to-text transformer for vision and language(2022).
[36] H. Luo et al. UniVL: A unified video and language pre-training model for multimodal understanding and generation(2020).
[37] H. Xu et al. mPLUG-2: A modularized multi-modal foundation model across text, image and video, 38728-38748(2023).
[38] C. Schuhmann et al. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs(2021).
[39] A. Miech et al. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips, 2630-2640(2019).
[40] M. Bain et al. Frozen in time: a joint video and image encoder for end-to-end retrieval, 1728-1738(2021).
[41] Y. Tewel et al. Zero-shot video captioning by evolving pseudo-tokens(2023).
[42] A. Radford et al. Learning transferable visual models from natural language supervision, 8748-8763(2021).
[44] Y. Shen et al. Accurate and fast compressed video captioning, 15558-15567(2023).
[45] X. Jiao et al. TinyBERT: Distilling BERT for natural language understanding(2019).
[48] K. Xu et al. Feature normalized knowledge distillation for image classification, 664-680(2020).
[50] X. Wang et al. KDGAN: Knowledge distillation with generative adversarial networks(2018).
[51] M. Li et al. GAN compression: Efficient architectures for interactive conditional GANs, 5284-5294(2020).
[53] J. Chen et al. Exploring open-vocabulary semantic segmentation from CLIP vision encoder distillation only, 699-710(2023).
[54] K. Wu et al. TinyCLIP: CLIP distillation via affinity mimicking and weight inheritance, 21970-21980(2023).
[55] J. Chang et al. DETRDistill: A universal knowledge distillation framework for DETR-families, 6898-6908(2023).
[57] T. Zhang et al. Efficient RGB-T tracking via cross-modality distillation, 5404-5413(2023).
[58] S. Gupta et al. Cross modal distillation for supervision transfer, 2827-2836(2016).
[59] W. I. Cho et al. Speech to text adaptation: Towards an efficient cross-modal distillation(2020).
[60] D. Alvarez Melis, T. Jaakkola. Towards robust interpretability with self-explaining neural networks(2018).
[61] J. Devlin et al. BERT: Pre-training of deep bidirectional transformers for language understanding(2019).
[62] J. Xu et al. MSR-VTT: A large video description dataset for bridging video and language, 5288-5296(2016).
[63] D. Chen, W. B. Dolan. Collecting highly parallel data for paraphrase evaluation, 190-200(2011).