[1] V. Ramanishka et al. Top-down visual saliency guided by captions, 7206-7215(2017).
[3] B. Yang et al. Non-autoregressive coarse-to-fine video captioning, 3119-3127(2021).
[4] M. Tang et al. CLIP4Caption: CLIP for video caption, 4858-4862(2021).
[5] Z. Cheng et al. BIRNAT: Bidirectional recurrent neural networks with adversarial training for video snapshot compressive imaging, 258-275(2020).
[7] L. Wang, M. Cao, X. Yuan. EfficientSCI: Densely connected network with space-time factorization for large-scale video snapshot compressive imaging, 18477-18486(2023).
[8] H. Ryu et al. Semantic grouping network for video captioning, 2514-2522(2021).
[9] X. Gu et al. Text with knowledge graph augmented transformer for video captioning, 18941-18951(2023).
[10] S. Chen, Y.-G. Jiang. Motion guided spatial attention for video captioning, 8191-8198(2019).
[13] M. Qiao et al. Deep learning for video compressive sensing. APL Photonics, 5, 030801(2020).
[14] X. Yuan et al. Low-cost compressive sensing for color video and depth, 3318-3325(2014).
[16] X. Yuan. Generalized alternating projection based total variation minimization for compressive sensing, 2539-2543(2016).
[19] X. Yuan et al. Plug-and-play algorithms for large-scale snapshot compressive imaging, 1447-1457(2020).
[20] Z. Wu, J. Zhang, C. Mou. Dense deep unfolding network with 3D-CNN prior for snapshot compressive imaging, 4872-4881(2021).
[23] B. Wang et al. Reconstruction network for video captioning, 7622-7631(2018).
[24] S. Chen, Y.-G. Jiang. Motion guided region message passing for video captioning, 1543-1552(2021).
[25] K. He et al. Deep residual learning for image recognition, 770-778(2016).
[26] C. Szegedy et al. Inception-v4, Inception-ResNet and the impact of residual connections on learning(2017).
[27] D. Tran et al. Learning spatiotemporal features with 3D convolutional networks, 4489-4497(2015).
[29] B. Pan et al. Spatio-temporal graph for video captioning with knowledge distillation, 10870-10879(2020).
[30] H. Ye et al. Hierarchical modular network for video captioning, 17939-17948(2022).
[31] X. Zhong et al. Refined semantic enhancement towards frequency diffusion for video captioning, 3724-3732(2023).
[34] P. H. Seo et al. End-to-end generative pretraining for multimodal video captioning, 17959-17968(2022).
[35] J. Wang et al. GIT: A generative image-to-text transformer for vision and language(2022).
[36] H. Luo et al. UniVL: A unified video and language pre-training model for multimodal understanding and generation(2020).
[37] H. Xu et al. mPLUG-2: A modularized multi-modal foundation model across text, image and video, 38728-38748(2023).
[38] C. Schuhmann et al. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs(2021).
[39] A. Miech et al. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips, 2630-2640(2019).
[40] M. Bain et al. Frozen in time: a joint video and image encoder for end-to-end retrieval, 1728-1738(2021).
[41] Y. Tewel et al. Zero-shot video captioning by evolving pseudo-tokens(2023).
[42] A. Radford et al. Learning transferable visual models from natural language supervision, 8748-8763(2021).
[44] Y. Shen et al. Accurate and fast compressed video captioning, 15558-15567(2023).
[45] X. Jiao et al. TinyBERT: Distilling BERT for natural language understanding(2019).
[48] K. Xu et al. Feature normalized knowledge distillation for image classification, 664-680(2020).
[50] X. Wang et al. KDGAN: Knowledge distillation with generative adversarial networks(2018).
[51] M. Li et al. GAN compression: Efficient architectures for interactive conditional GANs, 5284-5294(2020).
[53] J. Chen et al. Exploring open-vocabulary semantic segmentation from CLIP vision encoder distillation only, 699-710(2023).
[54] K. Wu et al. TinyCLIP: CLIP distillation via affinity mimicking and weight inheritance, 21970-21980(2023).
[55] J. Chang et al. DETRDistill: A universal knowledge distillation framework for DETR-families, 6898-6908(2023).
[57] T. Zhang et al. Efficient RGB-T tracking via cross-modality distillation, 5404-5413(2023).
[58] S. Gupta et al. Cross modal distillation for supervision transfer, 2827-2836(2016).
[59] W. I. Cho et al. Speech to text adaptation: Towards an efficient cross-modal distillation(2020).
[60] D. Alvarez Melis, T. Jaakkola. Towards robust interpretability with self-explaining neural networks(2018).
[61] J. Devlin et al. BERT: Pre-training of deep bidirectional transformers for language understanding(2019).
[62] J. Xu et al. MSR-VTT: A large video description dataset for bridging video and language, 5288-5296(2016).
[63] D. Chen, W. B. Dolan. Collecting highly parallel data for paraphrase evaluation, 190-200(2011).