Advanced Imaging
Vol. 2, Issue 1, 011003 (2025)
Jianqiao Sun1, Yudi Su1, Hao Zhang1,*, Ziheng Cheng1, Zequn Zeng1, Zhengjue Wang2, Chunhui Qu1, Bo Chen1,* and Xin Yuan3,*
Author Affiliations
  • 1National Key Laboratory of Radar Signal Processing, Xidian University, Xi’an, China
  • 2School of Telecommunications Engineering, Xidian University, Xi’an, China
  • 3School of Engineering, Westlake University, Hangzhou, China
    DOI: 10.3788/AI.2025.10021
    Jianqiao Sun, Yudi Su, Hao Zhang, Ziheng Cheng, Zequn Zeng, Zhengjue Wang, Chunhui Qu, Bo Chen, Xin Yuan, "SnapCap: efficient snapshot compressive scene captioning," Adv. Imaging 2, 011003 (2025)
    References

    [1] V. Ramanishka et al. Top-down visual saliency guided by captions, 7206-7215(2017).

    [2] Z. Zhang et al. From compressive sampling to compressive tasking: retrieving semantics in compressed domain with low bandwidth. PhotoniX, 3, 19(2022).

    [3] B. Yang et al. Non-autoregressive coarse-to-fine video captioning, 3119-3127(2021).

    [4] M. Tang et al. CLIP4Caption: CLIP for video caption, 4858-4862(2021).

    [5] Z. Cheng et al. BIRNAT: Bidirectional recurrent neural networks with adversarial training for video snapshot compressive imaging, 258-275(2020).

    [6] L. Wang et al. Spatial-temporal transformer for video snapshot compressive imaging. IEEE Trans. Pattern Anal. Mach. Intell., 45, 9072(2022).

    [7] L. Wang, M. Cao, X. Yuan. EfficientSCI: Densely connected network with space-time factorization for large-scale video snapshot compressive imaging, 18477-18486(2023).

    [8] H. Ryu et al. Semantic grouping network for video captioning, 2514-2522(2021).

    [9] X. Gu et al. Text with knowledge graph augmented transformer for video captioning, 18941-18951(2023).

    [10] S. Chen, Y.-G. Jiang. Motion guided spatial attention for video captioning, 8191-8198(2019).

    [11] X. Yuan, D. J. Brady, A. K. Katsaggelos. Snapshot compressive imaging: Theory, algorithms, and applications. IEEE Signal Process. Mag., 38, 65(2021).

    [12] P. Llull et al. Coded aperture compressive temporal imaging. Opt. Express, 21, 10526-10545(2013).

    [13] M. Qiao et al. Deep learning for video compressive sensing. APL Photonics, 5, 030801(2020).

    [14] X. Yuan et al. Low-cost compressive sensing for color video and depth, 3318-3325(2014).

    [15] Y. Liu et al. Rank minimization for snapshot compressive imaging. IEEE Trans. Pattern Anal. Mach. Intell., 41, 2990-3006(2018).

    [16] X. Yuan. Generalized alternating projection based total variation minimization for compressive sensing, 2539-2543(2016).

    [17] K. Dong et al. Retrieving object motions from coded shutter snapshot in dark environment. IEEE Trans. Image Process., 32, 3281-3294(2023).

    [18] S. Kumawat et al. Action recognition from a single coded image. IEEE Trans. Pattern Anal. Mach. Intell., 45, 4109-4121(2022).

    [19] X. Yuan et al. Plug-and-play algorithms for large-scale snapshot compressive imaging, 1447-1457(2020).

    [20] Z. Wu, J. Zhang, C. Mou. Dense deep unfolding network with 3D-CNN prior for snapshot compressive imaging, 4872-4881(2021).

    [21] Z. Wu et al. Adaptive deep PnP algorithm for video snapshot compressive imaging. Int. J. Comput. Vis., 131, 1662-1679(2023).

    [22] C. Hu et al. Video object detection from one single image through opto-electronic neural network. APL Photonics, 6, 046104(2021).

    [23] B. Wang et al. Reconstruction network for video captioning, 7622-7631(2018).

    [24] S. Chen, Y.-G. Jiang. Motion guided region message passing for video captioning, 1543-1552(2021).

    [25] K. He et al. Deep residual learning for image recognition, 770-778(2016).

    [26] C. Szegedy et al. Inception-v4, Inception-ResNet and the impact of residual connections on learning(2017).

    [27] D. Tran et al. Learning spatiotemporal features with 3D convolutional networks, 4489-4497(2015).

    [28] S. Jing et al. Memory-based augmentation network for video captioning. IEEE Trans. Multimedia, 26, 2367-2379(2023).

    [29] B. Pan et al. Spatio-temporal graph for video captioning with knowledge distillation, 10870-10879(2020).

    [30] H. Ye et al. Hierarchical modular network for video captioning, 17939-17948(2022).

    [31] X. Zhong et al. Refined semantic enhancement towards frequency diffusion for video captioning, 3724-3732(2023).

    [32] S. Liu et al. Bidirectional maximum entropy training with word co-occurrence for video captioning. IEEE Trans. Multimedia, 25, 4494-4507(2022).

    [33] W. Xu et al. Deep reinforcement polishing network for video captioning. IEEE Trans. Multimedia, 23, 1772-1784(2020).

    [34] P. H. Seo et al. End-to-end generative pretraining for multimodal video captioning, 17959-17968(2022).

    [35] J. Wang et al. GIT: a generative image-to-text transformer for vision and language(2022).

    [36] H. Luo et al. UniVL: a unified video and language pre-training model for multimodal understanding and generation(2020).

    [37] H. Xu et al. mPLUG-2: a modularized multi-modal foundation model across text, image and video, 38728-38748(2023).

    [38] C. Schuhmann et al. LAION-400M: open dataset of CLIP-filtered 400 million image-text pairs(2021).

    [39] A. Miech et al. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips, 2630-2640(2019).

    [40] M. Bain et al. Frozen in time: a joint video and image encoder for end-to-end retrieval, 1728-1738(2021).

    [41] Y. Tewel et al. Zero-shot video captioning by evolving pseudo-tokens(2023).

    [42] A. Radford et al. Learning transferable visual models from natural language supervision, 8748-8763(2021).

    [43] Y. Liang et al. IcoCap: Improving video captioning by compounding images. IEEE Trans. Multimedia, 26, 4389-4400(2023).

    [44] Y. Shen et al. Accurate and fast compressed video captioning, 15558-15567(2023).

    [45] X. Jiao et al. TinyBERT: Distilling BERT for natural language understanding(2019).

    [46] A. Yang et al. Context matters: distilling knowledge graph for enhanced object detection. IEEE Trans. Multimedia, 26, 487-500(2023).

    [47] Q. Qi, Y. Yan, H. Wang. Class-aware dual-supervised aggregation network for video object detection. IEEE Trans. Multimedia, 26, 2109-2123(2023).

    [48] K. Xu et al. Feature normalized knowledge distillation for image classification, 664-680(2020).

    [49] X. Li et al. A category-aware curriculum learning for data-free knowledge distillation. IEEE Trans. Multimedia, 26, 9603-9618(2024).

    [50] X. Wang et al. KDGAN: Knowledge distillation with generative adversarial networks(2018).

    [51] M. Li et al. GAN compression: efficient architectures for interactive conditional GANs, 5284-5294(2020).

    [52] W. Zhu, B. Peng, W. Q. Yan. Dual knowledge distillation on multiview pseudo labels for unsupervised person re-identification. IEEE Trans. Multimedia, 26, 7359-7371(2024).

    [53] J. Chen et al. Exploring open-vocabulary semantic segmentation from CLIP vision encoder distillation only, 699-710(2023).

    [54] K. Wu et al. TinyCLIP: CLIP distillation via affinity mimicking and weight inheritance, 21970-21980(2023).

    [55] J. Chang et al. DETRDistill: a universal knowledge distillation framework for DETR-families, 6898-6908(2023).

    [56] J. Rao et al. Parameter-efficient and student-friendly knowledge distillation. IEEE Trans. Multimedia, 26, 4230-4241(2023).

    [57] T. Zhang et al. Efficient RGB-T tracking via cross-modality distillation, 5404-5413(2023).

    [58] S. Gupta et al. Cross modal distillation for supervision transfer, 2827-2836(2016).

    [59] W. I. Cho et al. Speech to text adaptation: Towards an efficient cross-modal distillation(2020).

    [60] D. Alvarez-Melis, T. Jaakkola. Towards robust interpretability with self-explaining neural networks(2018).

    [61] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding(2019).

    [62] J. Xu et al. MSR-VTT: A large video description dataset for bridging video and language, 5288-5296(2016).

    [63] D. Chen, W. B. Dolan. Collecting highly parallel data for paraphrase evaluation, 190-200(2011).
