Advanced Imaging, Vol. 2, Issue 1, 011003 (2025)
Jianqiao Sun1, Yudi Su1, Hao Zhang1,*, Ziheng Cheng1, Zequn Zeng1, Zhengjue Wang2, Chunhui Qu1, Bo Chen1,* and Xin Yuan3,*
Author Affiliations
  • 1National Key Laboratory of Radar Signal Processing, Xidian University, Xi’an, China
  • 2School of Telecommunications Engineering, Xidian University, Xi’an, China
  • 3School of Engineering, Westlake University, Hangzhou, China
    DOI: 10.3788/AI.2025.10021
    Jianqiao Sun, Yudi Su, Hao Zhang, Ziheng Cheng, Zequn Zeng, Zhengjue Wang, Chunhui Qu, Bo Chen, Xin Yuan, "SnapCap: efficient snapshot compressive scene captioning," Adv. Imaging 2, 011003 (2025)

    Abstract

    Describing a scene in language is a challenging multi-modal task, as it requires understanding diverse and complex scenes and then transforming them into sentences. Among such tasks, video captioning (VC) has attracted much attention from researchers. For machines, traditional VC follows the "imaging-compression-decoding-and-then-captioning" pipeline, where compression is pivotal for storage and transmission. Such a pipeline, however, has inherent shortcomings: information redundancy, which leads to low efficiency, and information loss during the sampling process, which degrades captioning. To address these problems, we propose a novel VC pipeline that generates captions directly from the compressed measurement captured by a snapshot compressive sensing camera; we dub our model SnapCap. More specifically, benefiting from signal simulation, we have access to abundant measurement-video-annotation data pairs for training our model. In addition, to better extract language-related visual representations from the compressed measurement, we propose to distill knowledge from videos via a pretrained contrastive language-image pretraining (CLIP) model, whose rich language-vision associations guide the learning of SnapCap. To demonstrate the effectiveness of SnapCap, we conduct experiments on three widely used VC datasets. Both qualitative and quantitative results verify the superiority of our pipeline over conventional VC pipelines.
    $Y = \sum_{k=1}^{B} X_k \odot C_k + N,$

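    This is the standard snapshot compressive imaging (SCI) forward model: B video frames X_k are modulated element-wise by coding masks C_k and integrated into a single 2D measurement Y with additive noise N. A minimal NumPy sketch of this simulation (the frame count, mask statistics, and noise level below are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def sci_measurement(frames, masks, noise_std=0.01):
    """Simulate a snapshot compressive measurement:
    Y = sum_k X_k * C_k + N, with element-wise modulation."""
    assert frames.shape == masks.shape            # both (B, H, W)
    measurement = (frames * masks).sum(axis=0)    # modulate each frame, integrate over time
    measurement += noise_std * np.random.randn(*measurement.shape)  # additive noise N
    return measurement

# Toy example: B = 8 frames of size 256 x 256 with random binary masks.
B, H, W = 8, 256, 256
X = np.random.rand(B, H, W)                            # stand-in video clip
C = (np.random.rand(B, H, W) > 0.5).astype(np.float64) # binary modulation masks
Y = sci_measurement(X, C)
print(Y.shape)  # (256, 256): one 2D snapshot encodes all B frames
```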

    $f_{\mathrm{conv}}^{t} = \mathrm{Mean}\big(\mathrm{Conv}_1(X_1, \ldots, X_B)\big) \in \mathbb{R}^{c \times h \times w},$


    $f_{\mathrm{latent}} = f(Y, C).$


    $f_{\mathrm{conv}}^{s} = \mathrm{Conv}_2(f_{\mathrm{latent}}), \quad f_{\mathrm{conv}}^{s} \in \mathbb{R}^{c \times h \times w}.$

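    Together, the three equations above define the convolutional-feature distillation pair: the teacher averages per-frame features Conv1(X_k) of the clean video, while the student encodes the measurement and masks into f_latent and projects it with Conv2 to the same c x h x w shape. A hedged PyTorch sketch; every module definition and size here is a placeholder, since the paper's exact architecture is not reproduced in this section:

```python
import torch
import torch.nn as nn

c = 64  # assumed channel width of the distilled feature maps

conv1 = nn.Conv2d(1, c, 3, padding=1)   # teacher conv applied per frame (placeholder)
f_enc = nn.Conv2d(2, c, 3, padding=1)   # student encoder f(Y, C) (placeholder)
conv2 = nn.Conv2d(c, c, 3, padding=1)   # student head Conv2 (placeholder)

def teacher_conv_feature(frames):
    # frames: (B, 1, H, W); average per-frame conv features over the B frames
    return conv1(frames).mean(dim=0)     # f_conv^t, shape (c, H, W)

def student_conv_feature(Y, C):
    # Pack the measurement and a mask summary as a 2-channel input (an assumption;
    # the exact conditioning on the masks is not spelled out here).
    inp = torch.stack([Y, C.mean(dim=0)]).unsqueeze(0)  # (1, 2, H, W)
    f_latent = f_enc(inp)                # f_latent = f(Y, C)
    return conv2(f_latent).squeeze(0)    # f_conv^s, matched to the teacher's shape
```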

    $f^{t} = \mathrm{Mean}\big(\mathcal{T}(X_1, \ldots, X_B)\big) \in \mathbb{R}^{d},$


    $f^{s} = \mathcal{S}(Y, C) \in \mathbb{R}^{d}.$

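    Here T is the pretrained CLIP image encoder applied to each frame and averaged, and S is the student that maps the measurement directly into the same d-dimensional embedding space. A sketch using the OpenAI CLIP package as the teacher; the ViT-B/32 backbone and the student's input packing are assumptions, not choices confirmed by this section:

```python
import torch
import clip  # OpenAI CLIP (https://github.com/openai/CLIP)

device = "cuda" if torch.cuda.is_available() else "cpu"
teacher, _ = clip.load("ViT-B/32", device=device)  # backbone choice is an assumption

@torch.no_grad()
def teacher_embedding(frames):
    # frames: (B, 3, 224, 224), already CLIP-preprocessed
    per_frame = teacher.encode_image(frames)   # (B, d) CLIP image embeddings
    return per_frame.mean(dim=0)               # f^t = Mean(T(X_1, ..., X_B))

class StudentEncoder(torch.nn.Module):
    """Placeholder for S(Y, C): measurement (+ mask summary) -> R^d."""
    def __init__(self, d=512):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Conv2d(2, 32, 3, stride=2, padding=1), torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
            torch.nn.Linear(32, d),
        )

    def forward(self, Y, C):
        inp = torch.stack([Y, C.mean(dim=0)]).unsqueeze(0)  # assumed input packing
        return self.net(inp).squeeze(0)                     # f^s in R^d
```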

    $\mathcal{L}_{\mathrm{conv}} = \mathcal{L}_{\mathrm{MSE}}(f_{\mathrm{conv}}^{s}, f_{\mathrm{conv}}^{t}),$


    $\mathcal{L}_{\mathrm{emb}} = \mathcal{L}_{\mathrm{MSE}}(f^{s}, f^{t}),$


    $\mathcal{L}_{\mathrm{dis}} = \mathcal{L}_{\mathrm{conv}} + \alpha \mathcal{L}_{\mathrm{emb}},$

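    The two MSE terms align the student with the teacher at both the feature-map and embedding levels. A short sketch; the teacher features are detached on the assumption that the pretrained teacher is frozen, and the value of alpha is illustrative:

```python
import torch.nn.functional as F

alpha = 1.0  # trade-off between the two terms; the paper's value is not given here

def distillation_loss(f_conv_s, f_conv_t, f_s, f_t):
    l_conv = F.mse_loss(f_conv_s, f_conv_t.detach())  # L_conv: feature-map MSE
    l_emb = F.mse_loss(f_s, f_t.detach())             # L_emb: embedding MSE
    return l_conv + alpha * l_emb                     # L_dis = L_conv + alpha * L_emb
```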

    $\hat{X} = g(f(Y, C)), \quad \mathcal{L}_{\mathrm{reg}} = \mathcal{L}_{1}(\hat{X}, X).$

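    The latent feature is additionally regularized by decoding it back to the video frames with g and penalizing the L1 reconstruction error. A one-function sketch, with g as an unspecified decoder module:

```python
import torch.nn.functional as F

def reconstruction_loss(decoder, f_latent, X):
    # decoder plays the role of g; its architecture is not specified in this section
    X_hat = decoder(f_latent)       # X_hat = g(f(Y, C)): frames recovered from the latent
    return F.l1_loss(X_hat, X)      # L_reg = L1(X_hat, X)
```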

    $t = \mathrm{Proj}(f^{s}), \quad t \in \mathbb{R}^{D},$


    $c_{<i} = \mathrm{PLM}(y_{<i}),$


    $z_i = \mathrm{Concat}(t, c_{<i}),$


    $p(y_i) = \mathrm{Dec}(z_i),$

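    These four equations describe prefix-style caption generation: the student embedding is projected into the language model's space, concatenated with the context of the already-generated tokens, and decoded into a next-token distribution. A hedged sketch, with plm and dec as stand-ins for the pretrained language model and its decoding head; the dimensions and the exact conditioning are assumptions:

```python
import torch
import torch.nn as nn

d, D = 512, 768  # student-embedding and LM hidden sizes (illustrative values)

proj = nn.Linear(d, D)  # Proj: lift f^s into the language model's space

def decode_step(f_s, plm, dec, y_prev):
    """One autoregressive step of caption generation.

    plm embeds the tokens generated so far; dec maps the concatenated
    sequence to a next-token distribution. Both are placeholder callables.
    """
    t = proj(f_s)                                      # t = Proj(f^s), t in R^D
    c_prev = plm(y_prev)                               # c_{<i} = PLM(y_{<i}), (i-1, D)
    z_i = torch.cat([t.unsqueeze(0), c_prev], dim=0)   # z_i = Concat(t, c_{<i})
    return dec(z_i)                                    # p(y_i) = Dec(z_i)
```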

    $\mathcal{L}_{\mathrm{cap}} = -\sum_{i=1}^{L} \log p(y_i^{*} \mid f^{s}, y_{<i}^{*}),$


    $\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{dis}} + \beta \mathcal{L}_{\mathrm{cap}},$

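    Training maximizes the likelihood of the ground-truth caption y* under teacher forcing, and the final objective combines the distillation and captioning terms. A sketch, with beta as an illustrative weight:

```python
import torch.nn.functional as F

beta = 1.0  # trade-off weight; the paper's value is not given here

def caption_loss(logits, target_ids):
    # logits: (L, V) per-step token logits under teacher forcing;
    # target_ids: (L,) ground-truth caption tokens y_i^*.
    # L_cap = -sum_i log p(y_i^* | f^s, y_{<i}^*)
    return F.cross_entropy(logits, target_ids, reduction="sum")

def total_loss(l_dis, l_cap):
    return l_dis + beta * l_cap     # L_total = L_dis + beta * L_cap
```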

    $f^{s} = \mathcal{S}(Y, C).$

