Advanced Imaging, Vol. 2, Issue 1, 011003 (2025)
Jianqiao Sun1, Yudi Su1, Hao Zhang1,*, Ziheng Cheng1..., Zequn Zeng1, Zhengjue Wang2, Chunhui Qu1, Bo Chen1,*, and Xin Yuan3,*
Author Affiliations
  • 1National Key Laboratory of Radar Signal Processing, Xidian University, Xi’an, China
  • 2School of Telecommunications Engineering, Xidian University, Xi’an, China
  • 3School of Engineering, Westlake University, Hangzhou, China
DOI: 10.3788/AI.2025.10021
Jianqiao Sun, Yudi Su, Hao Zhang, Ziheng Cheng, Zequn Zeng, Zhengjue Wang, Chunhui Qu, Bo Chen, and Xin Yuan, "SnapCap: efficient snapshot compressive scene captioning," Adv. Imaging 2, 011003 (2025).
    Fig. 1. Comparing our efficient captioning pipeline in (c) with the traditional (multi-stage) pipeline in (a) and a potential two-stage solution in (b), indicated by red, blue, and yellow, respectively.
    Fig. 2. Comparisons on GPU memory, inference time, and CIDEr score of typical VC methods, where red, blue, and yellow indicate our method, traditional multi-stage VC methods, and two-stage methods, respectively. The size of the circle is proportional to the CIDEr score (↑) marked in brackets.
    Fig. 3. An illustration of a typical video snapshot CS system, CACTI[12].
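For reference, the sketch below simulates the snapshot compressive sensing forward model that such a system implements optically: B mask-modulated frames are summed into a single 2D measurement, the operation referred to as Eq. (1) in the algorithms later on this page. This is a minimal NumPy illustration under that assumed model; the shapes, random masks, and function name are illustrative, not the paper's code.

```python
# Minimal NumPy sketch of the snapshot-CS forward model assumed here:
# B frames are modulated element-wise by B binary masks and summed into one
# 2D coded measurement. Shapes, masks, and names are illustrative only.
import numpy as np

def simulate_measurement(frames: np.ndarray, masks: np.ndarray) -> np.ndarray:
    """frames, masks: (B, H, W) arrays; returns the (H, W) coded measurement."""
    assert frames.shape == masks.shape
    return (masks * frames).sum(axis=0)   # modulate each frame, then sum over time

# Example with B = 8 frames of size 256 x 256 and random binary masks.
rng = np.random.default_rng(0)
B, H, W = 8, 256, 256
frames = rng.random((B, H, W)).astype(np.float32)
masks = (rng.random((B, H, W)) > 0.5).astype(np.float32)
measurement = simulate_measurement(frames, masks)   # shape (256, 256)
```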
    Fig. 4. Learning and inference workflows of our proposed SnapCap. The cooperation of (a)–(c) is for training, and only (b) is needed for an end-to-end captioning during testing.
    Fig. 5. Qualitative results on the MSRVTT dataset. We exhibit the compressed measurement, predicted caption by SnapCap, and the ground truth annotations. For a better understanding, we also show the ground truth video frames.
    Fig. 6. Qualitative results on the MSVD dataset. We exhibit the compressed measurement, predicted caption by SnapCap, and the ground truth annotations. For a better understanding, we also show the ground truth video frames.
Fig. 7. Caption quality (in terms of CIDEr value) comparison of different methods at multiple compression ratios (Refs. [8,16,23,31]).
Fig. 8. Comparison of captioning results (our model's prediction and the two-stage model's prediction) on two real color data sets. The top row shows Ball Rotate and the bottom row shows Hammer. For better understanding, we also plot the reconstructed results of STFormer (top) and BIRNAT (bottom).
Fig. 9. Comparison of captioning results (our model's prediction and the two-stage model's prediction) on four real grayscale data sets: from top to bottom, Domino, Hand, Pendulum, and Water Balloon. For better understanding, we also plot the reconstructed results of STFormer.
Method | Input modality | MSRVTT (B / M / R / C) | MSVD (B / M / R / C)
Video frame-based methods
Recent[23] | V^a | 39.1 / 26.6 / 59.3 / 42.7 | 52.3 / 34.1 / 69.8 / 80.3
SGN[8] | V + M | 40.8 / 28.3 / 60.8 / 49.5 | 52.8 / 35.5 / 72.9 / 94.3
HMN[30] | V + M + D | 43.5 / 29.0 / 62.7 / 51.5 | 59.2 / 37.7 / 75.1 / 104.0
CoCap[44] | V | 43.1 / 29.8 / 62.7 / 56.2 | 55.9 / 39.9 / 76.8 / 113.0
CoCap[44] | V (ViT-L/14) | 44.1 / 30.3 / 63.4 / 57.2 | 60.1 / 41.4 / 78.2 / 121.5
RSFD[31] | V + M + A | 43.4 / 29.3 / 62.3 / 53.1 | 51.2 / 35.7 / 72.9 / 96.7
IcoCap[43] | CLIP features | 47.0 / 31.1 / 64.9 / 60.2 | 59.1 / 39.5 / 76.5 / 110.3
Our TeaCap^b | V | 45.6 / 30.6 / 63.9 / 58.3 | 56.1 / 39.2 / 76.7 / 114.9
Coded measurement-based methods
Our SnapCap | Coded measurement | 44.2 / 30.1 / 63.2 / 56.7 | 54.9 / 38.2 / 75.4 / 108.9
Our SnapCap (ViT-L/14) | Coded measurement | 47.2 / 31.1 / 65.1 / 60.5 | 60.3 / 40.9 / 78.8 / 117.1
    Table 1. A Comparison of Proposed Efficient Measurement-Based Captioning and Different Video-Based VC Methods on MSRVTT and MSVD.
Data: Distribution over videos p(T).
Input: Masks {C_k}_{k=1}^B, loss coefficients α and β, a pre-trained language encoder PLM(·).
Output: Trained parameters for the encoder f(·,·), student model S(·), projector Proj(·), and language decoder Dec(·).
1.  For epoch = 1, 2, …, 30 do
2.   Randomly sample a video T_i ~ p(T) and sample B video frames {X_k}_{k=1}^B from T_i;
3.   Simulate the coded measurement Y with the masks {C_k}_{k=1}^B as in Eq. (1);
4.   Input the measurement Y and masks {C_k}_{k=1}^B to the encoder f(·,·) and then the decoder g(·,·) to obtain X̂ as in Eq. (10);
5.   Update the parameters of the encoder f(·,·) and decoder g(·,·) through the regularization loss as in Eq. (10).
6.  End
7.  For epoch = 1, 2, …, 30 do
8.   Randomly sample a video T_i and generate the coded measurement Y from its B video frames {X_k}_{k=1}^B with the masks {C_k}_{k=1}^B as in Eq. (1);
9.   Input the measurement Y and masks {C_k}_{k=1}^B to the encoder f(·,·) to get the latent representation f_latent as in Eq. (3);
10.  Input the latent representation of the measurement to the student model S(·) to get the feature maps f_conv^s as in Eq. (4) and the visual embedding f^s as in Eq. (6);
11.  Input the video frames {X_k}_{k=1}^B to the teacher model to obtain the feature maps f_conv^t as in Eq. (2) and the visual embedding f^t as in Eq. (5);
12.  Compute the distillation loss L_dis as in Eqs. (7)-(9);
13.  Input the visual embedding f^s to the projector Proj(·) to obtain t as in Eq. (11);
14.  Input the ground-truth annotation to PLM(·) and generate the predicted caption word by word through the language decoder Dec(·) as in Eqs. (12)-(14);
15.  Update the parameters of the encoder f(·,·), student model S(·), projector Proj(·), and language decoder Dec(·).
16. End
Algorithm 1. Training Stage
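The two training phases above can be laid out roughly as in the following PyTorch-style sketch. Every module is a toy stand-in for f(·,·), g(·,·), S(·), Proj(·), Dec(·), the frozen teacher, and PLM(·); the architectures, tensor shapes, the MSE form of the distillation terms, and the placement of the coefficients α and β are assumptions for illustration, not the authors' implementation.

```python
# A compact PyTorch-style sketch of the two training phases in Algorithm 1.
# Every module below is a toy stand-in for f(.,.), g(.), S(.), Proj(.), Dec(.),
# the frozen teacher, and PLM(.); architectures, tensor shapes, the distillation
# terms, and the placement of alpha/beta are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, H, W, D, V, L = 8, 64, 64, 128, 1000, 12   # frames per measurement, frame size, feature dim, vocab, caption length

encoder   = nn.Conv2d(1 + B, B, 3, padding=1)   # f(Y, {C_k}) -> latent, Eq. (3)
decoder   = nn.Conv2d(B, B, 3, padding=1)       # g(.) -> reconstructed frames, Eq. (10)
student   = nn.Conv2d(B, D, 3, padding=1)       # S(.) -> feature maps, Eq. (4)
teacher   = nn.Conv2d(B, D, 3, padding=1)       # frozen teacher applied to true frames, Eq. (2)
projector = nn.Linear(D, D)                     # Proj(.), Eq. (11)
lang_dec  = nn.Linear(D, V)                     # crude stand-in for Dec(.), Eqs. (12)-(14)
alpha, beta = 1.0, 1.0                          # loss coefficients from the Input line (values assumed)

def measurement(frames, masks):                 # coded measurement, Eq. (1)
    return (masks * frames).sum(dim=1)

# Dummy batch standing in for a sampled video T_i ~ p(T) and its annotation.
frames = torch.rand(2, B, H, W)
masks = (torch.rand(2, B, H, W) > 0.5).float()
tokens = torch.randint(0, V, (2, L))

# Phase 1 (steps 1-6): pre-train encoder/decoder with the regularization loss.
opt1 = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
for epoch in range(30):
    y = measurement(frames, masks)
    latent = encoder(torch.cat([y.unsqueeze(1), masks], dim=1))
    loss_reg = F.mse_loss(decoder(latent), frames)          # reconstruction used as regularization
    opt1.zero_grad(); loss_reg.backward(); opt1.step()

# Phase 2 (steps 7-16): distillation from the frozen teacher plus caption supervision.
params = (list(encoder.parameters()) + list(student.parameters())
          + list(projector.parameters()) + list(lang_dec.parameters()))
opt2 = torch.optim.Adam(params, lr=1e-4)
for epoch in range(30):
    y = measurement(frames, masks)
    latent = encoder(torch.cat([y.unsqueeze(1), masks], dim=1))       # Eq. (3)
    fmap_s = student(latent); emb_s = fmap_s.mean(dim=(2, 3))         # Eqs. (4), (6)
    with torch.no_grad():
        fmap_t = teacher(frames); emb_t = fmap_t.mean(dim=(2, 3))     # Eqs. (2), (5)
    loss_dis = F.mse_loss(fmap_s, fmap_t) + F.mse_loss(emb_s, emb_t)  # stand-in for Eqs. (7)-(9)
    prefix = projector(emb_s)                                         # Eq. (11)
    logits = lang_dec(prefix).unsqueeze(1).expand(-1, L, -1)          # repeats one logit vector per step (toy decoder)
    loss_cap = F.cross_entropy(logits.reshape(-1, V), tokens.reshape(-1))
    loss = alpha * loss_dis + beta * loss_cap
    opt2.zero_grad(); loss.backward(); opt2.step()
```

The point of the sketch is the control flow rather than the modules themselves: a reconstruction-supervised warm-up of the encoder, followed by joint distillation and captioning with the teacher kept frozen.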
Compressed input | Dec./Rec. method^a | Cap. method^b | GPU memory (GB) | Dec./Rec. time (ms) | Cap. time (ms) | Total time (ms) | MSRVTT (B / M / R / C)
Encoded video by H.264 | ffmpeg | HMN[30] | 11.1 | 42 | 1430 | 1472 | 43.5 / 29.0 / 62.7 / 51.5
Encoded video by H.264 | ffmpeg | RSFD[31] | 10.1 | 43 | 1476 | 1519 | 43.4 / 29.3 / 62.3 / 53.1
Encoded video by H.264 | – | CoCap[44]^c | 5.20 | – | 387 | 387 | 43.1 / 29.8 / 62.7 / 56.2
Coded measurement^d | BIRNAT[5] | TeaCap | 6.2 | 456 | 185 | 641 | 41.4 / 27.6 / 60.7 / 47.9
Coded measurement^d | STFormer[6] | TeaCap | 17.0 | 825 | 183 | 1008 | 41.7 / 28.6 / 61.3 / 50.8
Coded measurement^d | EfficientSCI[7] | TeaCap | 12.8 | 619 | 182 | 801 | 41.3 / 28.8 / 61.6 / 50.8
Coded measurement^d | – | Our SnapCap | 5.70 | – | 197 | 197 | 44.2 / 30.1 / 63.2 / 56.7
    Table 2. Comparison of GPU Memory, Inference Time, and Captioning Results Using Different Strategies.
Data: Coded measurement Y and masks {C_k}_{k=1}^B.
Input: Trained encoder f(·,·), student model S(·), projector Proj(·), and language decoder Dec(·).
Output: Predicted captions.
1.  Input the measurement Y and masks {C_k}_{k=1}^B to the encoder f(·,·) to get the latent representation f_latent as in Eq. (3).
2.  Input the latent representation of the measurement to the student model S(·) to get the visual embedding f^s as in Eq. (6).
3.  Input the visual embedding f^s to the projector Proj(·) to obtain t as in Eq. (11).
4.  Generate the predicted caption word by word through the language decoder Dec(·) as in Eq. (14).
Algorithm 2. Inference Stage
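Continuing the toy modules from the training sketch above, inference reduces to a single reconstruction-free forward pass from the coded measurement to caption tokens; the greedy argmax below is an illustrative stand-in for the word-by-word decoding of Eq. (14).

```python
# Inference sketch matching Algorithm 2. It assumes the toy modules and the
# measurement() helper from the training sketch are in scope. A single coded
# measurement is mapped straight to caption tokens with no reconstruction step.
with torch.no_grad():
    y = measurement(frames, masks)                                # or a real captured snapshot
    latent = encoder(torch.cat([y.unsqueeze(1), masks], dim=1))   # Eq. (3)
    emb_s = student(latent).mean(dim=(2, 3))                      # Eq. (6)
    prefix = projector(emb_s)                                     # Eq. (11)
    token_ids = lang_dec(prefix).argmax(dim=-1)                   # one greedy token per sample (toy stand-in for Eq. (14))
print(token_ids.shape)   # (batch,) indices into the toy vocabulary
```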
L_reg | L_dis | B | M | R | C
× | × | 24.7 | 21.7 | 52.0 | 16.8
✓ | × | 32.1 | 22.6 | 55.6 | 29.3
× | ✓ | 33.0 | 24.9 | 57.0 | 31.6
✓ | ✓ | 44.2 | 30.1 | 63.2 | 56.7
Table 3. Contributions of Regularization Loss and Distillation Loss on the MSRVTT Dataset.^a