Advanced Imaging, Vol. 2, Issue 1, 011003 (2025)
Jianqiao Sun1, Yudi Su1, Hao Zhang1,*, Ziheng Cheng1, Zequn Zeng1, Zhengjue Wang2, Chunhui Qu1, Bo Chen1,* and Xin Yuan3,*
Author Affiliations
  • 1National Key Laboratory of Radar Signal Processing, Xidian University, Xi’an, China
  • 2School of Telecommunications Engineering, Xidian University, Xi’an, China
  • 3School of Engineering, Westlake University, Hangzhou, China
    DOI: 10.3788/AI.2025.10021
    Jianqiao Sun, Yudi Su, Hao Zhang, Ziheng Cheng, Zequn Zeng, Zhengjue Wang, Chunhui Qu, Bo Chen, Xin Yuan, "SnapCap: efficient snapshot compressive scene captioning," Adv. Imaging 2, 011003 (2025)

    Abstract

    Describing a scene in language is a challenging multi-modal task, as it requires understanding diverse and complex scenes and then transforming them into sentences. Among such tasks, video captioning (VC) has attracted much attention from researchers. For machines, traditional VC follows the "imaging-compression-decoding-and-then-captioning" pipeline, where compression is pivotal for storage and transmission. Such a pipeline, however, has inherent shortcomings: information redundancy, which leads to low efficiency, and information loss during the sampling process, which degrades captioning. To address these problems, we propose a novel VC pipeline that generates captions directly from the compressed measurement captured by a snapshot compressive sensing camera; we dub our model SnapCap. More specifically, benefiting from signal simulation, we have access to abundant measurement-video-annotation data pairs for training our model. In addition, to better extract language-related visual representations from the compressed measurement, we propose to distill knowledge from videos via a pretrained contrastive language-image pretraining (CLIP) model, whose rich language-vision associations guide the learning of SnapCap. To demonstrate the effectiveness of SnapCap, we conduct experiments on three widely used VC datasets. Both qualitative and quantitative results verify the superiority of our pipeline over conventional VC pipelines.
    $Y = \sum_{k=1}^{B} X_k \odot C_k + N,$

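    This is the standard snapshot compressive imaging (SCI) forward model: B video frames X_k are modulated element-wise by coding masks C_k and integrated into a single 2D measurement Y with additive noise N. A minimal NumPy sketch of this simulation (the frame count, mask statistics, and noise level below are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def sci_measurement(frames, masks, noise_std=0.01):
    """Simulate a snapshot compressive measurement:
    Y = sum_k X_k * C_k + N, with element-wise modulation."""
    assert frames.shape == masks.shape            # both (B, H, W)
    measurement = (frames * masks).sum(axis=0)    # modulate each frame, integrate over time
    measurement += noise_std * np.random.randn(*measurement.shape)  # additive noise N
    return measurement

# Toy example: B = 8 frames of size 256 x 256 with random binary masks.
B, H, W = 8, 256, 256
X = np.random.rand(B, H, W)                            # stand-in video clip
C = (np.random.rand(B, H, W) > 0.5).astype(np.float64) # binary modulation masks
Y = sci_measurement(X, C)
print(Y.shape)  # (256, 256): one 2D snapshot encodes all B frames
```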

    $f_{\mathrm{conv}}^{t} = \mathrm{Mean}\big(\mathrm{Conv}_1(X_1, \ldots, X_B)\big) \in \mathbb{R}^{c \times h \times w},$


    $f_{\mathrm{latent}} = f(Y, C).$


    $f_{\mathrm{conv}}^{s} = \mathrm{Conv}_2(f_{\mathrm{latent}}), \quad f_{\mathrm{conv}}^{s} \in \mathbb{R}^{c \times h \times w}.$

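    Together, the three equations above define the convolutional-feature distillation pair: the teacher averages per-frame features Conv1(X_k) of the clean video, while the student encodes the measurement and masks into f_latent and projects it with Conv2 to the same c x h x w shape. A hedged PyTorch sketch; every module definition and size here is a placeholder, since the paper's exact architecture is not reproduced in this section:

```python
import torch
import torch.nn as nn

c = 64  # assumed channel width of the distilled feature maps

conv1 = nn.Conv2d(1, c, 3, padding=1)   # teacher conv applied per frame (placeholder)
f_enc = nn.Conv2d(2, c, 3, padding=1)   # student encoder f(Y, C) (placeholder)
conv2 = nn.Conv2d(c, c, 3, padding=1)   # student head Conv2 (placeholder)

def teacher_conv_feature(frames):
    # frames: (B, 1, H, W); average per-frame conv features over the B frames
    return conv1(frames).mean(dim=0)     # f_conv^t, shape (c, H, W)

def student_conv_feature(Y, C):
    # Pack the measurement and a mask summary as a 2-channel input (an assumption;
    # the exact conditioning on the masks is not spelled out here).
    inp = torch.stack([Y, C.mean(dim=0)]).unsqueeze(0)  # (1, 2, H, W)
    f_latent = f_enc(inp)                # f_latent = f(Y, C)
    return conv2(f_latent).squeeze(0)    # f_conv^s, matched to the teacher's shape
```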

    $f^{t} = \mathrm{Mean}\big(\mathcal{T}(X_1, \ldots, X_B)\big) \in \mathbb{R}^{d},$


    $f^{s} = \mathcal{S}(Y, C) \in \mathbb{R}^{d}.$

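    Here T is the pretrained CLIP image encoder applied to each frame and averaged, and S is the student that maps the measurement directly into the same d-dimensional embedding space. A sketch using the OpenAI CLIP package as the teacher; the ViT-B/32 backbone and the student's input packing are assumptions, not choices confirmed by this section:

```python
import torch
import clip  # OpenAI CLIP (https://github.com/openai/CLIP)

device = "cuda" if torch.cuda.is_available() else "cpu"
teacher, _ = clip.load("ViT-B/32", device=device)  # backbone choice is an assumption

@torch.no_grad()
def teacher_embedding(frames):
    # frames: (B, 3, 224, 224), already CLIP-preprocessed
    per_frame = teacher.encode_image(frames)   # (B, d) CLIP image embeddings
    return per_frame.mean(dim=0)               # f^t = Mean(T(X_1, ..., X_B))

class StudentEncoder(torch.nn.Module):
    """Placeholder for S(Y, C): measurement (+ mask summary) -> R^d."""
    def __init__(self, d=512):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Conv2d(2, 32, 3, stride=2, padding=1), torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
            torch.nn.Linear(32, d),
        )

    def forward(self, Y, C):
        inp = torch.stack([Y, C.mean(dim=0)]).unsqueeze(0)  # assumed input packing
        return self.net(inp).squeeze(0)                     # f^s in R^d
```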

    $\mathcal{L}_{\mathrm{conv}} = \mathcal{L}_{\mathrm{MSE}}(f_{\mathrm{conv}}^{s}, f_{\mathrm{conv}}^{t}),$


    $\mathcal{L}_{\mathrm{emb}} = \mathcal{L}_{\mathrm{MSE}}(f^{s}, f^{t}),$


    $\mathcal{L}_{\mathrm{dis}} = \mathcal{L}_{\mathrm{conv}} + \alpha \mathcal{L}_{\mathrm{emb}},$

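    The two MSE terms align the student with the teacher at both the feature-map and embedding levels. A short sketch; the teacher features are detached on the assumption that the pretrained teacher is frozen, and the value of alpha is illustrative:

```python
import torch.nn.functional as F

alpha = 1.0  # trade-off between the two terms; the paper's value is not given here

def distillation_loss(f_conv_s, f_conv_t, f_s, f_t):
    l_conv = F.mse_loss(f_conv_s, f_conv_t.detach())  # L_conv: feature-map MSE
    l_emb = F.mse_loss(f_s, f_t.detach())             # L_emb: embedding MSE
    return l_conv + alpha * l_emb                     # L_dis = L_conv + alpha * L_emb
```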

    $\hat{X} = g(f(Y, C)), \quad \mathcal{L}_{\mathrm{reg}} = \mathcal{L}_{1}(\hat{X}, X).$

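    The latent feature is additionally regularized by decoding it back to the video frames with g and penalizing the L1 reconstruction error. A one-function sketch, with g as an unspecified decoder module:

```python
import torch.nn.functional as F

def reconstruction_loss(decoder, f_latent, X):
    # decoder plays the role of g; its architecture is not specified in this section
    X_hat = decoder(f_latent)       # X_hat = g(f(Y, C)): frames recovered from the latent
    return F.l1_loss(X_hat, X)      # L_reg = L1(X_hat, X)
```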

    $t = \mathrm{Proj}(f^{s}), \quad t \in \mathbb{R}^{D},$


    $c_{<i} = \mathrm{PLM}(y_{<i}),$


    $z_i = \mathrm{Concat}(t, c_{<i}),$


    $p(y_i) = \mathrm{Dec}(z_i),$

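    These four equations describe prefix-style caption generation: the student embedding is projected into the language model's space, concatenated with the context of the already-generated tokens, and decoded into a next-token distribution. A hedged sketch, with plm and dec as stand-ins for the pretrained language model and its decoding head; the dimensions and the exact conditioning are assumptions:

```python
import torch
import torch.nn as nn

d, D = 512, 768  # student-embedding and LM hidden sizes (illustrative values)

proj = nn.Linear(d, D)  # Proj: lift f^s into the language model's space

def decode_step(f_s, plm, dec, y_prev):
    """One autoregressive step of caption generation.

    plm embeds the tokens generated so far; dec maps the concatenated
    sequence to a next-token distribution. Both are placeholder callables.
    """
    t = proj(f_s)                                      # t = Proj(f^s), t in R^D
    c_prev = plm(y_prev)                               # c_{<i} = PLM(y_{<i}), (i-1, D)
    z_i = torch.cat([t.unsqueeze(0), c_prev], dim=0)   # z_i = Concat(t, c_{<i})
    return dec(z_i)                                    # p(y_i) = Dec(z_i)
```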

    $\mathcal{L}_{\mathrm{cap}} = -\sum_{i=1}^{L} \log p(y_i^{*} \mid f^{s}, y_{<i}^{*}),$


    $\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{dis}} + \beta \mathcal{L}_{\mathrm{cap}},$

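    Training maximizes the likelihood of the ground-truth caption y* under teacher forcing, and the final objective combines the distillation and captioning terms. A sketch, with beta as an illustrative weight:

```python
import torch.nn.functional as F

beta = 1.0  # trade-off weight; the paper's value is not given here

def caption_loss(logits, target_ids):
    # logits: (L, V) per-step token logits under teacher forcing;
    # target_ids: (L,) ground-truth caption tokens y_i^*.
    # L_cap = -sum_i log p(y_i^* | f^s, y_{<i}^*)
    return F.cross_entropy(logits, target_ids, reduction="sum")

def total_loss(l_dis, l_cap):
    return l_dis + beta * l_cap     # L_total = L_dis + beta * L_cap
```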

    $f^{s} = \mathcal{S}(Y, C).$

