Advanced Imaging, Vol. 2, Issue 1, 011003 (2025)
Jianqiao Sun1, Yudi Su1, Hao Zhang1,*, Ziheng Cheng1..., Zequn Zeng1, Zhengjue Wang2, Chunhui Qu1, Bo Chen1,*, and Xin Yuan3,*
Author Affiliations
  • 1National Key Laboratory of Radar Signal Processing, Xidian University, Xi’an, China
  • 2School of Telecommunications Engineering, Xidian University, Xi’an, China
  • 3School of Engineering, Westlake University, Hangzhou, China
DOI: 10.3788/AI.2025.10021
Jianqiao Sun, Yudi Su, Hao Zhang, Ziheng Cheng, Zequn Zeng, Zhengjue Wang, Chunhui Qu, Bo Chen, and Xin Yuan, "SnapCap: efficient snapshot compressive scene captioning," Adv. Imaging 2, 011003 (2025).
    Fig. 1. Comparing our efficient captioning pipeline in (c) with the traditional (multi-stage) pipeline in (a) and a potential two-stage solution in (b), indicated by red, blue, and yellow, respectively.
    Fig. 2. Comparisons on GPU memory, inference time, and CIDEr score of typical VC methods, where red, blue, and yellow indicate our method, traditional multi-stage VC methods, and two-stage methods, respectively. The size of the circle is proportional to the CIDEr score (↑) marked in brackets.
    Fig. 3. An illustration of a typical video snapshot CS system, CACTI[12].
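For reference, the sketch below simulates the snapshot compressive sensing forward model that such a system implements optically: B mask-modulated frames are summed into a single 2D measurement, the operation referred to as Eq. (1) in the algorithms later on this page. This is a minimal NumPy illustration under that assumed model; the shapes, random masks, and function name are illustrative, not the paper's code.

```python
# Minimal NumPy sketch of the snapshot-CS forward model assumed here:
# B frames are modulated element-wise by B binary masks and summed into one
# 2D coded measurement. Shapes, masks, and names are illustrative only.
import numpy as np

def simulate_measurement(frames: np.ndarray, masks: np.ndarray) -> np.ndarray:
    """frames, masks: (B, H, W) arrays; returns the (H, W) coded measurement."""
    assert frames.shape == masks.shape
    return (masks * frames).sum(axis=0)   # modulate each frame, then sum over time

# Example with B = 8 frames of size 256 x 256 and random binary masks.
rng = np.random.default_rng(0)
B, H, W = 8, 256, 256
frames = rng.random((B, H, W)).astype(np.float32)
masks = (rng.random((B, H, W)) > 0.5).astype(np.float32)
measurement = simulate_measurement(frames, masks)   # shape (256, 256)
```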
    Fig. 4. Learning and inference workflows of our proposed SnapCap. The cooperation of (a)–(c) is for training, and only (b) is needed for an end-to-end captioning during testing.
    Fig. 5. Qualitative results on the MSRVTT dataset. We exhibit the compressed measurement, predicted caption by SnapCap, and the ground truth annotations. For a better understanding, we also show the ground truth video frames.
    Fig. 6. Qualitative results on the MSVD dataset. We exhibit the compressed measurement, predicted caption by SnapCap, and the ground truth annotations. For a better understanding, we also show the ground truth video frames.
Fig. 7. Caption quality (in terms of CIDEr value) comparison of different methods at multiple compression ratios (Refs. [8,16,23,31]).
Fig. 8. Comparison of captioning results (our model's prediction and the two-stage model's prediction) on two real color data sets. The top row shows Ball Rotate and the bottom row shows Hammer. For better understanding, we also plot the reconstructed results of STFormer (top) and BIRNAT (bottom).
Fig. 9. Comparison of captioning results (our model's prediction and the two-stage model's prediction) on four real grayscale data sets: from top to bottom, Domino, Hand, Pendulum, and Water Balloon. For better understanding, we also plot the reconstructed results of STFormer.
Method | Input modality | MSRVTT (B / M / R / C) | MSVD (B / M / R / C)
Video frame-based methods
Recent[23] | V^a | 39.1 / 26.6 / 59.3 / 42.7 | 52.3 / 34.1 / 69.8 / 80.3
SGN[8] | V + M | 40.8 / 28.3 / 60.8 / 49.5 | 52.8 / 35.5 / 72.9 / 94.3
HMN[30] | V + M + D | 43.5 / 29.0 / 62.7 / 51.5 | 59.2 / 37.7 / 75.1 / 104.0
CoCap[44] | V | 43.1 / 29.8 / 62.7 / 56.2 | 55.9 / 39.9 / 76.8 / 113.0
CoCap[44] | V (ViT-L/14) | 44.1 / 30.3 / 63.4 / 57.2 | 60.1 / 41.4 / 78.2 / 121.5
RSFD[31] | V + M + A | 43.4 / 29.3 / 62.3 / 53.1 | 51.2 / 35.7 / 72.9 / 96.7
IcoCap[43] | CLIP features | 47.0 / 31.1 / 64.9 / 60.2 | 59.1 / 39.5 / 76.5 / 110.3
Our TeaCap^b | V | 45.6 / 30.6 / 63.9 / 58.3 | 56.1 / 39.2 / 76.7 / 114.9
Coded measurement-based methods
Our SnapCap | Coded measurement | 44.2 / 30.1 / 63.2 / 56.7 | 54.9 / 38.2 / 75.4 / 108.9
Our SnapCap (ViT-L/14) | Coded measurement | 47.2 / 31.1 / 65.1 / 60.5 | 60.3 / 40.9 / 78.8 / 117.1
    Table 1. A Comparison of Proposed Efficient Measurement-Based Captioning and Different Video-Based VC Methods on MSRVTT and MSVD.
Data: Distribution over videos p(T).
Input: Masks {C_k}_{k=1}^B, loss coefficients α and β, a pre-trained language encoder PLM(·).
Output: Trained parameters for the encoder f(·,·), student model S(·), projector Proj(·), and language decoder Dec(·).
1.  For epoch = 1, 2, …, 30 do
2.   Randomly sample a video T_i ~ p(T) and sample B video frames {X_k}_{k=1}^B from T_i;
3.   Simulate the coded measurement Y with the masks {C_k}_{k=1}^B as in Eq. (1);
4.   Input the measurement Y and masks {C_k}_{k=1}^B to the encoder f(·,·) and then the decoder g(·,·) to obtain X̂ as in Eq. (10);
5.   Update the parameters of the encoder f(·,·) and decoder g(·,·) through the regularization loss as in Eq. (10).
6.  End
7.  For epoch = 1, 2, …, 30 do
8.   Randomly sample a video T_i and generate the coded measurement Y from its B video frames {X_k}_{k=1}^B with the masks {C_k}_{k=1}^B as in Eq. (1);
9.   Input the measurement Y and masks {C_k}_{k=1}^B to the encoder f(·,·) to get the latent representation f_latent as in Eq. (3);
10.  Input the latent representation of the measurement to the student model S(·) to get the feature maps f_conv^s as in Eq. (4) and the visual embedding f^s as in Eq. (6);
11.  Input the video frames {X_k}_{k=1}^B to the teacher model to obtain the feature maps f_conv^t as in Eq. (2) and the visual embedding f^t as in Eq. (5);
12.  Compute the distillation loss L_dis as in Eqs. (7)-(9);
13.  Input the visual embedding f^s to the projector Proj(·) to obtain t as in Eq. (11);
14.  Input the ground-truth annotation to PLM(·) and generate the predicted caption word by word through the language decoder Dec(·) as in Eqs. (12)-(14);
15.  Update the parameters of the encoder f(·,·), student model S(·), projector Proj(·), and language decoder Dec(·).
16. End
Algorithm 1. Training Stage
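The two training phases above can be laid out roughly as in the following PyTorch-style sketch. Every module is a toy stand-in for f(·,·), g(·,·), S(·), Proj(·), Dec(·), the frozen teacher, and PLM(·); the architectures, tensor shapes, the MSE form of the distillation terms, and the placement of the coefficients α and β are assumptions for illustration, not the authors' implementation.

```python
# A compact PyTorch-style sketch of the two training phases in Algorithm 1.
# Every module below is a toy stand-in for f(.,.), g(.), S(.), Proj(.), Dec(.),
# the frozen teacher, and PLM(.); architectures, tensor shapes, the distillation
# terms, and the placement of alpha/beta are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, H, W, D, V, L = 8, 64, 64, 128, 1000, 12   # frames per measurement, frame size, feature dim, vocab, caption length

encoder   = nn.Conv2d(1 + B, B, 3, padding=1)   # f(Y, {C_k}) -> latent, Eq. (3)
decoder   = nn.Conv2d(B, B, 3, padding=1)       # g(.) -> reconstructed frames, Eq. (10)
student   = nn.Conv2d(B, D, 3, padding=1)       # S(.) -> feature maps, Eq. (4)
teacher   = nn.Conv2d(B, D, 3, padding=1)       # frozen teacher applied to true frames, Eq. (2)
projector = nn.Linear(D, D)                     # Proj(.), Eq. (11)
lang_dec  = nn.Linear(D, V)                     # crude stand-in for Dec(.), Eqs. (12)-(14)
alpha, beta = 1.0, 1.0                          # loss coefficients from the Input line (values assumed)

def measurement(frames, masks):                 # coded measurement, Eq. (1)
    return (masks * frames).sum(dim=1)

# Dummy batch standing in for a sampled video T_i ~ p(T) and its annotation.
frames = torch.rand(2, B, H, W)
masks = (torch.rand(2, B, H, W) > 0.5).float()
tokens = torch.randint(0, V, (2, L))

# Phase 1 (steps 1-6): pre-train encoder/decoder with the regularization loss.
opt1 = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
for epoch in range(30):
    y = measurement(frames, masks)
    latent = encoder(torch.cat([y.unsqueeze(1), masks], dim=1))
    loss_reg = F.mse_loss(decoder(latent), frames)          # reconstruction used as regularization
    opt1.zero_grad(); loss_reg.backward(); opt1.step()

# Phase 2 (steps 7-16): distillation from the frozen teacher plus caption supervision.
params = (list(encoder.parameters()) + list(student.parameters())
          + list(projector.parameters()) + list(lang_dec.parameters()))
opt2 = torch.optim.Adam(params, lr=1e-4)
for epoch in range(30):
    y = measurement(frames, masks)
    latent = encoder(torch.cat([y.unsqueeze(1), masks], dim=1))       # Eq. (3)
    fmap_s = student(latent); emb_s = fmap_s.mean(dim=(2, 3))         # Eqs. (4), (6)
    with torch.no_grad():
        fmap_t = teacher(frames); emb_t = fmap_t.mean(dim=(2, 3))     # Eqs. (2), (5)
    loss_dis = F.mse_loss(fmap_s, fmap_t) + F.mse_loss(emb_s, emb_t)  # stand-in for Eqs. (7)-(9)
    prefix = projector(emb_s)                                         # Eq. (11)
    logits = lang_dec(prefix).unsqueeze(1).expand(-1, L, -1)          # repeats one logit vector per step (toy decoder)
    loss_cap = F.cross_entropy(logits.reshape(-1, V), tokens.reshape(-1))
    loss = alpha * loss_dis + beta * loss_cap
    opt2.zero_grad(); loss.backward(); opt2.step()
```

The point of the sketch is the control flow rather than the modules themselves: a reconstruction-supervised warm-up of the encoder, followed by joint distillation and captioning with the teacher kept frozen.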
Compressed input | Dec./Rec. method^a | Cap. method^b | GPU memory (GB) | Dec./Rec. time (ms) | Cap. time (ms) | Total time (ms) | MSRVTT (B / M / R / C)
Encoded video by H.264 | ffmpeg | HMN[30] | 11.1 | 42 | 1430 | 1472 | 43.5 / 29.0 / 62.7 / 51.5
Encoded video by H.264 | ffmpeg | RSFD[31] | 10.1 | 43 | 1476 | 1519 | 43.4 / 29.3 / 62.3 / 53.1
Encoded video by H.264 | – | CoCap[44]^c | 5.20 | – | 387 | 387 | 43.1 / 29.8 / 62.7 / 56.2
Coded measurement^d | BIRNAT[5] | TeaCap | 6.2 | 456 | 185 | 641 | 41.4 / 27.6 / 60.7 / 47.9
Coded measurement^d | STFormer[6] | TeaCap | 17.0 | 825 | 183 | 1008 | 41.7 / 28.6 / 61.3 / 50.8
Coded measurement^d | EfficientSCI[7] | TeaCap | 12.8 | 619 | 182 | 801 | 41.3 / 28.8 / 61.6 / 50.8
Coded measurement^d | – | Our SnapCap | 5.70 | – | 197 | 197 | 44.2 / 30.1 / 63.2 / 56.7
    Table 2. Comparison of GPU Memory, Inference Time, and Captioning Results Using Different Strategies.
Data: Coded measurement Y and masks {C_k}_{k=1}^B.
Input: Trained encoder f(·,·), student model S(·), projector Proj(·), and language decoder Dec(·).
Output: Predicted captions.
1.  Input the measurement Y and masks {C_k}_{k=1}^B to the encoder f(·,·) to get the latent representation f_latent as in Eq. (3).
2.  Input the latent representation of the measurement to the student model S(·) to get the visual embedding f^s as in Eq. (6).
3.  Input the visual embedding f^s to the projector Proj(·) to obtain t as in Eq. (11).
4.  Generate the predicted caption word by word through the language decoder Dec(·) as in Eq. (14).
Algorithm 2. Inference Stage
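Continuing the toy modules from the training sketch above, inference reduces to a single reconstruction-free forward pass from the coded measurement to caption tokens; the greedy argmax below is an illustrative stand-in for the word-by-word decoding of Eq. (14).

```python
# Inference sketch matching Algorithm 2. It assumes the toy modules and the
# measurement() helper from the training sketch are in scope. A single coded
# measurement is mapped straight to caption tokens with no reconstruction step.
with torch.no_grad():
    y = measurement(frames, masks)                                # or a real captured snapshot
    latent = encoder(torch.cat([y.unsqueeze(1), masks], dim=1))   # Eq. (3)
    emb_s = student(latent).mean(dim=(2, 3))                      # Eq. (6)
    prefix = projector(emb_s)                                     # Eq. (11)
    token_ids = lang_dec(prefix).argmax(dim=-1)                   # one greedy token per sample (toy stand-in for Eq. (14))
print(token_ids.shape)   # (batch,) indices into the toy vocabulary
```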
L_reg | L_dis | B | M | R | C
× | × | 24.7 | 21.7 | 52.0 | 16.8
✓ | × | 32.1 | 22.6 | 55.6 | 29.3
× | ✓ | 33.0 | 24.9 | 57.0 | 31.6
✓ | ✓ | 44.2 | 30.1 | 63.2 | 56.7
Table 3. Contributions of Regularization Loss and Distillation Loss on the MSRVTT Dataset.^a