Author Affiliations
1 National Key Laboratory of Radar Signal Processing, Xidian University, Xi’an, China
2 School of Telecommunications Engineering, Xidian University, Xi’an, China
3 School of Engineering, Westlake University, Hangzhou, China
Fig. 1. Comparing our efficient captioning pipeline in (c) with the traditional (multi-stage) pipeline in (a) and a potential two-stage solution in (b), indicated by red, blue, and yellow, respectively.
Fig. 2. Comparison of GPU memory, inference time, and CIDEr score among typical VC methods, where red, blue, and yellow indicate our method, traditional multi-stage VC methods, and two-stage methods, respectively. The size of each circle is proportional to the CIDEr score (↑) marked in brackets.
Fig. 3. An illustration of a typical video snapshot CS system, CACTI [12].
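In a CACTI system, each of the B high-speed frames is modulated by a distinct binary mask, and the modulated frames are summed into a single 2D snapshot. The following minimal NumPy sketch illustrates this forward model (cf. Eq. (1)); the function name, array shapes, and the omission of measurement noise are illustrative assumptions, not the implementation used in this paper.

```python
# Minimal sketch of simulating a CACTI coded measurement (cf. Eq. (1)).
# Assumption: frames and masks share the shape (B, H, W); noise is omitted.
import numpy as np

def simulate_measurement(frames: np.ndarray, masks: np.ndarray) -> np.ndarray:
    """frames: (B, H, W) video frames; masks: (B, H, W) binary modulation masks."""
    assert frames.shape == masks.shape
    # Element-wise modulation followed by summation over the temporal axis.
    return (frames * masks).sum(axis=0)

# Example: compress B = 8 frames of size 256 x 256 into one snapshot.
B, H, W = 8, 256, 256
frames = np.random.rand(B, H, W).astype(np.float32)
masks = (np.random.rand(B, H, W) > 0.5).astype(np.float32)
measurement = simulate_measurement(frames, masks)  # shape (H, W)
```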
Fig. 4. Learning and inference workflows of our proposed SnapCap. Modules (a)–(c) cooperate during training, while only (b) is needed for end-to-end captioning during testing.
Fig. 5. Qualitative results on the MSRVTT dataset. We show the compressed measurement, the caption predicted by SnapCap, and the ground-truth annotations. For better understanding, we also show the ground-truth video frames.
Fig. 6. Qualitative results on the MSVD dataset. We show the compressed measurement, the caption predicted by SnapCap, and the ground-truth annotations. For better understanding, we also show the ground-truth video frames.
Fig. 7. Caption quality (in terms of CIDEr value) comparison of different methods under multiple compression ratios (Refs. [8,16,23,31]).
Fig. 8. Comparison of captioning results (our model's prediction and the two-stage model's prediction) on two real color data samples. The top row corresponds to Ball Rotate and the bottom row to Hammer. For better understanding, we also plot the reconstruction results of STFormer (top part) and BIRNAT (bottom part).
Fig. 9. Comparison of captioning results (our model's prediction and the two-stage model's prediction) on four real grayscale data samples. From top to bottom: Domino, Hand, Pendulum, and Water Balloon. For better understanding, we also plot the reconstruction results of STFormer.
| Method | Input modality | MSRVTT (B / M / R / C) | MSVD (B / M / R / C) |
| --- | --- | --- | --- |
| *Video frame-based methods* | | | |
| RecNet [23] | V^a | 39.1 / 26.6 / 59.3 / 42.7 | 52.3 / 34.1 / 69.8 / 80.3 |
| SGN [8] | V + M | 40.8 / 28.3 / 60.8 / 49.5 | 52.8 / 35.5 / 72.9 / 94.3 |
| HMN [30] | V + M + D | 43.5 / 29.0 / 62.7 / 51.5 | 59.2 / 37.7 / 75.1 / 104.0 |
| CoCap [44] | V | 43.1 / 29.8 / 62.7 / 56.2 | 55.9 / 39.9 / 76.8 / 113.0 |
| CoCap [44] | V (ViT-L/14) | 44.1 / 30.3 / 63.4 / 57.2 | 60.1 / 41.4 / 78.2 / 121.5 |
| RSFD [31] | V + M + A | 43.4 / 29.3 / 62.3 / 53.1 | 51.2 / 35.7 / 72.9 / 96.7 |
| IcoCap [43] | CLIP features | 47.0 / 31.1 / 64.9 / 60.2 | 59.1 / 39.5 / 76.5 / 110.3 |
| Our TeaCap^b | V | 45.6 / 30.6 / 63.9 / 58.3 | 56.1 / 39.2 / 76.7 / 114.9 |
| *Coded measurement-based methods* | | | |
| Our SnapCap | Coded measurement | 44.2 / 30.1 / 63.2 / 56.7 | 54.9 / 38.2 / 75.4 / 108.9 |
| Our SnapCap (ViT-L/14) | Coded measurement | 47.2 / 31.1 / 65.1 / 60.5 | 60.3 / 40.9 / 78.8 / 117.1 |
Table 1. A Comparison of the Proposed Efficient Measurement-Based Captioning and Different Video-Based VC Methods on MSRVTT and MSVD. B, M, R, and C denote BLEU-4, METEOR, ROUGE-L, and CIDEr, respectively.
Data: Distribution over video frames.
Input: Masks, loss coefficients, and a pre-trained language encoder.
Output: Trained parameters of the encoder, student model, projector, and language decoder.
1. For epoch = 1, 2, …, 30 do
2. Randomly sample a video and sample video frames from it;
3. Simulate the coded measurement with the masks as in Eq. (1);
4. Input the measurement and masks to the encoder and then the decoder to obtain the reconstruction as in Eq. (10);
5. Update the parameters of the encoder and decoder through the regularization loss as in Eq. (10).
6. End
7. For epoch = 1, 2, …, 30 do
8. Randomly sample a video and generate the coded measurement from its video frames with the masks as in Eq. (1);
9. Input the measurement and masks to the encoder to get the latent representation as in Eq. (3);
10. Input the latent representation of the measurement to the student model to get the feature maps as in Eq. (4) and the visual embedding as in Eq. (6);
11. Input the video frames to the teacher model to obtain the feature maps as in Eq. (2) and the visual embedding as in Eq. (5);
12. Compute the distillation loss as in Eqs. (7)–(9);
13. Input the visual embedding to a projector to obtain the projected embedding as in Eq. (11);
14. Input the ground truth annotation to the pre-trained language encoder and generate the predicted caption word by word through the language decoder as in Eqs. (12)–(14);
15. Update the parameters of the encoder, student model, projector, and language decoder.
16. End
Algorithm 1. Training Stage
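To make the two training stages concrete, a minimal PyTorch-style sketch of Algorithm 1 is given below. All module interfaces (encoder, decoder, student, teacher, projector, lang_decoder), the concrete loss forms, and the loss weights lambda_kd and lambda_cap are illustrative assumptions; the actual objectives are defined by Eqs. (1)–(14) of the paper.

```python
# Hedged, minimal sketch of the two training stages in Algorithm 1 (not the authors' code).
# Assumed interfaces: encoder(measurement, masks) -> latent; decoder(latent) -> frames;
# student(latent) -> (feature_maps, embedding); teacher(frames) -> (feature_maps, embedding);
# projector(embedding) -> prefix; lang_decoder(prefix, token_ids) -> logits.
import torch
import torch.nn.functional as F

def stage1_step(encoder, decoder, frames, masks, optimizer):
    """Stage 1 (steps 1-6): pre-train encoder/decoder with the reconstruction regularization loss."""
    measurement = (frames * masks).sum(dim=1)      # simulate the coded snapshot, Eq. (1)
    latent = encoder(measurement, masks)           # latent representation, Eq. (3)
    recon = decoder(latent)                        # reconstruction, Eq. (10)
    loss = F.mse_loss(recon, frames)               # assumed MSE form of the regularization loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def stage2_step(encoder, student, teacher, projector, lang_decoder,
                frames, masks, caption_ids, optimizer,
                lambda_kd=1.0, lambda_cap=1.0):
    """Stage 2 (steps 7-16): distillation from the frame-based teacher plus the captioning loss."""
    measurement = (frames * masks).sum(dim=1)      # Eq. (1)
    latent = encoder(measurement, masks)           # Eq. (3)
    s_maps, s_embed = student(latent)              # Eqs. (4) and (6)
    with torch.no_grad():                          # the teacher is frozen
        t_maps, t_embed = teacher(frames)          # Eqs. (2) and (5)
    # Distillation on intermediate feature maps and final embeddings (assumed MSE/cosine forms).
    kd_loss = F.mse_loss(s_maps, t_maps) + (1.0 - F.cosine_similarity(s_embed, t_embed, dim=-1).mean())
    prefix = projector(s_embed)                    # projected embedding, Eq. (11)
    logits = lang_decoder(prefix, caption_ids)     # teacher-forced decoding, Eqs. (12)-(14)
    # Captioning loss; target shifting is omitted here for brevity.
    cap_loss = F.cross_entropy(logits.flatten(0, 1), caption_ids.flatten())
    loss = lambda_kd * kd_loss + lambda_cap * cap_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```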
| Compressed input | Dec./Rec. method^a | Cap. method^b | GPU memory (GB) | Dec./Rec. time (ms) | Cap. time (ms) | Total time (ms) | MSRVTT (B / M / R / C) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Encoded video by H.264 | ffmpeg | HMN [30] | 11.1 | 42 | 1430 | 1472 | 43.5 / 29.0 / 62.7 / 51.5 |
| Encoded video by H.264 | ffmpeg | RSFD [31] | 10.1 | 43 | 1476 | 1519 | 43.4 / 29.3 / 62.3 / 53.1 |
| Encoded video by H.264 | — | CoCap [44]^c | 5.2 | 0 | 387 | 387 | 43.1 / 29.8 / 62.7 / 56.2 |
| Coded measurement^d | BIRNAT [5] | TeaCap | 6.2 | 456 | 185 | 641 | 41.4 / 27.6 / 60.7 / 47.9 |
| Coded measurement | STFormer [6] | TeaCap | 17.0 | 825 | 183 | 1008 | 41.7 / 28.6 / 61.3 / 50.8 |
| Coded measurement | EfficientSCI [7] | TeaCap | 12.8 | 619 | 182 | 801 | 41.3 / 28.8 / 61.6 / 50.8 |
| Coded measurement | — | Our SnapCap | 5.7 | 0 | 197 | 197 | 44.2 / 30.1 / 63.2 / 56.7 |
Table 2. Comparison of GPU Memory, Inference Time, and Captioning Results Using Different Strategies.
Data: Coded measurement and masks.
Input: Trained encoder, student model, projector, and language decoder.
Output: Predicted captions.
1. Input the measurement and masks to the encoder to get the latent representation as in Eq. (3).
2. Input the latent representation of the measurement to the student model to get the visual embedding as in Eq. (6).
3. Input the visual embedding to the projector to obtain the projected embedding as in Eq. (11).
4. Generate the predicted caption word by word through the language decoder as in Eq. (14).
Algorithm 2. Inference Stage
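A minimal sketch of the corresponding end-to-end inference is given below, assuming greedy word-by-word decoding. The module interfaces mirror the training sketch above, and the BOS/EOS token ids and maximum caption length are illustrative assumptions rather than the settings used in this paper.

```python
# Hedged sketch of Algorithm 2: coded measurement -> visual prefix -> caption tokens.
import torch

@torch.no_grad()
def caption_measurement(measurement, masks, encoder, student, projector, lang_decoder,
                        bos_id: int, eos_id: int, max_len: int = 20):
    latent = encoder(measurement, masks)          # latent representation, Eq. (3)
    _, embed = student(latent)                    # visual embedding, Eq. (6)
    prefix = projector(embed)                     # projected embedding, Eq. (11)
    tokens = [bos_id]
    for _ in range(max_len):                      # greedy word-by-word generation, Eq. (14)
        inp = torch.tensor(tokens, device=prefix.device).unsqueeze(0)
        logits = lang_decoder(prefix, inp)        # assumed output shape: (1, t, vocab)
        next_id = int(logits[0, -1].argmax())
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]                             # predicted caption token ids (BOS removed)
```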
| Reg. loss | Dist. loss | B | M | R | C |
| --- | --- | --- | --- | --- | --- |
| × | × | 24.7 | 21.7 | 52.0 | 16.8 |
| √ | × | 32.1 | 22.6 | 55.6 | 29.3 |
| × | √ | 33.0 | 24.9 | 57.0 | 31.6 |
| √ | √ | 44.2 | 30.1 | 63.2 | 56.7 |
Table 3. Contributions of the Regularization Loss and Distillation Loss on the MSRVTT Dataset.^a