
Advanced Imaging, Vol. 1, Issue 2, 021002 (2024)
1. Introduction
Image and video compression algorithms have been developed for about 30 years, yet state-of-the-art coders are still mainly based on the Moving Picture Experts Group (MPEG) structure[1], which was originally developed for broadcasting. Thirty years ago, videos were usually captured and produced in studios, so the encoding could afford to be slow and expensive, while the decoder at the customer's end had to be of low complexity since it sat in every family home (in the television set). Nowadays, image and video codecs run on mobile platforms such as cellphones and drones, where images and videos can be captured anywhere, anytime, given enough power. Moreover, the devices (e.g., cellphones, laptops, and desktops) we now use to decode image and video streams have orders of magnitude more computational power than those of decades ago. It is therefore natural for image and video compression to evolve toward the combination of a low-cost encoder and a (possibly) computationally heavy decoder. Low-complexity encoders have been studied in the information and communication theory literature under the topic of distributed source coding (DSC)[2]. A low-cost encoder is desired because we aim for applications on resource-limited robotic platforms such as drones.
In this paper, we consider image and video codecs on mobile platforms with tight constraints on battery, computation, and bandwidth. In particular, we highlight drones and robotics as representative applications. In these use cases, only the low-cost encoder needs to run in real time on the mobile platform; decoding can happen after transmission on other platforms such as edge servers[3] or the cloud. Since most of these mobile platforms run on standalone batteries, power saved in the encoder can extend the running time of other sensors and motion modules on drones or robots, which is of significant interest in extreme cases such as lunar rovers and other military applications.
Bearing this concern in mind, we propose an ultralow-complexity image and video codec using block modulation, dubbed the block-modulating video compression (BMVC) codec. The underlying principle of BMVC is to mask the high-resolution image (via a predefined binary random coding pattern composed of {0,1}) and then decompose it into small (modulated) blocks. These blocks are summed into a single block and quantized as the compressed signal to be transmitted. Since no multiplication is involved in this encoding process, the complexity of the BMVC encoder is far lower than that of MPEG-based codecs. Moreover, the summation over image blocks modulated by binary masks can essentially be implemented as additions of pixel readouts according to a predefined look-up table.
Therefore, the computation cost of the BMVC encoder is minimal compared with transform-based (DCT, DWT) compression algorithms. Here, the mask pattern plays the role of a basis or key during compression and is predeployed on the mobile platform. Without loss of generality, we use a random binary pattern, with each element being 1 or 0 with equal probability, stored as a look-up table on the mobile platform. Note that only a single encoding mask is generated to encode all images and videos.
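To make these steps concrete, the following minimal NumPy sketch implements the masking, blocking, and summation just described. The function and variable names (`bmvc_encode`, `bx`, `by`) and the 108 × 128 block size are our own illustrative choices, not specifications from the paper.

```python
import numpy as np

def bmvc_encode(image, mask, bx, by):
    """Mask the image, cut it into bx-by blocks, and sum the blocks."""
    nx, ny = image.shape
    assert mask.shape == (nx, ny) and nx % bx == 0 and ny % by == 0
    modulated = image * mask  # binary modulation; in hardware this is a pixel select
    # Rearrange into a stack of non-overlapping blocks, then sum over the stack.
    blocks = modulated.reshape(nx // bx, bx, ny // by, by).swapaxes(1, 2)
    return blocks.reshape(-1, bx, by).sum(axis=0)  # one compressed bx-by block

rng = np.random.default_rng(0)
mask = (rng.random((1080, 1920)) < 0.5).astype(np.uint8)   # single fixed binary mask
frame = rng.integers(0, 256, size=(1080, 1920)).astype(np.float64)
y = bmvc_encode(frame, mask, 108, 128)  # B = (1080*1920)/(108*128) = 150 blocks
```

With a 108 × 128 block, a 1080 × 1920 frame is compressed into a single block of that size, i.e., a compression ratio of 150, the highest ratio evaluated later in this paper.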
1.1. BMVC pipeline: key idea
Figure 1 shows the basic principle of the proposed BMVC encoder, where the input is a raw image captured by the sensor, e.g., a charge-coupled device (CCD) or complementary metal–oxide semiconductor (CMOS) camera, with a spatial resolution of $n_x \times n_y$ pixels (1080 × 1920 for the HD videos considered in this paper).
Figure 1. Pipeline of the proposed BMVC encoder. (a) Each input image of size $n_x \times n_y$ is masked, divided into blocks, summed into a single block, and quantized.
Both the high definition (HD) image and the mask are divided into nonoverlapping blocks of size $b_x \times b_y$, and each image block is modulated by its corresponding mask block.
On the decoder side, after receiving the encoded block, BMVC decoding algorithms are employed to reconstruct the original HD image, provided the masks used in the encoder are known a priori. In this paper, we consider two reconstruction algorithms: the plug-and-play (PnP)[4–7] optimization algorithm (BMVC-PnP) with a deep denoising neural network in each iteration[8], and an end-to-end convolutional neural network (BMVC-E2E) for real-time video decoding.
The proposed BMVC pipeline is frame-independent, meaning that each image or frame (from a video sequence) can be independently decoded without knowing its previous image or frame.
While it is true that in video compression applications further exploitation of temporal redundancy can significantly boost the compression ratio, the extra computation of temporal differences between frames inevitably increases the encoding cost. Our goal here is to minimize the computation cost of the encoder hardware so that resources can be saved for other modules on the robotic platform.
Therefore, we do not consider temporal processing in our current BMVC pipeline. However, it is worth mentioning that any type of temporal processing is compatible and can be added to the existing BMVC pipeline for a higher compression ratio at the cost of increased encoder complexity.
1.2. Contributions
We want to emphasize that BMVC is not designed to replace the current codec standard but to provide an alternative option for platforms under extreme power/computation constraints.
Here we summarize the contributions of this work:
2. Background Knowledge
Similar to other codecs, the proposed BMVC pipeline is composed of an encoder and a decoder. As shown in Fig. 1(a), the encoder consists of blocking, masking, summation, and quantization.
It is worth noting that the blocking and masking steps can be switched; specifically, a mask of size $n_x \times n_y$ can modulate the whole image first, after which the modulated image is divided into blocks, yielding exactly the same compressed measurement.
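A quick numerical check of this equivalence, reusing `frame` and `mask` from the earlier sketch (the `blockize` helper is again our own naming):

```python
import numpy as np

def blockize(a, bx, by):
    """Cut a 2D array into a (B, bx, by) stack of non-overlapping blocks."""
    nx, ny = a.shape
    return a.reshape(nx // bx, bx, ny // by, by).swapaxes(1, 2).reshape(-1, bx, by)

# Mask-then-block (as in Fig. 1) versus block-then-mask: identical measurements.
y1 = blockize(frame * mask, 108, 128).sum(axis=0)
y2 = (blockize(frame, 108, 128) * blockize(mask.astype(np.float64), 108, 128)).sum(axis=0)
assert np.allclose(y1, y2)
```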
One underlying principle of the BMVC encoder is compressive sensing (CS)[9,10], where a small number of measurements can be used to reconstruct the full signal with a higher dimension under some mild conditions. Specifically, BMVC shares the same spirit with snapshot compressive imaging (SCI)[11]. In the following sections, we review the basic idea of SCI and deep-learning (DL) methods for CS inversion, i.e., the reconstruction process.
2.1. SCI
SCI utilizes a 2D detector to capture high-dimensional (usually 3D) data in a single compressed snapshot.
It was first developed to capture high-speed videos (spatiotemporal data cubes) with a relatively low-speed camera[12–14] or to capture hyperspectral images (spatiospectral data cubes) in a single shot[15,16] among other applications[17–22]. Generally, the fundamental idea of SCI is to modulate each low-dimensional slice (frames or spectral channels) from an underlying high-dimensional data structure with a different mask and then sum these modulated slices into a single low-dimensional measurement.
In our proposed BMVC encoding process, as shown in Fig. 1, we can interpret the forward model in a slightly different way, following the same spirit as SCI. Consider the image blocks in BMVC as the low-dimensional slices in the SCI modality, with the modulation masks corresponding to the SCI masking scheme. The BMVC-modulated blocks are then summed into a single block measurement, which corresponds to the compressed measurement in SCI. From this perspective, the encoding process of BMVC is essentially the same as that of SCI; however, there is a key difference between the two. In SCI, the frames in the video sequence or the spectral channels in the hyperspectral data cube are strongly correlated, as they share highly similar spatial features, whereas in BMVC each block corresponds to a different portion of the image, and these blocks are not necessarily correlated. This makes BMVC decoding more challenging than in SCI modalities. Though the decoding task poses a big challenge, in this paper we show that with advanced DL-based CS inversion algorithms, high-resolution images can still be faithfully decoded from the highly ill-posed, low-cost BMVC encoder.
2.2. DL for SCI inversion
The inverse problem of CS is ill posed, as there are more unknowns to be estimated than known measurements. Toward this end, different priors have been employed as regularizers to help solve the CS inverse problem. Widely used priors include sparsity[23], piece-wise smoothness (e.g., total variation, TV), and low-rankness of similar image patches.
Implicit regularizations have also been explored using standard denoising algorithms (such as NLM[24] and BM3D[25]) as PnP priors[26]. Other algorithms have also been used to solve the SCI reconstruction, such as TwIST[27], GAP-TV[28], and DeSCI[29].
Recently, DL has been used for solving inverse problems in computational imaging systems for high-quality and high-speed reconstruction. Specifically, existing DL-based inversion algorithms can be categorized into three different classes[11]: (1) end-to-end convolutional neural networks (E2E-CNNs) for high-speed reconstruction[14,16,30–33], (2) deep unfolding/unrolling networks[34–40] with interpretability, and (3) PnP algorithms[41,42] using pretrained denoising networks as implicit priors to regularize inverse problems.
For the BMVC decoding case considered in this work, due to the large scale of the data, we employ two decoding algorithms: BMVC-PnP and BMVC-E2E. Regarding the data dimension, we consider HD data of size 1080 × 1920.
3. BMVC
In this section, we elaborate on the mathematical details of the BMVC pipeline. In particular, we use gray-scale images as an example to derive the mathematical model. As mentioned before, BMVC is ready to handle RGB color images and videos, where we can conduct BMVC on all RGB channels or YUV (or YCbCr) channels. In our experiments, we have further found that decent decoding results can be obtained by only performing BMVC on the Y channel and a simple downsampling/upsampling can be used for the U and V channels.
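A hedged sketch of this color handling, reusing `bmvc_encode` from the earlier snippet; the 4× chroma downsampling factor and the OpenCV-based resizing are our assumptions, as the paper does not specify them:

```python
import cv2
import numpy as np

def bmvc_encode_color(rgb, mask, bx, by):
    """BMVC on the Y channel only; U and V are merely downsampled."""
    yuv = cv2.cvtColor(rgb, cv2.COLOR_RGB2YUV)
    y, u, v = cv2.split(yuv)
    y_code = bmvc_encode(y.astype(np.float64), mask, bx, by)
    u_ds = cv2.resize(u, None, fx=0.25, fy=0.25, interpolation=cv2.INTER_AREA)
    v_ds = cv2.resize(v, None, fx=0.25, fy=0.25, interpolation=cv2.INTER_AREA)
    return y_code, u_ds, v_ds

def upsample_chroma(u_ds, v_ds, nx, ny):
    """Decoder side: bicubic interpolation brings U and V back to full size."""
    u = cv2.resize(u_ds, (ny, nx), interpolation=cv2.INTER_CUBIC)
    v = cv2.resize(v_ds, (ny, nx), interpolation=cv2.INTER_CUBIC)
    return u, v
```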
Let $\mathbf{X} \in \mathbb{R}^{n_x \times n_y}$ denote the input image and $\mathbf{M} \in \{0,1\}^{n_x \times n_y}$ the predefined binary mask. Both are divided into $B$ nonoverlapping blocks $\{\mathbf{X}_b\}_{b=1}^{B}$ and $\{\mathbf{M}_b\}_{b=1}^{B}$ of size $b_x \times b_y$, and each image block is modulated by its mask block,
$$\tilde{\mathbf{X}}_b = \mathbf{M}_b \odot \mathbf{X}_b, \quad b = 1, \dots, B, \tag{1}$$
where $\odot$ denotes the element-wise (Hadamard) product.
Note that in practice, Eq. (1) can be performed by summations according to a look-up table rather than any multiplication. As a result, the actual computational pipeline of the BMVC encoder requires only additions and no multiplications. Detailed analysis of the BMVC complexity can be found in Sec. 4.3.
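The look-up-table idea can be sketched as follows: offline, record for every output pixel the flat indices of the input pixels whose mask bit is 1; online, the encoder only sums the corresponding readouts. This is a sketch under our own index layout, not the paper's implementation:

```python
import numpy as np

def build_lut(mask, bx, by):
    nx, ny = mask.shape
    idx = np.arange(nx * ny).reshape(nx, ny)
    to_blocks = lambda a: a.reshape(nx // bx, bx, ny // by, by).swapaxes(1, 2).reshape(-1, bx, by)
    idx_b, m_b = to_blocks(idx), to_blocks(mask).astype(bool)
    # lut[i][j]: flat indices of all input pixels summed into output pixel (i, j)
    return [[idx_b[:, i, j][m_b[:, i, j]] for j in range(by)] for i in range(bx)]

def encode_with_lut(image, lut):
    flat = image.ravel()
    # Pure additions over precomputed index lists: no multiplication anywhere.
    return np.array([[flat[ij].sum() for ij in row] for row in lut])
```

The output of `encode_with_lut` matches the block-wise encoder exactly while using additions only.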
Generally, the number of blocks $B$ sets the compression ratio ($\mathrm{Cr} = B$): larger blocks yield fewer blocks and thus a lower Cr, and vice versa, so the block size can be chosen to meet a target bandwidth.
3.1. Compressed measurement
After blocking and binary modulation, the next step is to sum all these modulated blocks to yield a single compressed measurement, i.e.,
$$\mathbf{Y} = \sum_{b=1}^{B} \tilde{\mathbf{X}}_b = \sum_{b=1}^{B} \mathbf{M}_b \odot \mathbf{X}_b. \tag{3}$$
This single $b_x \times b_y$ block $\mathbf{Y}$ is the compressed signal; the compression ratio is therefore $\mathrm{Cr} = n_x n_y / (b_x b_y) = B$.
Before sending the measurement, the last step is bit quantization, which imposes an additional quantization error on the actual compressed signal $\mathbf{Y}$.
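A simple uniform quantizer for the summed block might look as follows; the rescaling convention (normalizing by the worst-case value $B \times 255$) is our assumption:

```python
import numpy as np

def quantize(block, n_blocks, n_bits=10, pixel_max=255.0):
    """Map the summed block onto n_bits integer codes; return codes and step size."""
    levels = 2 ** n_bits - 1
    step = n_blocks * pixel_max / levels   # worst-case range is [0, B * pixel_max]
    return np.rint(block / step).astype(np.uint32), step

def dequantize(codes, step):
    return codes.astype(np.float64) * step
```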
3.2. Forward model in vectorized formulation
For the sake of describing the reconstruction algorithm employed in the decoder, we hereby introduce the vectorized formulation of Eq. (3). Let $\mathbf{y} = \operatorname{vec}(\mathbf{Y}) \in \mathbb{R}^{b_x b_y}$ denote the vectorized measurement, $\mathbf{x}_b = \operatorname{vec}(\mathbf{X}_b)$ the vectorized $b$-th image block, and $\mathbf{x} = [\mathbf{x}_1^\top, \dots, \mathbf{x}_B^\top]^\top \in \mathbb{R}^{n_x n_y}$ the block-wise vectorized image.
After vectorization, Eq. (3) can be reformulated as $\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{e}$, where $\mathbf{e}$ accounts for the quantization error. When the quantization error $\mathbf{e}$ is small enough to be neglected, this simplifies to
$$\mathbf{y} = \mathbf{H}\mathbf{x}. \tag{11}$$
It is worth noting that Eq. (11) has the formulation of CS, but with a sensing matrix of the special form $\mathbf{H} = [\mathbf{D}_1, \dots, \mathbf{D}_B] \in \mathbb{R}^{b_x b_y \times n_x n_y}$, i.e., a concatenation of $B$ diagonal matrices $\mathbf{D}_b = \operatorname{diag}(\operatorname{vec}(\mathbf{M}_b))$; $\mathbf{H}$ is therefore extremely sparse and highly structured.
Though Eq. (11) contains a matrix multiplication, the encoding process itself involves only masking (implemented via a look-up table) and summation, as in Eq. (3). Note that, as mentioned in the introduction, only a single full-sized mask needs to be predefined and stored for both the encoder and the decoder. This mask can be reused across consecutive frames of a video sequence. It can also be designed for different user applications or for encryption purposes, in which case only an authorized party with access to the mask can decode the video; the compressed measurement is effectively encrypted. This is another benefit of BMVC.
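For small sanity checks, the structured sensing matrix can be materialized explicitly with SciPy, although the decoders below never form $\mathbf{H}$; each block contributes one diagonal matrix holding its mask bits:

```python
import numpy as np
from scipy import sparse

def to_blocks(a, bx, by):
    nx, ny = a.shape
    return a.reshape(nx // bx, bx, ny // by, by).swapaxes(1, 2).reshape(-1, bx * by)

def build_H(mask, bx, by):
    # H = [diag(m_1), diag(m_2), ..., diag(m_B)], shape (bx*by, nx*ny)
    return sparse.hstack([sparse.diags(m.astype(np.float64))
                          for m in to_blocks(mask, bx, by)], format="csr")

# Sanity check: H applied to the block-wise vectorized image equals the encoder output.
# x = to_blocks(frame, 108, 128).ravel()
# assert np.allclose(build_H(mask, 108, 128) @ x, bmvc_encode(frame, mask, 108, 128).ravel())
```

A useful consequence of this structure is that $\mathbf{H}\mathbf{H}^\top$ is diagonal (its entries simply count the mask ones at each block position), which the decoders below exploit.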
3.3. PnP-based decoder of BMVC
So far, we have shown that the BMVC encoder has an extremely low complexity, which makes it a good fit for resource-limited mobile platforms. On the other hand, the decoding process of BMVC becomes highly challenging due to the huge dimensionality mismatch (up to 150× in our experiments) between the compressed measurement and the desired image.
Here we first discuss an optimization-based decoder that utilizes the PnP[6,7] algorithm. When the decoder receives the encoded image block from the encoder, the goal is to reconstruct the desired image, provided the modulation masks. Following the formulation in Eq. (11), the decoding is an ill-posed problem, and thus, similar to CS, priors need to be employed in the optimization,
$$\hat{\mathbf{x}} = \arg\min_{\mathbf{x}} \tfrac{1}{2}\|\mathbf{y} - \mathbf{H}\mathbf{x}\|_2^2 + \lambda R(\mathbf{x}), \tag{12}$$
where $R(\mathbf{x})$ is the prior (regularization) term and $\lambda$ balances the two terms.
Since the specific structure of the BMVC encoding operator $\mathbf{H}$ makes $\mathbf{H}\mathbf{H}^\top$ a diagonal matrix, the data-fidelity (projection) step in each iteration admits a closed-form, element-wise solution.
To be concrete, the BMVC-PnP is an iterative decoder. Starting from an initial estimate $\hat{\mathbf{x}}^{(0)} = \mathbf{H}^\top \mathbf{y}$, each iteration alternates between a linear projection that enforces consistency with the compressed measurement,
$$\mathbf{x}^{(k+1)} = \hat{\mathbf{x}}^{(k)} + \mathbf{H}^\top \left(\mathbf{H}\mathbf{H}^\top\right)^{-1}\left(\mathbf{y} - \mathbf{H}\hat{\mathbf{x}}^{(k)}\right), \tag{13}$$
and a denoising step $\hat{\mathbf{x}}^{(k+1)} = \mathcal{D}_{\sigma_k}\left(\mathbf{x}^{(k+1)}\right)$ using a pretrained CNN denoiser as an implicit prior.
Due to the diagonal structure of $\mathbf{H}\mathbf{H}^\top$, the inverse in Eq. (13) reduces to an element-wise division, so each projection costs only a few operations per pixel even at HD resolution.
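A compact sketch of this iterative decoder in the GAP/PnP style; `denoise` stands in for the pretrained FFDNet, and the decreasing $\sigma$ schedule shown is illustrative rather than the paper's exact choice:

```python
import numpy as np

def blockize(img, bx, by):
    nx, ny = img.shape
    return img.reshape(nx // bx, bx, ny // by, by).swapaxes(1, 2).reshape(-1, bx, by)

def unblock(blocks, nx, ny, bx, by):
    return blocks.reshape(nx // bx, ny // by, bx, by).swapaxes(1, 2).reshape(nx, ny)

def bmvc_pnp(y, mask, bx, by, denoise, n_iters=60):
    nx, ny = mask.shape
    m = blockize(mask.astype(np.float64), bx, by)   # (B, bx, by)
    R = np.maximum(m.sum(axis=0), 1.0)              # diagonal of H H^T
    x = m * (y / R)                                 # initialization: H^T (H H^T)^-1 y
    for k in range(n_iters):
        # Linear projection onto {x : Hx = y}; element-wise thanks to diagonal H H^T.
        x = x + m * ((y - (m * x).sum(axis=0)) / R)
        img = denoise(unblock(x, nx, ny, bx, by), sigma=(50 / 255) * 0.97 ** k)
        x = blockize(img, bx, by)
    return unblock(x, nx, ny, bx, by)
```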
Figure 2. PnP optimization-based decoding algorithm for BMVC-PnP. The encoded image block along with the modulation binary masks are fed into the BMVC-PnP decoder as inputs. The BMVC-PnP iteratively performs a linear projection step to account for the BMVC encoding process and a DL-based denoising step as an implicit prior. We use a pretrained FFDNet[43] as the denoising CNN for its flexibility and robustness against various noise levels.
Though the CNN-based FFDNet is very efficient for denoising, the BMVC-PnP decoder is still an iterative algorithm and thus cannot provide real-time results. For instance, in the experimental results, we let the BMVC-PnP decoder run for 60 iterations while gradually decreasing the denoiser's noise level $\sigma$.
3.4. End-to-end CNN decoder of BMVC
To address the speed issue and enable real-time BMVC applications, we present a second BMVC decoder that employs an end-to-end CNN architecture for faster BMVC decoding. In the following, we term this CNN-based end-to-end decoder BMVC-E2E.
In order to make the BMVC-E2E robust and interpretable, we design the feed-forward CNN architecture by unrolling the PnP optimization framework into a few stages[47], as shown in Fig. 3. Each stage contains a linear projection operator to account for the BMVC forward encoding process and a CNN serving as an implicit regularizer. Note that though the BMVC-E2E follows the structure of an unrolled optimization, it has two unique features that distinguish it from the PnP approach. First, unlike the CNN in the PnP approach, which is independently trained as an ad hoc denoiser, the BMVC-E2E decoder is trained end to end (all stages jointly) to perform direct decoding. Second, the whole purpose of the BMVC-E2E decoder is to accelerate inference toward real-time decoding: it needs only a few stages (usually 2–3 are enough), while the BMVC-PnP generally requires tens of iterations (60 in our experiments).
Figure 3.E2E neural-network-based decoding algorithm for BMVC-E2E. The encoded image block along with the modulation binary masks are fed into the BMVC-E2E decoder as inputs. The feed-forward BMVC-E2E decoder consists of several stages, where each stage contains a linear projection step and a convolutional neural network. All BMVC-E2E decoders are trained in an E2E fashion. We use 2D-U-Net and 3D-CNN with reversible blocks (RevSCI) to facilitate memory-efficient training.
The linear projection step follows the identical derivation as in Eq. (13). While the general structure of deep unfolding is easy to understand, designing the network in each stage for efficient BMVC decoding is extremely challenging. As mentioned before, unlike other SCI applications, in BMVC each block is not necessarily correlated with the others, so it is important to extract nonlocal information across blocks during decoding. Toward this end, after extensive experiments, we found that a 3D-CNN architecture is powerful enough to conduct this task by extracting information across nonlocal blocks. However, this introduces a runtime challenge, since a 3D-CNN model usually needs 12× to 25× longer than its 2D-CNN counterpart.
Therefore, a trade-off between speed and quality has to be made. To balance decoding quality and running time, we conducted extensive experiments on 2D- and 3D-CNN structures and on how many stages to unroll the network into, and identified a few key observations for fast, high-quality decoding. First, the BMVC-E2E decoder needs at least two stages: the first stage mainly performs an initial inpainting of the missing pixels and generates blurry results, while additional stages are essential to retrieve high-resolution features. Second, 3D-CNNs are more efficient at recovering fine features from nonlocal blocks than 2D structures, especially at high Crs, where we add an extra 3D-CNN stage.
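A skeletal PyTorch version of this unrolled design is given below; each stage pairs the linear projection of Eq. (13) with a small CNN. A plain residual convolution stack stands in for the paper's U-Net and RevSCI 3D-CNN stages, so this shows the wiring only, not the published architecture:

```python
import torch
import torch.nn as nn

def blockize(img, bx, by):                 # (nx, ny) -> (B, bx, by)
    nx, ny = img.shape
    return img.reshape(nx // bx, bx, ny // by, by).permute(0, 2, 1, 3).reshape(-1, bx, by)

def unblock(x, nx, ny, bx, by):            # (B, bx, by) -> (nx, ny)
    return x.reshape(nx // bx, ny // by, bx, by).permute(0, 2, 1, 3).reshape(nx, ny)

class Stage(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 1, 3, padding=1))

    def forward(self, img, y, m, R, bx, by):
        nx, ny = img.shape
        x = blockize(img, bx, by)
        x = x + m * ((y - (m * x).sum(0)) / R)          # linear projection step, Eq. (13)
        img = unblock(x, nx, ny, bx, by)
        return img + self.cnn(img[None, None])[0, 0]    # residual CNN refinement

class BMVC_E2E(nn.Module):
    def __init__(self, n_stages=2):
        super().__init__()
        self.stages = nn.ModuleList(Stage() for _ in range(n_stages))

    def forward(self, y, mask, nx, ny, bx, by):
        m = blockize(mask.float(), bx, by)
        R = m.sum(0).clamp(min=1.0)                     # diagonal of H H^T
        img = unblock(m * (y / R), nx, ny, bx, by)      # H^T (H H^T)^-1 y as init
        for stage in self.stages:
            img = stage(img, y, m, R, bx, by)
        return img
```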
4. Evaluation Results
We consider HD video with a spatial resolution of 1080 × 1920 pixels. The peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM)[51] are employed as metrics to evaluate the decoded images. For RGB videos, we first transform them to YUV, conduct BMVC encoding and decoding on the Y channel, and only downsample the U and V channels in the encoder. In the decoder, the BMVC-PnP or BMVC-E2E algorithm is used for the Y channel, and bicubic interpolation is used for the U and V channels. Metrics are calculated only on the restored Y channel against the ground truth, which isolates the effect of the BMVC pipeline from the color-channel interpolations. We benchmark the BMVC pipeline along with other compression methods on static frames from the UVG data set[52] and other standard images. Exemplar test images are shown in Fig. 4, covering diverse scenes.
Figure 4. Test data set (set 13) we used to evaluate the BMVC pipeline and other compression methods.
Examples of decoded videos using the BMVC pipeline at a wide range of Crs are presented in the Supplementary Materials (Video 1 and Video 2). Decoded gray-scale and RGB videos show the flexibility of BMVC in terms of video format.
4.1. BMVC decoder evaluation at various Crs
Since the compression ratio of BMVC is determined by the block size, we test a range of block sizes for the HD video; the corresponding Crs, from 24 to 150, are listed in Table 1. The average PSNR and SSIM of the decoded images (with the two BMVC decoders) on the test set are shown in Table 1, along with other related image compression methods detailed in the next subsections.
| Cr (= B) | 150 | 120 | 100 | 80 | 72 | 60 | 50 | 40 | 32 | 24 |
|---|---|---|---|---|---|---|---|---|---|---|
| BMVC-PnP | 22.89, 0.682 | 24.32, 0.707 | 25.97, 0.745 | 27.30, 0.781 | 27.96, 0.797 | 28.83, 0.822 | | | | |
| BMVC-E2E | | | | | | | 29.08, 0.839 | 29.23, 0.843 | 29.98, 0.854 | 31.05, 0.871 |
| Random DS | 8.95, 0.354 | 9.56, 0.401 | 10.10, 0.431 | 10.88, 0.460 | 11.33, 0.472 | 12.17, 0.491 | 13.30, 0.516 | 14.87, 0.555 | 16.61, 0.609 | 18.58, 0.690 |
| Block CS | 26.93 | 27.65 | 27.88 | 28.20 | 28.70, 0.857 | 29.38 | 29.83, 0.870 | 30.70, 0.876 | 31.79, 0.884 | |
| JPEG2000 | 30.30, 0.852 | 30.94, 0.862 | 31.42, 0.869 | 31.99, 0.877 | 32.80, 0.888 | 33.41, 0.897 | 33.84, 0.902 | 34.60, 0.911 | 36.45, 0.931 | 37.51, 0.941 |

Table 1. PSNR (in dB, First Value in Each Cell) and SSIM (Second Value) Performance for Different Compression Methods at a Wide Range of Crs on the HD Image of Size 1080 × 1920.
Selected decoded images at representative Crs using BMVC-PnP and BMVC-E2E are presented in Fig. 5, where we can see that, over a wide range of Crs, both decoders consistently provide decent decoded images with fine details. As expected, the resolution of BMVC results starts to degrade as Cr increases; this trend is visible in the text in the zoomed-in panels of the "bookcase" and "jockey" examples in Fig. 5. In addition, the reconstruction artifacts differ between BMVC-PnP and BMVC-E2E due to the different network structures used.
Figure 5. Decoded image results at various Crs with the proposed BMVC-PnP and BMVC-E2E approaches. The BMVC-E2E results consistently have good decoding quality at both low and high Crs. The BMVC-PnP decoder provides higher image quality for low Crs while producing some denoising artifacts at high Crs.
4.2. BMVC versus other compression methods
We further compare the BMVC pipeline with other image and video compression algorithms: random downsampling (random DS), block-wise CS (block CS), and JPEG2000 compression.
We summarize the decoding results of the various methods in Table 1 and Fig. 7. Note that the small peak in the BMVC-E2E PSNR plot (Fig. 7, red) arises because the BMVC-E2E decoders use an additional 3D-CNN stage at high Crs. The SSIM metrics of the BMVC pipeline and the other methods (Table 1) follow a similar trend to the PSNR plot. At all Cr levels, JPEG2000 performs best, but at the price of a much higher encoding cost. BMVC-PnP provides a higher PSNR than BMVC-E2E at low Crs, while BMVC-E2E is more robust at high Crs.
Figure 6. Comparison of the BMVC pipeline with other image compression methods: random DS, block CS, and JPEG2000 compression. For the random DS and block CS experiments, we implemented their decoders based on the PnP algorithm with FFDNet as the flexible denoiser. Results are shown at a representative low Cr.
Figure 7. PSNR performance of different compression methods at a wide range of Crs. PSNR is computed on Y channels only. The BMVC-E2E curve shows a PSNR increase at high Crs, where an additional 3D-CNN stage is used.
4.3. Computation cost: encoder and decoder
A key advantage of the BMVC pipeline and other CS-based methods over transform-based (DCT, DWT) image compression is the ultralow-cost encoder, which suits resource-limited mobile platforms. Here we evaluate the computation cost of the encoders of the compression methods discussed above, as summarized in Table 2. The computation cost is evaluated by the number of additions and multiplications required to encode a single image of 1080 × 1920 pixels.
For the BMVC pipeline, every input pixel whose mask bit is 1 contributes exactly one addition to the compressed block, and the binary masking itself amounts to a pixel selection. With a mask whose entries are 1 with probability 0.5, encoding one frame therefore costs about $n_x n_y / 2$ additions on average and no multiplications at all.
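A back-of-the-envelope check of these counts (our own estimates under a Bernoulli(0.5) mask):

```python
# Expected encoder operation counts for one 1080 x 1920 frame.
nx, ny = 1080, 1920

bmvc_adds = nx * ny // 2   # one addition per pixel whose mask bit is 1 (on average)
bmvc_muls = 0              # binary masking is a pixel selection, not a multiplication

random_ds_adds = random_ds_muls = 0   # pure subsampling: no arithmetic at all

print(f"BMVC: ~{bmvc_adds:.2e} additions, {bmvc_muls} multiplications")
```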
The random DS approach has the lowest possible encoder complexity: it involves no calculation at all, but simply reads out a subset of pixel values from the detector. Such an encoding pipeline requires neither additions nor multiplications.
For block CS, each block is compressed by multiplying it with a sensing matrix, so the encoder requires both multiplications and additions, with counts proportional to the block size times the number of measurements per block.
JPEG2000 compression is based on the wavelet transform with a flexible basis size. The encoding process of JPEG2000 consists of color space conversion, color channel downsampling, wavelet transform, quantization, and entropy coding. For a fair comparison, we analyze the most computationally intense bulk of the encoder: the 2D wavelet transforms on small image blocks (Y channel only). Even restricted to this part, each decomposition level requires several additions and multiplications per pixel, and multiple levels are typically applied.
Note that JPEG has a comparable encoder cost, obtained from the JPEG2000 analysis by fixing the block size to 8 × 8 and replacing the DWT with the block DCT.
| Codec | Encoder cost | Dynamic range of measurement | Mask/basis | Decoder | Comment |
|---|---|---|---|---|---|
| BMVC | # additions: one per masked pixel (about $n_x n_y/2$); # multiplications: 0 | B | single predefined binary mask | PnP: iterative network; E2E: feed-forward network | |
| Random DS | # additions: 0; # multiplications: 0 | 1 | | PnP: iterative network | |
| Block CS | # additions and # multiplications: one sensing-matrix product per block | | | PnP: iterative network | |
| JPEG2000 | # additions and # multiplications: 2D DWT per block | 1 | flexible | | discrete wavelet transform (DWT) |

Table 2. Computation Cost of Different Compression Methods.
4.4. Robustness to quantization bits
In real-world applications, the encoded data must be quantized before being sent to the receiver. High-dynamic-range data are prone to degradation after bit quantization. As shown in the "dynamic range" column of Table 2, our BMVC-encoded blocks range between 0 and up to $B \times 255$ (for 8-bit input pixels), a much larger dynamic range than that of the original image.
To test how robust the BMVC pipeline is to quantization, we applied 8- to 16-bit quantization to the encoded/compressed blocks before sending them to the two BMVC decoders and the block CS decoder. Decoded results are evaluated by the PSNR metric. Table 3 and Fig. 8 illustrate how BMVC and block CS perform under different quantization bit depths. As expected, the two BMVC decoders are robust to quantization: the spread in PSNR is generally below 0.1 dB, with the only exception being BMVC-PnP at Cr = 24 (0.16 dB).
| Bit / Cr | BMVC-PnP 100 | BMVC-PnP 80 | BMVC-PnP 50 | BMVC-PnP 24 | BMVC-E2E 100 | BMVC-E2E 80 | BMVC-E2E 50 | BMVC-E2E 24 | Block CS 100 | Block CS 80 | Block CS 50 | Block CS 24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8-bit | 25.921 | 27.274 | 29.829 | 33.451 | 27.785 | 28.796 | 29.060 | 31.023 | 27.190 | 27.049 | 27.361 | 29.563 |
| 10-bit | 25.981 | 27.305 | 29.882 | 33.592 | 27.805 | 28.813 | 29.082 | 31.056 | 27.634 | 27.957 | 29.361 | 31.931 |
| 12-bit | 25.981 | 27.301 | 29.889 | 33.611 | 27.803 | 28.818 | 29.087 | 31.059 | 27.800 | 28.123 | 30.090 | 33.132 |
| 14-bit | 25.982 | 27.303 | 29.892 | 33.613 | 27.802 | 28.819 | 29.088 | 31.060 | 27.813 | 28.132 | 30.099 | 33.149 |
| 16-bit | 25.982 | 27.304 | 29.879 | 33.576 | 27.803 | 28.818 | 29.088 | 31.060 | 27.814 | 28.133 | 30.102 | 33.151 |
| ΔPSNR↓ | 0.0606 | 0.0315 | 0.0629 | 0.1627 | 0.0202 | 0.0230 | 0.0282 | 0.0370 | | | | |

All values are PSNRs in dB; column headers give the decoder and the Cr.
Table 3. PSNRs of BMVC-PnP, BMVC-E2E, and Block CS under Different Quantization Bits.
Figure 8. Evaluation of robustness to quantization bits. BMVC and block CS both show high PSNR performance when the dynamic range of the data is intact. In practice, quantization will affect the codec performance in real-world video signal transmission. The bar plots indicate how the three decoders (BMVC-PnP, BMVC-E2E, and block CS) perform under different quantization bits. BMVC decoders have consistent performance regardless of data quantization. However, block CS has noticeable decreases in PSNR at 10-bit and 8-bit quantization.
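This experiment can be reproduced in the spirit of the earlier sketches; `bmvc_encode`, `quantize`, `dequantize`, and `bmvc_pnp` refer to those snippets, and any competent CNN denoiser can stand in for FFDNet:

```python
import numpy as np

def psnr(ref, rec, peak=255.0):
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

y_full = bmvc_encode(frame, mask, 108, 128)        # unquantized compressed block (Cr = 150)
for n_bits in (8, 10, 12, 14, 16):
    codes, step = quantize(y_full, n_blocks=150, n_bits=n_bits)
    rec = bmvc_pnp(dequantize(codes, step), mask, 108, 128, denoise)
    print(n_bits, round(psnr(frame, rec), 3))
```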
4.5. Ablation study of BMVC-E2E
We conduct an ablation study of the BMVC-E2E decoder structure to identify the design that best balances decoder runtime and quality. Using the same test set, we compare several decoder structures in Table 4; adding 3D-CNN (RevSCI) stages after the 2D U-Nets raises the PSNR from below 18 dB to above 31 dB.
| Decoder structure | PSNR (dB) | Runtime |
|---|---|---|
| 2 × U-Net + 2 × 3D-CNN (RevSCI) + 1 × deeper 3D-CNN (RevSCI) | 31.720 | |
| 2 × U-Net + 2 × 3D-CNN (RevSCI) | 31.449 | |
| 2 × U-Net | 17.930 | |
| 1 × U-Net | 10.308 | |

Table 4. Ablation Study of BMVC-E2E.
5. Discussion and Conclusions
In recent years, DL-based codecs have made significant advancements. The deep contextual video compression (DCVC) framework, in particular, has shown promising results. However, comparing it with BMVC reveals notable differences in application scenarios and in respective strengths and weaknesses. DCVC is a sophisticated conditional coding framework that leverages temporal context features as conditional inputs to enhance both encoding and decoding. By utilizing these temporal context features, DCVC taps into the potential of conditional coding to improve coding efficiency and reconstruction quality; in particular, the high-dimensional features help preserve high-frequency details in reconstructed videos, enhancing visual fidelity. However, integrating high-dimensional features and conditional coding mechanisms imposes higher computational demands. Specifically, in the encoding process, the latest DCVC work, DCVC-FM, first employs an optical flow network to estimate motion vectors, which are then used to extract the temporal context for both encoding and decoding. In contrast, BMVC adopts a frame-independent pipeline with the primary objective of minimizing the computation cost on the encoder hardware. BMVC's encoding can be implemented as additions of pixel readouts according to a predefined look-up table, by which BMVC achieves robust performance across diverse video content while maintaining real-time encoding, crucial for applications with stringent computational constraints.
To conclude, we have proposed a brand-new BMVC codec with an ultralow-cost encoder, well suited to resource-limited platforms such as robots and drones. We have also developed two BMVC decoders, based on PnP optimization and E2E neural networks. The BMVC-PnP decoder is an iterative algorithm with proved convergence; it is also flexible and robust to different compression ratios. The BMVC-E2E decoder unrolls the PnP structure into very few stages, and its feed-forward nature enables real-time decoding of 1080p image sequences on a single GPU.
Unlike traditional image and video compression algorithms that embed prior knowledge in the encoding process via optimal basis selection, the BMVC pipeline takes a different design philosophy: we fully utilize prior knowledge only in the decoding process while keeping the encoding process as simple as possible. As a result, the computation cost of the encoder is kept as low as possible, saving power, computation, and bandwidth on resource-limited platforms. On the other hand, the decoding process of BMVC is ill posed and therefore requires strong prior knowledge about natural images to reliably retrieve the desired frames. This is achieved via either a PnP framework with an image-denoising CNN or an E2E neural network trained on massive data sets.
One aspect we have not explored is using temporal information across frames to further compress the data stream. State-of-the-art video codecs do take advantage of the temporal redundancy of video data. We chose to make the BMVC pipeline a frame-independent codec for three reasons: low complexity, low latency, and constant bandwidth. Computing differential signals between frames costs extra power, which runs against our original purpose in developing BMVC. Processing groups of frames would also increase the latency of the end-to-end pipeline. The third advantage of the frame-independent BMVC pipeline is its constant bit rate, in contrast to the content-dependent bandwidth of MPEG-based video codecs.
Recall again that BMVC is not designed to replace current codec standards but to provide an alternative for platforms under extreme power/computation constraints. In fact, we have implemented the BMVC encoding pipeline on a drone platform with a simple Raspberry Pi development board. Future work will focus on building the end-to-end BMVC pipeline to achieve optimal performance (both speed and quality) and minimal latency. With this ultralow-cost encoder design and two options for BMVC decoders, we envision the BMVC pipeline could be a revolutionary technique for video signal compression and transmission on robotic platforms.
References
[3] W. Shi et al. Edge computing: vision and challenges. IEEE Internet Things J., 3, 637(2016).
[5] E. K. Ryu et al. Plug-and-play methods provably converge with properly trained denoisers, 5546(2019).
[6] S. V. Venkatakrishnan, C. A. Bouman, B. Wohlberg. Plug-and-play priors for model-based reconstruction, 945(2013).
[8] X. Yuan et al. Plug-and-play algorithms for large-scale snapshot compressive imaging, 1447(2020).
[10] D. L. Donoho. Compressed sensing. IEEE Trans. Inf. Theory, 52, 1289(2006).
[12] P. Llull et al. Coded aperture compressive temporal imaging. Opt. Express, 21, 10526(2013).
[13] X. Yuan et al. Low-cost compressive sensing for color video and depth, 3318(2014).
[14] M. Qiao et al. Deep learning for video compressive sensing. APL Photonics, 5, 030801(2020).
[16] Z. Meng, J. Ma, X. Yuan. End-to-end low cost compressive spectral imaging with spatial-spectral self-attention, 187(2020).
[18] P. Llull et al. Image translation for single shot focal tomography. Optica, 2, 822(2015).
[19] T.-H. Tsai et al. Spectral-temporal compressive imaging. Opt. Lett., 40, 4054(2015).
[22] Y. Sun, X. Yuan, S. Pang. Compressive high-speed stereo imaging. Opt. Express, 25, 18182(2017).
[24] A. Buades, B. Coll, J.-M. Morel. A non-local algorithm for image denoising, 60(2005).
[28] X. Yuan. Generalized alternating projection based total variation minimization for compressive sensing, 2539(2016).
[30] X. Miao et al. λ-net: Reconstruct hyperspectral images from a snapshot measurement, 4059(2019).
[31] Z. Cheng et al. Memory-efficient network for large-scale video compressive sensing, 16246(2021).
[34] J. Ma et al. Deep tensor ADMM-Net for snapshot compressive imaging, 10223(2019).
[35] L. Wang et al. Hyperspectral image reconstruction using a deep spatial-spectral prior, 8024(2019).
[36] Y. Li et al. End-to-end video compressive sensing using Anderson-accelerated unrolled networks, 1(2020).
[37] T. Huang et al. Deep Gaussian scale mixture prior for spectral compressive imaging, 16216(2021).
[38] K. Gregor, Y. LeCun. Learning fast approximations of sparse coding, 399(2010).
[39] Y. Yang et al. Deep ADMM-Net for compressive sensing MRI. Advances in Neural Information Processing Systems, 29, 10(2016).
[40] C. A. Metzler, A. Mousavi, R. G. Baraniuk. Learned D-AMP: principled neural network based compressive image recovery, 1770(2017).
[47] J. R. Hershey, J. L. Roux, F. Weninger. Deep unfolding: model-based inspiration of novel deep architectures(2014).
[48] O. Ronneberger, P. Fischer, T. Brox. U-net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention (MICCAI), 9351, 234(2015).
[49] S. Nah, T. Hyun Kim, K. Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring, 3883(2017).
[50] F. Perazzi et al. A benchmark dataset and evaluation methodology for video object segmentation, 724(2016).
[52] A. Mercat, M. Viitanen, J. Vanne. UVG dataset: 50/120 fps 4k sequences for video codec analysis and development, 297(2020).
[55] J. Zhang, B. Ghanem. ISTA-Net: interpretable optimization-inspired deep network for image compressive sensing, 1828(2018).
[57] L. Gan. Block compressed sensing of natural images, 403(2007).
