• Acta Photonica Sinica
  • Vol. 52, Issue 4, 0428002 (2023)
Wensheng FAN, Fan LIU*, and Ming LI
Author Affiliations
  • College of Data Science, Taiyuan University of Technology, Jinzhong 030600, China
    DOI: 10.3788/gzxb20235204.0428002
    Wensheng FAN, Fan LIU, Ming LI. Remote Sensing Image Fusion Based on Two-branch U-shaped Transformer[J]. Acta Photonica Sinica, 2023, 52(4): 0428002

    Abstract

    Multi-spectral images are key references for earth observation. However, capturing rich spectral information comes at the cost of limited spatial resolution in multi-spectral imaging. To overcome the trade-off between spatial resolution and spectral resolution in remote sensing, panchromatic images, which offer high spatial resolution but poor spectral information, are adopted to complement multi-spectral imagery. As a result, the technique of fusing high-resolution panchromatic images and low-resolution multi-spectral images, namely pan-sharpening, has been developed and facilitates various remote sensing applications. Existing pan-sharpening methods can be roughly divided into four main categories, each with its own fusion strategy: component substitution, multi-resolution analysis, variational optimization, and deep learning. Recently, a number of deep-learning-based methods have been developed and achieve superior fusion quality. These methods are typically based on convolutional neural networks, and some incorporate the idea of generative adversarial networks. However, inadequate extraction of global contextual and multi-scale features often leads to a loss of spectral information and spatial details. To solve this problem, a two-branch U-shaped transformer is proposed in this paper. Firstly, the multi-spectral and panchromatic images to be fused are partitioned into non-overlapping patches of fixed size, and each patch is embedded into a vector. The embedding vectors have the same feature dimension and contain the rich spectral and spatial information of the image patches. Subsequently, the embedding vectors of the multi-spectral and panchromatic images are fed into the two branches of the transformer encoder to extract hierarchical feature representations, respectively. The encoder consists of shifted-window transformer blocks and patch merging layers, so it can fully extract global and multi-scale features.
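The patch partition and embedding step described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the learned linear embedding is stood in for by a fixed random projection, and the 64x64 inputs, 4-band multi-spectral image, patch size 4, and embedding dimension 128 are toy values chosen to match the hyper-parameters reported later in the abstract.

```python
import numpy as np

def patch_embed(img, patch_size=4, embed_dim=128, seed=0):
    """Partition an (H, W, C) image into non-overlapping patches and
    linearly project each flattened patch to an embed_dim vector."""
    H, W, C = img.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "image size must be divisible by patch size"
    # (H, W, C) -> (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape((H // P) * (W // P), P * P * C)
    # a fixed random projection stands in for the learned embedding layer
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((P * P * C, embed_dim)) / np.sqrt(P * P * C)
    return patches @ proj

ms = np.zeros((64, 64, 4))   # toy 4-band multi-spectral image
pan = np.zeros((64, 64, 1))  # toy single-band panchromatic image
print(patch_embed(ms).shape, patch_embed(pan).shape)  # (256, 128) (256, 128)
```

Note how both branches end up with token sequences of the same shape even though the inputs have different channel counts, which is what allows the panchromatic features to be injected into the multi-spectral branch at each encoder stage.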
In the encoding process, hierarchical panchromatic feature representations are injected into the multi-spectral feature representations to obtain hierarchical fused feature representations. In addition, the high-level features are further fused through a transformer-based bottleneck. The transformer decoder progressively up-samples the high-level fused feature representation via patch expanding layers and suppresses redundant features via feature compression layers. In the decoding process, the hierarchical representations from the encoder are aggregated with the high-level fused feature representation via skip connections to avoid information loss. Finally, the decoder produces a high-resolution fused feature representation, and rearrangement and transposed convolution operations reconstruct the desired high-resolution multi-spectral image from the embedded patches. To validate the effectiveness of the proposed method, extensive experiments are conducted on three datasets acquired by the Gaofen-2, QuickBird, and WorldView-3 satellites. Since a ground-truth high-resolution multi-spectral image does not exist, the multi-spectral and panchromatic images are spatially degraded according to Wald's protocol, so that the original multi-spectral images can serve as reference images to supervise the training of the proposed network. The network is trained for 500 epochs with an AdamW optimizer, using the mean absolute error between the reference image and the fusion result as the loss function. To evaluate the fusion results, four full-reference indices are adopted for testing at reduced resolution, and one no-reference index, together with its spectral distortion and spatial distortion components, is used for testing at full resolution.
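The reduced-resolution training setup under Wald's protocol can be sketched as follows. This is an illustrative assumption-laden sketch, not the paper's code: block averaging stands in for whatever low-pass filtering and decimation the authors use, and the resolution ratio of 4 and the image sizes are made-up toy values.

```python
import numpy as np

def degrade(img, ratio=4):
    """Spatial degradation by block averaging -- a simple stand-in for the
    low-pass filtering and decimation prescribed by Wald's protocol."""
    H, W, C = img.shape
    return img.reshape(H // ratio, ratio, W // ratio, ratio, C).mean(axis=(1, 3))

def mae(pred, ref):
    """Mean absolute error, used here as the training loss."""
    return float(np.abs(pred - ref).mean())

rng = np.random.default_rng(0)
ms_orig = rng.random((64, 64, 4))     # original MS -> serves as the reference
pan_orig = rng.random((256, 256, 1))  # original PAN (4x the MS resolution)

ms_lr = degrade(ms_orig)    # reduced-resolution MS input,  shape (16, 16, 4)
pan_lr = degrade(pan_orig)  # reduced-resolution PAN input, shape (64, 64, 1)

# the network is trained so that fuse(ms_lr, pan_lr) minimizes mae(. , ms_orig);
# a perfect fusion result would drive the loss to zero:
print(mae(ms_orig, ms_orig))  # 0.0
```

The key point of the protocol is that both inputs are degraded by the same ratio, so the original multi-spectral image sits at exactly the resolution the fusion result should reach and can supervise training directly.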
The feature dimension of the embedding vectors and the size of the partitioned patches are important hyper-parameters that affect both the performance and the computational complexity of the proposed method, so several model variants are built to observe their impact. The variant with an embedding dimension of 192 and a patch size of 4 achieves the best fusion results; however, the improvement is limited relative to its much higher computational cost. Therefore, the embedding dimension is set to 128 and the patch size to 4 in this paper. Subsequently, the proposed method is compared with eight widely used fusion methods to verify its effectiveness. With the original multi-spectral image as the reference, the methods are first compared on images acquired by the three satellites at reduced resolution. The visual results and the residual maps between the fusion results and the reference image show that the proposed method obtains the best visual quality and the smallest errors. In the quantitative evaluation, the proposed method also achieves the best results on all objective indices for all three test sets. Next, all the methods are compared on the original images acquired by the three satellites at full resolution. The visual result of the proposed method shows better preservation of both spectral information and spatial details than those of the other methods. Quantitatively, the proposed method obtains the best results on all metrics for the Gaofen-2 data, and on the QuickBird and WorldView-3 data it shows better values than the other methods on the spatial and overall indices.
In conclusion, the reduced-resolution and full-resolution experimental results on the three datasets demonstrate that the proposed method outperforms the other methods in terms of both subjective visual quality and quantitative metrics.