Combining Convolutional Attention Module and Convolutional Auto-encoder for Detail Injection Remote Sensing Image Fusion

Ming LI; Fan LIU; Jingzhi LI

doi:10.3788/gzxb20225106.0610005

Abstract

Earth observation satellites capture both panchromatic and multispectral images. Panchromatic images usually have high spatial resolution but low spectral resolution, whereas multispectral images have low spatial resolution but high spectral resolution. Remote sensing image fusion techniques were developed to combine the spatial and spectral information of the two. Although fusion algorithms have made significant progress, spectral distortion and insufficient detail remain problems. To address them, this paper proposes a new remote sensing image fusion algorithm that uses convolutional auto-encoders, an attention mechanism and a filter as the detail processing module, and an additive injection fusion rule as the fusion module. The convolutional auto-encoders learn the nonlinear mapping between low-resolution and high-resolution images, so that once training is complete, the high-resolution image corresponding to a low-resolution image can be obtained. Introducing the attention mechanism into the convolutional auto-encoders improves the sensitivity of the network to information and strengthens the channel-wise importance of image information. The filter plays two roles in this paper: it extracts the high- or low-frequency information of an image, and it produces the low-resolution image corresponding to a high-resolution image. The specific steps are described below.
First, Gaussian filters are used to obtain the high-frequency images of the low-resolution and high-resolution images used for model training, as well as the high-frequency image of the low-resolution multispectral image used for model prediction. Then, the convolutional auto-encoders learn the nonlinear mapping between the high-frequency image of the low-resolution image and the high-frequency image of the high-resolution image. Finally, the trained convolutional auto-encoders produce the missing detail information of the multispectral image, i.e., the high-frequency image of the high-resolution multispectral image, which is fused with the original image to generate the high-resolution multispectral image. For filter selection, experiments are conducted with the mean filter, Laplace filter, Gaussian filter and morphological filter, and the results show that the Gaussian filter gives a better fusion effect than the others. Experiments are also conducted on the number of training iterations of the network model. The objective metrics of the fused images are recorded for different numbers of iterations. Because the objective metrics fluctuate, a fitting function is fitted to them, and the influence of the number of iterations on the fusion results is read from the trend of the fitted curves. The fitted curves show that the proposed fusion algorithm obtains the best fused image at about 1 600 iterations. This paper combines the respective advantages of convolutional auto-encoders, the attention mechanism and the filter, and performs experiments on two datasets of images taken by the QuickBird and SPOT satellites, respectively. The images in the datasets are 512×512 pixels for both the multispectral and the panchromatic images.
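The filtering and additive injection steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the value of `sigma`, the kernel radius and the reflect padding are all hypothetical choices, and `predict_detail` is a placeholder for the trained convolutional auto-encoder, whose architecture is not specified here.

```python
import numpy as np

def gaussian_kernel1d(sigma: float, radius: int) -> np.ndarray:
    """Normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def gaussian_blur(img: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Separable Gaussian low-pass filter for a single 2-D band.

    The radius (3 * sigma) and reflect padding are assumptions of this sketch.
    """
    radius = max(1, int(3 * sigma))
    k = gaussian_kernel1d(sigma, radius)
    padded = np.pad(img.astype(np.float64), radius, mode="reflect")
    # Filter along rows, then along columns; "valid" removes the padding again.
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def high_frequency(img: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """High-frequency image: the input minus its Gaussian low-pass version."""
    return img.astype(np.float64) - gaussian_blur(img, sigma)

def inject_details(ms_band: np.ndarray, predict_detail, sigma: float = 1.0) -> np.ndarray:
    """Additive injection for one multispectral band.

    predict_detail is a hypothetical callable standing in for the trained
    auto-encoder: it maps the low-resolution high-frequency image to the
    predicted high-resolution high-frequency image.
    """
    hf_low = high_frequency(ms_band, sigma)
    hf_high = predict_detail(hf_low)
    return ms_band.astype(np.float64) + hf_high
```

With an identity stand-in such as `inject_details(band, lambda d: d)`, the call simply adds the band's own high-frequency content back to it; a trained model would replace the lambda.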
To expand the training dataset, the images are cropped into 8×8 patches using a sliding window. During training, the batch size is 256, the number of training iterations is 1 600, and the Adadelta optimizer is used to optimize the network parameters and adapt the learning rate. To demonstrate the effectiveness of the proposed algorithm, it is compared with classical fusion algorithms. Since this paper uses an additive injection fusion rule, the IHS and BDSD additive fusion algorithms are selected for comparison. PNN and GAN, which are typical deep learning fusion algorithms, are also included in the comparison. The comparison with the CAE fusion algorithm shows the effectiveness of the attention mechanism and filter introduced in this paper, which significantly improve the quality of the fused image. The Di-PNN and SR-D fusion algorithms are both detail injection fusion algorithms based on deep learning networks, and the comparison with them illustrates the effectiveness of the network structure in this paper. The results of the different fusion algorithms are compared in terms of subjective visual quality and objective metrics. The objective metrics are CC, UIQI, ERGAS, RASE, AG and SAM, where UIQI and AG describe the detail information of the image, and ERGAS, RASE, SAM and CC describe its spectral information. The larger the CC, UIQI and AG values, the better the image quality; the smaller the ERGAS, RASE and SAM values, the better the image quality. The experimental results show that the fused images produced by the proposed algorithm retain more spectral and detail information than those of the compared algorithms and perform well both subjectively and objectively.
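As a rough illustration of the data preparation and evaluation described above, the sketch below crops 8×8 patches with a sliding window and computes two of the listed metrics, CC (larger is better) and SAM (smaller is better). It uses the commonly cited definitions of these metrics, which may differ in detail from those in the paper. The stride is not stated in the abstract, so the default `stride=4` is an assumption, as are all function names.

```python
import numpy as np

def extract_patches(img: np.ndarray, size: int = 8, stride: int = 4) -> np.ndarray:
    """Crop size x size patches from an (H, W) or (H, W, C) array.

    The stride is an assumed value; the abstract only gives the patch size.
    """
    h, w = img.shape[:2]
    patches = [
        img[r:r + size, c:c + size]
        for r in range(0, h - size + 1, stride)
        for c in range(0, w - size + 1, stride)
    ]
    return np.stack(patches)

def correlation_coefficient(ref: np.ndarray, fused: np.ndarray) -> float:
    """Pearson correlation between two images; larger is better (maximum 1)."""
    a = ref.astype(np.float64).ravel()
    b = fused.astype(np.float64).ravel()
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))

def spectral_angle_mapper(ref: np.ndarray, fused: np.ndarray, eps: float = 1e-12) -> float:
    """Mean spectral angle in radians over the pixels of (H, W, C) images.

    Smaller is better; eps guards against division by zero for all-zero pixels.
    """
    a = ref.reshape(-1, ref.shape[-1]).astype(np.float64)
    b = fused.reshape(-1, fused.shape[-1]).astype(np.float64)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    cos = (a * b).sum(axis=1) / (norms + eps)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)).mean())
```

For a 512×512 image, `extract_patches(img, size=8, stride=8)` yields 4096 non-overlapping patches; a smaller stride produces overlapping patches and therefore a larger training set.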