• Acta Photonica Sinica
  • Vol. 51, Issue 4, 0410002 (2022)
Zhishe WANG1,*, Wenyu SHAO1, Fengbao YANG2, and Yanlin CHEN1
Author Affiliations
  • 1 School of Applied Science, Taiyuan University of Science and Technology, Taiyuan 030024, China
  • 2 School of Information and Communication Engineering, North University of China, Taiyuan 030051, China
    DOI: 10.3788/gzxb20225104.0410002
    Zhishe WANG, Wenyu SHAO, Fengbao YANG, Yanlin CHEN. Infrared and Visible Image Fusion Method via Interactive Attention-based Generative Adversarial Network[J]. Acta Photonica Sinica, 2022, 51(4): 0410002

    Abstract

    Infrared sensors capture prominent target characteristics through thermal radiation imaging; however, the obtained infrared images usually lack structural features and texture details. In contrast, visible sensors obtain rich scene information through light reflection imaging, and the resulting visible images have high spatial resolution and abundant texture details, but they cannot effectively perceive target characteristics, especially under low-illumination conditions. Infrared and visible image fusion aims to integrate the advantages of the two types of sensors and generate a composite image with better target perception and superior scene representation, and it is widely applied to object tracking, object detection and pedestrian re-identification. Existing generative adversarial network-based fusion methods rely only on convolution operations to extract local features and do not consider their long-range dependence, which easily causes fusion imbalance, so that the fused image cannot simultaneously retain the typical targets of the infrared image and the texture details of the visible image. To this end, an end-to-end infrared and visible image fusion method via an interactive attention-based generative adversarial network is proposed.

    Firstly, in the generative network model, we adopt a dual-path encoder with shared weight parameters to extract the multi-scale deep features of each source image, where the first normal convolution layer extracts low-level features and two multi-scale aggregation convolution modules extract high-level features. By aggregating multiple available receptive fields, the multi-scale dual-path encoder can efficiently extract more meaningful information for the fusion task without down-sampling or up-sampling operations. Secondly, in the fusion layer, we design an interactive attention fusion model, which cascades channel and spatial attention models to establish the global dependence of local features along the channel and spatial dimensions. The obtained attention maps refine the multi-scale feature maps to focus more on typical infrared targets and visible texture details, so that the fused image achieves a better visual effect. Finally, in the adversarial network model, we propose two discriminators, namely Discriminator-IR and Discriminator-VIS, to balance the truth-falsity between the fused image and the source images. In addition, we introduce a mutually-compensated loss function to supervise the entire network, which gradually optimizes the generative network model to obtain the best fused result.

    In the ablation study and verification experiments, the TNO and Roadscene datasets and eight evaluation metrics are employed to demonstrate the effectiveness and superiority of the proposed method. The ablation results of the interactive attention fusion model indicate that, compared with four other models, our model can effectively establish the global dependence of local features and further improve infrared and visible image fusion performance. In addition, compared with nine state-of-the-art fusion methods, namely WLS, DenseFuse, IFCNN, SEDRFuse, U2Fusion, PMGI, FusionGAN, GANMcC and RFN-Nest, the proposed method achieves more balanced fusion results in retaining the typical targets of the infrared image and the rich texture details of the visible image, and produces a better visual effect that is more suitable for the human visual system. Meanwhile, from a multi-index evaluation perspective, the proposed method offers better fusion performance, higher computational efficiency and stronger robustness than the other state-of-the-art fusion methods.
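To make the multi-scale aggregation idea in the encoder concrete, the following PyTorch sketch stacks one plain convolution layer (low-level features) and two blocks that aggregate several receptive fields in parallel without any down- or up-sampling. The use of dilated 3×3 branches, the channel widths and the single-channel input are assumptions for illustration, not the paper's exact configuration.

```python
# Hedged sketch of a multi-scale aggregation convolution block and a
# weight-shared dual-path encoder (apply the same Encoder to IR and VIS).
import torch
import torch.nn as nn

class MultiScaleAggregation(nn.Module):
    """Parallel dilated 3x3 convolutions (assumed branch design) whose outputs
    keep the input spatial size and are aggregated by a 1x1 convolution."""
    def __init__(self, in_ch, out_ch, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=d, dilation=d),
                nn.LeakyReLU(0.2, inplace=True),
            )
            for d in dilations
        ])
        self.aggregate = nn.Conv2d(len(dilations) * out_ch, out_ch, 1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]  # same H x W per branch
        return self.aggregate(torch.cat(feats, dim=1))

class Encoder(nn.Module):
    """One normal convolution for low-level features, then two multi-scale
    aggregation modules for high-level features; no pooling or striding."""
    def __init__(self, base_ch=16):
        super().__init__()
        self.low = nn.Conv2d(1, base_ch, 3, padding=1)   # grayscale input assumed
        self.high1 = MultiScaleAggregation(base_ch, base_ch * 2)
        self.high2 = MultiScaleAggregation(base_ch * 2, base_ch * 4)

    def forward(self, img):
        f0 = torch.relu(self.low(img))
        f1 = self.high1(f0)
        f2 = self.high2(f1)
        return f0, f1, f2  # multi-scale features at full resolution
```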
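The interactive attention fusion model cascades channel and spatial attention over the encoder features. The sketch below shows one plausible, CBAM-style reading of that cascade applied to the concatenated infrared and visible feature maps; the concatenate-then-reduce arrangement, reduction ratio and kernel size are assumptions rather than the published architecture.

```python
# Hedged sketch of a cascaded channel -> spatial attention fusion block.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))       # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))        # global max pooling branch
        w = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * w                             # channel-refined features

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)        # average over channels
        mx, _ = x.max(dim=1, keepdim=True)       # max over channels
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w                             # spatially refined features

class InteractiveAttentionFusion(nn.Module):
    """Assumed arrangement: concatenate IR/VIS features, apply channel then
    spatial attention, and reduce back to the original channel count."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(2 * channels)
        self.sa = SpatialAttention()
        self.reduce = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, feat_ir, feat_vis):
        x = torch.cat([feat_ir, feat_vis], dim=1)
        x = self.sa(self.ca(x))                  # channel -> spatial cascade
        return self.reduce(x)                    # fused feature map
```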
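For the adversarial part, the sketch below pairs a least-squares GAN objective with the two discriminators, Discriminator-IR and Discriminator-VIS, and uses a simple L1 content term as a stand-in for the paper's mutually-compensated loss; the loss form and the weight lambda_content are assumptions made only to illustrate how two discriminators can balance the fused image against both source images.

```python
# Hedged sketch of a dual-discriminator training objective (least-squares GAN
# form assumed; the paper's mutually-compensated loss is not reproduced here).
import torch
import torch.nn.functional as F

def discriminator_loss(d, real, fused):
    """Each discriminator scores its own source image as real (1) and the
    fused image as fake (0)."""
    real_score = d(real)
    fake_score = d(fused.detach())
    return F.mse_loss(real_score, torch.ones_like(real_score)) + \
           F.mse_loss(fake_score, torch.zeros_like(fake_score))

def generator_loss(d_ir, d_vis, fused, ir, vis, lambda_content=10.0):
    # Two adversarial terms: the fused image should look "real" to both
    # Discriminator-IR and Discriminator-VIS.
    ir_score, vis_score = d_ir(fused), d_vis(fused)
    adv = F.mse_loss(ir_score, torch.ones_like(ir_score)) + \
          F.mse_loss(vis_score, torch.ones_like(vis_score))
    # Simple intensity content term standing in for the mutually-compensated loss.
    content = F.l1_loss(fused, ir) + F.l1_loss(fused, vis)
    return adv + lambda_content * content

# Per training step (sketch): update Discriminator-IR with
# discriminator_loss(d_ir, ir, fused), Discriminator-VIS with
# discriminator_loss(d_vis, vis, fused), then update the generator with
# generator_loss(...).
```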