• Acta Optica Sinica
  • Vol. 41, Issue 22, 2228001 (2021)
Yuanjun Nong and Junjie Wang*
Author Affiliations
  • College of Engineering, Ocean University of China, Qingdao, Shandong 266100, China
    DOI: 10.3788/AOS202141.2228001
    Yuanjun Nong, Junjie Wang. Remote Sensing Image Caption Method Based on Attention and Reinforcement Learning[J]. Acta Optica Sinica, 2021, 41(22): 2228001

    Abstract

    Current remote sensing object detection methods only identify the category and location of objects and cannot generate text captions describing the content of remote sensing images. To solve this problem, this paper proposes a remote sensing image caption method based on attention and reinforcement learning. First, a convolutional neural network is used to construct an encoder that extracts remote sensing image features. Second, a decoder is built with a long short-term memory network to learn the mapping between image features and textual semantic features. Third, an attention mechanism is introduced to strengthen the model's focus on salient features and reduce interference from irrelevant background features. Finally, a reinforcement learning strategy is adopted to optimize the model directly with respect to the discrete, non-differentiable evaluation metrics, eliminating the defects of exposure bias and inconsistent optimization objectives. Experimental results on public remote sensing image caption datasets show that the method achieves high accuracy and good caption performance for remote sensing images in complex conditions such as dense small targets, fog, and backgrounds that resemble the objects.
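
    The attention and reinforcement-learning steps summarized above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it assumes additive (Bahdanau-style) attention over encoder features and a self-critical (SCST-style) advantage, where the reward of a sampled caption is baselined by the reward of the greedy caption; the actual attention form and reward used in the paper are not specified in the abstract.

    ```python
    import numpy as np

    def additive_attention(features, hidden, Wf, Wh, v):
        """Additive attention over L image-region features of size D.

        features: (L, D) encoder features; hidden: (H,) decoder state.
        Returns the context vector (D,) and attention weights (L,).
        """
        # Score each region against the current decoder state.
        scores = np.tanh(features @ Wf + hidden @ Wh) @ v  # (L,)
        # Softmax with max-subtraction for numerical stability.
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        # Context vector: attention-weighted sum of region features.
        context = weights @ features  # (D,)
        return context, weights

    def self_critical_advantage(sample_reward, greedy_reward):
        """SCST-style advantage: reward of the sampled caption minus the
        greedy-decoded baseline; positive values reinforce the sample."""
        return sample_reward - greedy_reward

    # Toy example with random features and parameters.
    rng = np.random.default_rng(0)
    L, D, H = 4, 8, 6
    features = rng.normal(size=(L, D))
    hidden = rng.normal(size=(H,))
    Wf = rng.normal(size=(D, H))
    Wh = rng.normal(size=(H, H))
    v = rng.normal(size=(H,))

    ctx, w = additive_attention(features, hidden, Wf, Wh, v)
    adv = self_critical_advantage(sample_reward=0.7, greedy_reward=0.5)
    ```

    In training, the advantage would scale the log-likelihood gradient of the sampled caption, so sequences scoring above the greedy baseline are reinforced and those below it are suppressed, which lets a discrete metric such as CIDEr drive the update directly.
    
    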