Scene Classification of Optical High-resolution Remote Sensing Images Using Vision Transformer and Graph Convolutional Network

Jianan WANG; Yue GAO; Jun SHI; Ziqi LIU

doi:10.3788/gzxb20215011.1128002

Abstract

Most existing optical remote sensing scene classification methods based on convolutional neural networks mainly perform global feature learning and fail to consider the local features in the scene, so they cannot effectively address large intraclass differences and high interclass similarity. Therefore, a novel remote sensing scene classification method based on two branches, a vision transformer and a graph convolutional network, is proposed. In the first branch, the scene image is divided into patches, and positional encoding and a vision transformer are then used to encode the patches, so that long-range dependencies can be mined. In the second branch, the scene image is converted into superpixels. The convolutional neural network features of each superpixel are pooled and used to represent a node of the graph structure, and a graph convolutional network is then applied to model the spatial topological relationships. Finally, the feature representation of the scene image is described jointly by the features of the two branches. Experimental results on optical remote sensing image datasets demonstrate the effectiveness of the proposed method.
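The two-branch data flow described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation, and every concrete choice in it is an assumption made for illustration: a fixed sinusoidal positional encoding, a single weight-free self-attention step in place of a full vision transformer, a regular grid of blocks in place of a real superpixel segmentation, raw pixel values in place of convolutional neural network features, a single graph convolution layer with symmetric normalization, and concatenation as the way the two branch features are combined. All function names are hypothetical.

```python
import numpy as np

def split_into_patches(image, patch):
    """Cut an (H, W, C) image into non-overlapping patch x patch tiles, one row per tile."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    tiles = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    return tiles.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)

def sinusoidal_positions(n, dim):
    """Fixed sinusoidal positional encoding (assumed choice; dim must be even)."""
    pos = np.arange(n)[:, None]
    freq = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    enc = np.zeros((n, dim))
    enc[:, 0::2] = np.sin(pos * freq)
    enc[:, 1::2] = np.cos(pos * freq)
    return enc

def self_attention(x):
    """Single-head scaled dot-product self-attention without learned weights."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

def pool_regions(features, labels, n_regions):
    """Average per-pixel features (H, W, D) inside each region to get node features."""
    flat = features.reshape(-1, features.shape[-1])
    ids = labels.ravel()
    sums = np.zeros((n_regions, flat.shape[1]))
    np.add.at(sums, ids, flat)
    counts = np.bincount(ids, minlength=n_regions)[:, None]
    return sums / np.maximum(counts, 1)

def region_adjacency(labels, n_regions):
    """Connect two regions when they touch horizontally or vertically."""
    adj = np.zeros((n_regions, n_regions))
    for a, b in ((labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])):
        adj[a.ravel(), b.ravel()] = 1
        adj[b.ravel(), a.ravel()] = 1
    np.fill_diagonal(adj, 0)
    return adj

def gcn_layer(adj, x, weight):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(adj.shape[0])
    d = a_hat.sum(axis=1)
    norm = a_hat / np.sqrt(np.outer(d, d))
    return np.maximum(norm @ x @ weight, 0)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))  # random stand-in for a scene image

# Branch 1: patches + positional encoding + attention over all patches.
tokens = split_into_patches(image, 8)
tokens = tokens + sinusoidal_positions(*tokens.shape)
global_feat = self_attention(tokens).mean(axis=0)

# Branch 2: regions -> pooled node features -> graph convolution.
# A 4 x 4 grid of 8 x 8 blocks stands in for superpixels (assumption).
rows, cols = np.indices((32, 32))
labels = (rows // 8) * 4 + cols // 8
nodes = pool_regions(image, labels, 16)  # raw pixels stand in for CNN features
adj = region_adjacency(labels, 16)
weight = rng.standard_normal((3, 8))     # untrained, random projection
local_feat = gcn_layer(adj, nodes, weight).mean(axis=0)

# Combine the two branches; concatenation is an assumed fusion rule.
fused = np.concatenate([global_feat, local_feat])
```

In this sketch the attention step lets every patch attend to every other patch, which is what allows long-range dependencies to be captured, while the graph branch only propagates information between spatially adjacent regions. A trained model would replace the random projection and the weight-free attention with learned parameters.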