• Acta Photonica Sinica
  • Vol. 51, Issue 2, 0210007 (2022)
Tianwei YU1, Enrang ZHENG1,*, Junge SHEN2, and Kai WANG3
Author Affiliations
  • 1School of Electrical and Control Engineering, Shaanxi University of Science and Technology, Xi'an 710021, China
  • 2Unmanned System Research Institute, Northwestern Polytechnical University, Xi'an 710072, China
  • 3Henan Key Laboratory of Underwater Intelligent Equipment, Zhengzhou 450000, China
    DOI: 10.3788/gzxb20225102.0210007
    Tianwei YU, Enrang ZHENG, Junge SHEN, Kai WANG. Optical Remote Sensing Image Scene Classification Based on Multi-level Cross-layer Bilinear Fusion[J]. Acta Photonica Sinica, 2022, 51(2): 0210007

    Abstract

    Remote sensing is a detection technology that provides non-contact observation of the Earth's surface through sensor platforms. With the rapid development of unmanned aerial vehicle and satellite remote sensing technology, quantitative remote sensing images of ever higher resolution can be generated. Compared with medium- and low-resolution remote sensing images, these high-resolution images contain richer ground objects and spatial details, and express the spatial structure and texture features of ground objects more clearly, providing good conditions and a solid foundation for remote sensing image interpretation and analysis. Therefore, high-resolution remote sensing images have become an important data source for fine Earth observation. Scene classification of high-resolution remote sensing images refers to analyzing the information extracted from an image and assigning the scene of interest to a category such as forest, river or railway; it is widely applied in environmental monitoring, urban planning, military object detection, global climate change research and other fields.

    Unlike general natural images, remote sensing images exhibit highly complex geometric structures and spatial patterns, along with cluttered backgrounds and numerous categories, which makes effectively describing their content a great challenge. In addition, owing to the complexity and diversity of remote sensing scenes, different scenes may contain almost the same ground objects, while the same scene may contain different ground objects. In this regard, how a discriminative feature representation is designed to describe the image directly affects the quality of scene classification. Over the past few decades many approaches have been proposed, and most fall into two main categories. Traditional scene classification methods, such as the Scale-Invariant Feature Transform (SIFT), Histogram of Oriented Gradients (HOG) and Color Histogram (CH), mainly use hand-crafted features and depend heavily on the designer's prior knowledge, yielding features with low-level semantics and limited representational capacity. By contrast, Convolutional Neural Networks (CNNs) have been successfully applied to remote sensing scene classification owing to their excellent feature self-learning ability: they learn features directly from data without requiring the designer's prior knowledge. However, the accuracy of CNN-based scene classification largely depends on the network structure, and the complex spatial patterns, large inter-class similarity and high intra-class diversity of remote sensing scene images severely limit classification accuracy.

    To address the above issues, a novel remote sensing image scene classification algorithm based on multi-level cross-layer bilinear fusion is proposed. Firstly, ResNet50, which alleviates model overfitting and gradient vanishing, is employed to extract multi-level features from the remote sensing image; specifically, the four multi-scale, multi-level feature maps of the conv2_x, conv3_x, conv4_x and conv5_x layers are extracted by the ResNet50 model. Dilated convolutions with different dilation rates perceive scene information at multiple spatial scales, encouraging the network to acquire features at different scales, so context features at multiple spatial scales are extracted by setting the dilation rate to different values.
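    To make this first stage concrete, the following minimal PyTorch sketch shows how multi-level features could be taken from the conv2_x through conv5_x stages of ResNet50 (layer1 through layer4 in torchvision) and enriched with parallel dilated convolutions at several dilation rates. The output width and the dilation rates here are illustrative assumptions, not the authors' exact configuration.

        # Minimal PyTorch sketch (assumptions: torchvision's resnet50; the
        # output width and dilation rates are illustrative, not the paper's).
        import torch
        import torch.nn as nn
        from torchvision.models import resnet50

        class MultiLevelExtractor(nn.Module):
            """Take conv2_x..conv5_x feature maps from ResNet50 and enrich
            each with multi-scale context via parallel dilated 3x3 convs."""
            def __init__(self, width=256, rates=(1, 2, 4)):
                super().__init__()
                backbone = resnet50(weights=None)  # load pretrained weights in practice
                self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                          backbone.relu, backbone.maxpool)
                self.stages = nn.ModuleList([backbone.layer1, backbone.layer2,
                                             backbone.layer3, backbone.layer4])
                # One multi-rate dilated block per stage (ResNet50 stage
                # widths are 256, 512, 1024 and 2048 channels).
                self.context = nn.ModuleList([
                    nn.ModuleList([nn.Conv2d(c, width, 3, padding=r, dilation=r)
                                   for r in rates])
                    for c in (256, 512, 1024, 2048)])

            def forward(self, x):
                x = self.stem(x)
                feats = []
                for stage, branches in zip(self.stages, self.context):
                    x = stage(x)
                    # Concatenate the responses of all dilation rates so each
                    # level carries context from several spatial scales.
                    feats.append(torch.cat([b(x) for b in branches], dim=1))
                return feats  # four multi-scale, multi-level feature maps

        feats = MultiLevelExtractor()(torch.randn(1, 3, 224, 224))
        print([tuple(f.shape) for f in feats])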
    Then, the scene semantics of the feature information are enriched by serial fusion of the multi-scale features. Features at different levels carry different types of information: high-level features provide global semantic information, which helps to identify and locate objects in the image, while low-level features contain rich local spatial information that refines the internal structure of salient objects and compensates high-level features for their loss of spatial detail, which is beneficial for classification. The global context of an image has a global receptive field; taking it into account allows the scene category to be inferred while the interference of background details is filtered out. By exploiting low-level, high-level and global context features together, a multi-level attention feature fusion module is presented, which effectively enhances the feature extraction capability of the model (see the first sketch after this abstract). Its spatial attention focuses on the key locations of the scene image, adaptively learning the importance of different image regions and suppressing irrelevant background information. The global context information is integrated into the fusion of low-level local features and high-level semantic features so that the feature information of each level is complementary, leading to strong scene classification accuracy.

    Finally, inspired by fine-grained visual classification, a cross-layer bilinear fusion method is used to fuse the multi-level features layer by layer, and the fused features are used for classification (see the second sketch after this abstract). A Hadamard product between features at any two different levels extracts second-order bilinear information; this cross-layer modeling captures the associations between local features, achieves hierarchical feature interaction and efficient information integration, and fully aggregates the deep semantic information and shallow texture information contained in the different hierarchical features. Moreover, compared with the traditional bilinear pooling method, the Hadamard product multiplies the corresponding elements of two matrices and thus does not change their dimension, effectively avoiding the dimension explosion caused by the outer product operation.

    Extensive experiments conducted on the UCM, AID and PatternNet datasets verify the effectiveness of the proposed method, which achieves better classification performance than other advanced approaches. On the UCM dataset, with 80% of the data used for training, the overall accuracy reaches 99.32%, an improvement of 0.75% over GBNet. On the AID dataset, the proposed method achieves 95.84% accuracy with 50% training samples, 2.74% higher than ARCNet. On the PatternNet dataset, with 50% training samples the overall accuracy is 99.60%, 0.02% higher than SDAResNet.
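    The multi-level attention feature fusion module is described only at a high level above. The sketch below gives one plausible reading, in which a spatial attention map reweights the low-level feature, the upsampled high-level feature is added, and a globally pooled context vector gates the channels of the fused result. The module layout and fusion arithmetic are assumptions for illustration, and both inputs are assumed to share one channel count.

        # Hypothetical sketch of the attention fusion idea; module layout and
        # fusion arithmetic are assumptions, not the paper's definition.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class AttentionFusion(nn.Module):
            def __init__(self, channels):
                super().__init__()
                # Spatial attention: one 7x7 conv over per-location statistics.
                self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)
                # Global context: pooled vector re-expanded to channel gates.
                self.gate = nn.Sequential(nn.Linear(channels, channels // 4),
                                          nn.ReLU(inplace=True),
                                          nn.Linear(channels // 4, channels),
                                          nn.Sigmoid())

            def forward(self, low, high):
                # Bring the high-level map up to the low-level resolution.
                high = F.interpolate(high, size=low.shape[-2:],
                                     mode="bilinear", align_corners=False)
                # Mean/max maps -> attention weights over image regions,
                # suppressing irrelevant background locations.
                stats = torch.cat([low.mean(1, keepdim=True),
                                   low.amax(1, keepdim=True)], dim=1)
                low = low * torch.sigmoid(self.spatial(stats))
                fused = low + high
                # Inject global context as channel-wise gates on the fusion.
                g = self.gate(fused.mean(dim=(2, 3)))
                return fused * g.unsqueeze(-1).unsqueeze(-1)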
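    For the cross-layer bilinear step, the sketch below shows the Hadamard-product form of bilinear pooling on two feature levels: both maps are projected to a common embedding with 1x1 convolutions, multiplied element-wise so that, unlike the outer product, the dimension does not grow, then globally pooled and signed-sqrt/L2 normalized as is conventional for bilinear features. The embedding size and normalization details are assumptions.

        # Sketch of Hadamard-product bilinear fusion between two levels; the
        # embedding size and normalization details are assumptions.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class CrossLayerBilinear(nn.Module):
            def __init__(self, c_a, c_b, dim=512):
                super().__init__()
                self.proj_a = nn.Conv2d(c_a, dim, kernel_size=1)
                self.proj_b = nn.Conv2d(c_b, dim, kernel_size=1)

            def forward(self, fa, fb):
                fb = F.interpolate(fb, size=fa.shape[-2:],
                                   mode="bilinear", align_corners=False)
                # Element-wise (Hadamard) product captures second-order
                # interactions without the outer product's d x d blow-up.
                z = self.proj_a(fa) * self.proj_b(fb)
                z = z.flatten(2).mean(-1)  # global average pooling
                # Signed sqrt + L2 normalization, as is usual for bilinear features.
                z = torch.sign(z) * torch.sqrt(z.abs() + 1e-8)
                return F.normalize(z, dim=1)

    A descriptor of this kind could be computed for each pair of levels and the results concatenated before the classifier, which would match the layered fusion of multi-level features described above; that wiring is likewise an assumption here.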