Acta Photonica Sinica, Vol. 52, Issue 1, 0110003 (2023)

Haoran LI, Wei XIONG*, Yaqi CUI, Xiangqi GU, and Pingliang XU
Research Institute of Information Fusion, Naval Aviation University, Yantai 264001, China

DOI: 10.3788/gzxb20235201.0110003

Haoran LI, Wei XIONG, Yaqi CUI, Xiangqi GU, Pingliang XU. Enhancing Remote Sensing Image Unsupervised Hashing Cross-modal Correlation with Similarity Matrix[J]. Acta Photonica Sinica, 2023, 52(1): 0110003

    Abstract

    With the continuous enrichment of satellite-borne and airborne remote sensing detection methods, the types of remote sensing data obtained are increasingly diverse and the data scale is constantly expanding, which strongly drives the development of cross-modal correlation methods in the field of remote sensing. The cross-modal retrieval task refers to retrieving relevant data in other modalities according to a query sample in a given modality. Multi-modal data usually include images, text, video, audio, etc. Remote sensing images and text are important components of intelligence information, and establishing correlations between remote sensing images and textual information is of great significance for the effective use of multi-source intelligence data; mutual verification between the two helps further improve the reliability of acquired intelligence. Remote sensing data usually contain rich information, but effectively extracting serviceable knowledge from such massive data can be very challenging. With the continuous development of deep learning, deep neural networks are increasingly used to obtain feature representations of different modalities, and mapping cross-modal information into the same feature space helps to bridge the "heterogeneous gap" among modalities. Hashing methods achieve fast retrieval speed and high efficiency; with the growth of remote sensing data types and scale, they have attracted more and more attention in remote sensing cross-modal retrieval. However, existing unsupervised deep cross-modal hashing methods still have some problems. Usually, the similarity information of each modality is learned separately, and without the assistance of label information, the model cannot correctly and effectively capture the semantic correlation between different modalities.
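The efficiency argument for hashing can be illustrated with a toy sketch: once real-valued features are binarized into hash codes, cross-modal similarity reduces to a Hamming distance computed with cheap bitwise operations. All shapes, names, and the sign-based binarization below are illustrative assumptions, not the paper's specific design.

```python
import numpy as np

rng = np.random.default_rng(0)
image_feats = rng.standard_normal((4, 16))   # 4 image features, 16-d
text_feats = rng.standard_normal((3, 16))    # 3 text features, 16-d

# sign-based binarization to {0, 1} hash codes (one common convention)
img_codes = (image_feats > 0).astype(np.uint8)
txt_codes = (text_feats > 0).astype(np.uint8)

# Hamming distance for every image-text pair via XOR + bit count
hamming = (img_codes[:, None, :] ^ txt_codes[None, :, :]).sum(axis=2)
print(hamming.shape)  # (4, 3): one distance per image-text pair

# cross-modal retrieval: nearest text for each image query
nearest = hamming.argmin(axis=1)
```

Because the codes are compact binary vectors, they also cut memory consumption compared with real-valued representations, which is the advantage the abstract attributes to hash coding.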
In addition, most deep hashing methods generate hash codes directly from the original features produced by deep neural networks, and the resulting hash features struggle to retain satisfactory discriminative information. In general, data from different modalities can give a comprehensive description of the same object. Owing to the applicability and flexibility of cross-modal retrieval, many such methods have been widely explored in the computer vision community, and in recent years some studies on cross-modal retrieval have appeared in the remote sensing field. However, most existing cross-modal correlation methods in remote sensing are based on real-valued representations, which suffer from slow retrieval speed and large memory consumption. Hash coding can effectively improve retrieval efficiency and is better suited to large-scale, rapid retrieval tasks, but some semantic information is lost when features are converted to hash codes. Therefore, this paper proposes a similarity-matrix-assisted unsupervised cross-modal hashing method for remote sensing images. Similarity matrices constructed from the original features and from the hash features are used to integrate the semantic correlation information between modalities, so as to preserve intra-modal and inter-modal semantic correlations as much as possible, while semantic alignment between the two similarity matrices reduces the information loss incurred when original features are converted to hash codes. The loss function is a weighted sum of the similarity-matrix loss and a contrastive loss; their combination effectively improves the accuracy of unsupervised cross-modal hash correlation, making the method better suited to large-scale cross-modal remote sensing image retrieval tasks.
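The similarity-matrix alignment idea can be sketched minimally, under our own simplifying assumptions: cosine-similarity matrices are built from the original deep features and from a relaxed (tanh) version of the hash features, and a squared-error penalty drives the two matrices to agree. Function names, the relaxation, and the exact penalty are hypothetical stand-ins, not the paper's formulation.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def similarity_alignment_loss(orig_img, orig_txt, hash_img, hash_txt):
    """Mean squared disagreement between the original-feature and
    hash-feature cross-modal similarity matrices (one possible
    alignment term)."""
    s_orig = cosine_similarity_matrix(orig_img, orig_txt)
    s_hash = cosine_similarity_matrix(hash_img, hash_txt)
    return np.mean((s_orig - s_hash) ** 2)

rng = np.random.default_rng(0)
orig_img = rng.standard_normal((8, 64))   # original image features
orig_txt = rng.standard_normal((8, 64))   # original text features
# tanh as a differentiable relaxation of sign() binarization
hash_img = np.tanh(rng.standard_normal((8, 32)))
hash_txt = np.tanh(rng.standard_normal((8, 32)))

align = similarity_alignment_loss(orig_img, orig_txt, hash_img, hash_txt)
```

In a full training loop this alignment term would be combined with a contrastive loss in a weighted sum, as the abstract describes; the weighting coefficient would be a tunable hyperparameter.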
Experimental results on benchmark datasets in the remote sensing field show that the proposed method outperforms existing baseline methods. However, the design of the model's original feature extraction module does not fully account for the richness of semantic information in each modality, and the calculation of the similarity matrix is relatively simple; future work can further improve correlation accuracy.