A Bridge Neural Network-Based Optical-SAR Image Joint Intelligent Interpretation Framework

The current interpretation technology for remote sensing images mainly focuses on single-modal data and cannot fully utilize the complementary and correlated information of multimodal data with heterogeneous characteristics, especially synthetic aperture radar (SAR) data and optical imagery. To solve this problem, we propose a bridge neural network- (BNN-) based optical-SAR image joint intelligent interpretation framework, which optimizes the feature correlation between optical and SAR images through optical-SAR matching tasks. It adopts BNN to effectively improve the common feature extraction capability for optical and SAR images and thus improve the accuracy and broaden the application scenarios of specific intelligent interpretation tasks for optical-SAR/SAR/optical images. Specifically, BNN projects optical and SAR images into a common feature space and mines their correlation through pair matching. Further, to deeply exploit the correlation between optical and SAR images and ensure the strong representation learning ability of BNN, we build the QXS-SAROPT dataset, containing 20,000 pairs of perfectly aligned optical-SAR image patches with diverse scenes at high resolution. Experimental results on optical-to-SAR crossmodal object detection demonstrate the effectiveness and superiority of our framework. In particular, based on the QXS-SAROPT dataset, our framework can achieve up to 96% accuracy on four benchmark SAR ship detection datasets.


Introduction
With the rapid development of deep learning, remarkable breakthroughs have been made over the past decade in deep learning-based land use segmentation, scene classification, object detection, and recognition in the field of remote sensing [1][2][3][4]. This is mainly due to the powerful feature extraction and representation ability of deep neural networks [5][6][7][8], which can well map remote sensing observations into the desired geographical knowledge. However, the current mainstream interpretation technology for remote sensing images is still mainly focused on single-modal data and cannot make full use of the complementary and correlated information of multimodal data from different sensors with heterogeneous characteristics, resulting in insufficient intelligent interpretation capabilities and limited application scenarios. For example, optical imaging is easily restricted by illumination and weather conditions, so accurate interpretation cannot be obtained at night or under complex weather with clouds, fog, and so on. Compared with optical imaging, synthetic aperture radar (SAR) imaging can achieve full-time, all-weather earth observation. However, due to the lack of texture features, SAR images are difficult to interpret even for well-trained experts. Therefore, gathering sufficient training SAR data with diverse scenes and accurate labels is a challenging problem, which heavily hinders research into and applications of intelligent SAR image interpretation. To address the above issues, multimodal data fusion [9][10][11][12] has become one of the most promising application directions of deep learning in remote sensing, especially the combined utilization of SAR and optical data, because these data modalities differ completely from each other in terms of geometric and radiometric appearance [13][14][15][16][17].
However, the existing optical-SAR fusion techniques mainly concentrate on the matching problem. The proposed optical-SAR image matching methods can be divided into three types: signal-based, hand-crafted feature-based, and deep learning-based approaches. Among the signal-based similarity measures, cross-correlation (CC) [18] and mutual information (MI) [19,20] have been widely applied to optical-SAR matching tasks. Since MI is an intensity-based statistical measure with good adaptability to geometric and radiometric changes, it has extensively outperformed CC in optical-SAR image matching. Nevertheless, signal-based approaches, which do not incorporate any local structure information, are not robust and accurate enough for matching multisensor images. Feature-based methods commonly utilize invariant key points and feature descriptors. Their better results may stem from the feature descriptors, which are less sensitive to geometric and radiometric changes. Many traditional hand-crafted methods have been proposed for optical-SAR image matching, such as SIFT [21], SAR-SIFT [22], and HOPC [23,24]. However, considering the high divergence between SAR and optical images and the available computing power, hand-crafted feature-based matching approaches are quite limited in making further progress. Owing to their powerful feature extraction and representation learning ability, convolutional neural networks exploited to extract deep features have achieved high matching accuracy. The mainstream architecture for optical-SAR image matching is the Siamese network [25][26][27][28], which is composed of two identical convolutional streams. The dual network extracts deep characteristic information from the input image pairs, so the deep features lie in the same space and can be measured under the same metric. However, the Siamese network has only been applied to the optical-SAR image matching problem itself; no subsequent joint optical-SAR image interpretation work has followed.
Based on the analysis above, we innovatively propose a bridge neural network- (BNN-) based optical-SAR image joint intelligent interpretation framework, which utilizes BNN to enhance the general feature embedding of optical and SAR images so as to improve the accuracy and broaden the application scenarios of specific optical-SAR image joint intelligent interpretation tasks. Completely different from the Siamese network, BNN contains two independent feature extraction networks and projects the optical and SAR images into a subspace to learn the desired common representation, where features can be measured with the Euclidean distance. The proposed framework is shown in Figure 1. BNN is trained on an optical-SAR image matching dataset to learn the common representation of optical and SAR images, so that the BNN model can be transferred to the feature extraction module for fine-tuning the interpretation model on optical-SAR/SAR/optical image interpretation datasets.
Further, to verify the effectiveness and superiority of our proposed framework and promote research in deep learning-based optical-SAR image fusion, it is very important to obtain datasets with a large number of perfectly aligned optical-SAR images. The existing optical-SAR image matching datasets either lack scene diversity due to the huge difficulty of pixel-level matching between optical and SAR images [29], have a low resolution limited by the remote sensing satellites [14], or cover only a single area [30], and thus cannot fully exploit the relevance of optical and SAR images. We therefore publish the QXS-SAROPT dataset, which contains 20,000 optical-SAR patch pairs from multiple scenes at a high resolution. Specifically, the SAR images are collected from the Gaofen-3 satellite [31], and the corresponding optical images are from Google Earth [32]. These images spread across the landmasses of San Diego, Shanghai, and Qingdao. The QXS-SAROPT dataset, under the open access license CC BY, is publicly available at https://github.com/yaoxu008/QXS-SAROPT.
On this basis, we conduct experiments on optical-to-SAR crossmodal object detection to demonstrate the effectiveness and superiority of our framework. In particular, based on the QXS-SAROPT dataset, our framework can achieve up to 96% accuracy on four benchmark SAR ship detection datasets.
The contributions of this paper can be summarized as follows: (i) We propose a BNN-based optical-SAR image joint intelligent interpretation framework, which can effectively improve the generic feature extraction capability for optical and SAR images and thus improve the accuracy and broaden the application scenarios of specific intelligent interpretation tasks for optical-SAR/SAR/optical images. (ii) We publish an optical-SAR matching dataset, QXS-SAROPT, which contains 20,000 optical-SAR image pairs from multiple scenes at a high resolution of 1 meter to support the joint interpretation of optical and SAR images. (iii) We apply the BNN-based optical-SAR image joint intelligent interpretation framework to SAR ship detection and achieve high accuracy on four SAR ship detection benchmark datasets.

Figure 1: The BNN-based optical-SAR image joint intelligent interpretation framework.

Methodology
In this section, the details of the bridge neural network (BNN) and the proposed BNN-based joint interpretation framework are introduced.

Bridge Neural Network.
The bridge neural network (BNN) proposed in [33] is adopted to learn the common representations of optical and SAR images on the optical-SAR image matching tasks, as shown in Figure 2.
Let X_s = {x_s^i} denote the set of SAR images and X_o = {x_o^i} the set of corresponding optical images. We consider pairs from S_p = {(x_s^i, x_o^i)}, whose two images come from the same region, as positive samples, and pairs from S_n = {(x_s^i, x_o^j)} (i ≠ j) as negative samples. Different from the Siamese network, BNN contains two separate feature extraction networks: a SAR network f_s(x_s; θ_s) and an optical network f_o(x_o; θ_o) with parameters θ_s and θ_o, extracting features from the SAR and optical images x_s and x_o, respectively. To reduce the feature dimension, following the feature extraction backbone, we apply a 1 × 1 convolution layer and a 4 × 4 average-pooling layer to the feature map. Finally, a linear layer with a sigmoid activation function projects the feature map into the n-dimensional common feature representations:

z_s = f_s(x_s; θ_s),  z_o = f_o(x_o; θ_o).

BNN then outputs the normalized Euclidean distance between z_s and z_o to measure the relevance of the input SAR-optical image pair:

d(x_s, x_o) = (1/√n) ‖z_s − z_o‖_2,

where n is the dimension of z_s and z_o. The distance indicates whether the input pair (x_s, x_o) has a potential relation: the smaller the distance, the more relevant the pair. Specifically, the distance between positive samples tends to 0, while the distance between negative samples is pushed toward 1. Therefore, the losses on positive and negative samples are set as

L_p = (1/2) Σ_{(x_s, x_o) ∈ S_p} d(x_s, x_o)^2,
L_n = (1/2) Σ_{(x_s, x_o) ∈ S_n} (d(x_s, x_o) − 1)^2.

Hence, learning the common representations of SAR-optical images is cast as a binary classification problem, and the overall loss of BNN can be written as

L = L_p + α L_n,

where α is a hyperparameter balancing the weights of the positive and negative losses. The best weights (θ_s*, θ_o*) are then obtained by solving the optimization problem

(θ_s*, θ_o*) = arg min_{θ_s, θ_o} L(θ_s, θ_o).

Optical-SAR Image Joint Intelligent Interpretation.

Since BNN projects optical-SAR image patches into a common feature subspace, the model can well mine the correlation between optical and SAR images and thus improve the feature learning ability for optical-SAR images.
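As a concrete illustration, the matching distance and loss described above can be sketched in a few lines of NumPy. This is our simplified sketch, not the authors' code: batch means stand in for the sums over S_p and S_n, and the feature vectors are assumed to be sigmoid outputs in [0, 1].

```python
import numpy as np

def bnn_distance(z_s, z_o):
    """Normalized Euclidean distance between common-space features.

    z_s, z_o: arrays of shape (batch, n) with sigmoid outputs in [0, 1],
    so the normalized distance is also bounded in [0, 1].
    """
    n = z_s.shape[-1]
    return np.linalg.norm(z_s - z_o, axis=-1) / np.sqrt(n)

def bnn_loss(d_pos, d_neg, alpha=1.0):
    """BNN objective: pull positive pairs toward distance 0,
    push negative pairs toward distance 1, balanced by alpha."""
    loss_pos = 0.5 * np.mean(d_pos ** 2)
    loss_neg = 0.5 * np.mean((d_neg - 1.0) ** 2)
    return loss_pos + alpha * loss_neg
```

Perfectly matched positives (d = 0) and fully separated negatives (d = 1) give zero loss, which is the binary-classification view taken in the text.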
Based on BNN, we propose the optical-SAR image joint interpretation framework, which jointly utilizes optical and SAR features enhanced by BNN to improve the performance of specific interpretation tasks on optical-SAR/SAR/optical images. As depicted in Figure 3, we present two different usage scenarios of the proposed BNN-based optical-SAR image joint intelligent interpretation framework. As shown in Figure 3(a), for applications of optical-SAR fusion intelligent interpretation, such as object detection, classification, and segmentation, our framework can learn common representations from SAR and optical images and enhance their feature learning ability for better interpretation performance under all-weather, full-time conditions. As for applications of crossmodal intelligent interpretation, see Figure 3(b), our framework first utilizes BNN to project optical and SAR images into a common feature space and mines their complementary and correlated information through optical-SAR image matching. Then, the feature extraction module of BNN for the image modality to be interpreted is used as the pretrained model for feature embedding in the specific crossmodal intelligent interpretation task. In this way, plentiful complementary features transferred from images of the other modality during the learning of the common feature space can be used to enhance the feature embeddings to be interpreted. Benefiting from the enhanced feature embeddings, our framework can effectively improve the interpretation performance on crossmodal SAR/optical images.
Take SAR ship detection as an example. SAR ship detection in complex scenes is a highly challenging task, and CNN-based SAR ship detection methods have drawn considerable attention because of their powerful feature embedding ability. Due to the scarcity of labeled SAR images, pretraining is commonly adopted to support these CNN-based SAR ship detectors. However, as SAR images are completely different from optical images, directly leveraging ImageNet [34] pretraining can hardly yield a good ship detector. Our proposed framework can instead transfer rich texture features from optical images to SAR images to obtain a feature extraction model with better SAR feature embedding capabilities. Specifically, our framework derives a SAR feature embedding operator from common representation learning based on the optical-SAR image matching task using BNN.
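The transfer step can be pictured as copying the BNN SAR-branch weights into the detector backbone wherever parameter names and shapes line up, leaving the detection heads at their own initialization. The helper below is a hypothetical sketch (state dicts are modeled as plain name-to-array mappings), not the authors' implementation:

```python
import numpy as np

def transfer_backbone_weights(bnn_sar_state, detector_state):
    """Initialize a detector backbone from the pretrained BNN SAR branch.

    Weights whose names and shapes match are copied over; all other
    detector parameters (e.g., detection heads) keep their own values.
    """
    merged = dict(detector_state)
    for name, weight in bnn_sar_state.items():
        target = merged.get(name)
        if target is not None and np.shape(target) == np.shape(weight):
            merged[name] = weight
    return merged
```

The same idea applies whether the backbone is ResNet50 for faster R-CNN or Darknet53 for YOLOv3: only the shared backbone parameters are transferred before fine-tuning.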

QXS-SAROPT Dataset
To fully exploit the relevance of optical and SAR images and verify the effectiveness of our proposed framework, a large, perfectly aligned optical-SAR image dataset with diverse scenes at a high resolution is needed. Considering that the existing optical-SAR image matching datasets either lack scene diversity or have a low resolution, we have published the QXS-SAROPT dataset, which contains 20,000 pairs of SAR and optical image patches with a high resolution of 1 m extracted from multiple Gaofen-3 and Google Earth [32] scenes. As far as we know, QXS-SAROPT is the first dataset to provide high-resolution 1 m × 1 m coregistered SAR and optical satellite image patches covering three big port cities of the world: San Diego, Shanghai, and Qingdao. The coverage of these images is shown in Figure 4. Algorithm 1 shows the procedure for the QXS-SAROPT dataset construction. Finally, 20,000 high-quality image patch pairs are preserved in our dataset, some of which are shown in Figure 5 as examples.

Experiments
To verify the effectiveness and superiority of our BNN-based optical-SAR image joint intelligent interpretation framework, the framework is applied to one typical optical-to-SAR crossmodal object detection task. We conduct optical-to-SAR crossmodal object detection experiments on four benchmark SAR ship detection datasets. We select two representative CNN-based ship detection methods, faster R-CNN [35] and YOLOv3 [36], as the benchmarks in our work. We first utilize BNN to pretrain the feature extraction module for the selected SAR ship detectors, namely, the ResNet50 [8] backbone for faster R-CNN [35] and the Darknet53 [36] backbone for YOLOv3 [36], based on QXS-SAROPT with 14,000 image pairs as the training set and the remaining 6,000 image pairs as the testing set. Then, the pretrained model, with better SAR feature embedding capabilities obtained by common representation learning from optical and SAR images, is used for fine-tuning the corresponding SAR ship detector.
The AIR-SARShip-1.0 dataset consists of 31 high-resolution large-scale 3000 × 3000 images from the Gaofen-3 satellite. 21 images are randomly selected as training and validation data, and the remaining 10 images are used for testing. The AIR-SARShip-2.0 dataset includes 300 images of size 1000 × 1000 with resolutions ranging from 1 m to 5 m from the Gaofen-3 satellite. 210 images are randomly selected as training and validation data, and the remaining 90 images are used for testing. Images in the AIR-SARShip-1.0 and AIR-SARShip-2.0 datasets are cropped into 512 × 512 pixels with a 0.5 overlap. The HRSID dataset contains 5604 SAR images of size 800 × 800 and is divided into training and testing sets at a ratio of 13 : 7. The resolutions of images in the SSDD dataset range from 1 m to 15 m, and its 1160 images are divided into 928 images for training and 232 images for testing.
For faster R-CNN [35], we directly input each image of AIR-SARShip-1.0, AIR-SARShip-2.0, and HRSID into the network.

Algorithm 1: Construction of the QXS-SAROPT dataset.
1. Select three SAR images acquired by the Gaofen-3 satellite [31], which contain rich land cover types such as inland, offshore, and mountain areas. The spatial resolution of the SAR imagery is 1 m × 1 m per pixel.
2. Download the optical images of the corresponding areas from Google Earth [32] with a spatial resolution of 1 m × 1 m.
3. Cut the whole optical-SAR image pair into several subregion image pairs according to the complexity of the land coverage. After that, we can register the subregion image pairs separately instead of directly registering the whole image pair.
4. Manually locate matching points of the subregion optical-SAR image pairs, selected as geometrically invariant corner points of buildings, ships, roads, etc.
5. Use existing automatic image registration software to register the subregion optical-SAR image pairs. The optical imagery is registered to the fixed SAR image through bilinear interpolation.
6. Crop the registered subregion optical-SAR image pairs into small patches of 256 × 256 pixels with 20% overlap between adjacent patches.
7. Double-check all image patches manually to ensure that every patch contains meaningful information and texture. Remove indistinguishable or flawed patches, such as near-duplicate scenes, texture-less sea, or patches with visible mosaicking seamlines.

BNN with either ResNet50 [8] or Darknet53 [36] as the backbone is trained with SGD for 200 epochs with a batch size of 20. The initial learning rate is set to 0.01 and then divided by a factor of 2 at the 30th and 100th epochs. The SAR and optical images are encoded into a 50-dimensional common feature subspace. The ratio of positive to negative samples is set to 1 : 1, and the adjusting factor α = 1.
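The cropping step of the dataset construction (256 × 256 patches with 20% overlap between adjacent patches) amounts to tiling each registered subregion with a fixed stride along each axis. A minimal sketch follows; the stride rounding and the snapping of the last patch to the image border are our assumptions:

```python
def patch_origins(length, patch=256, overlap=0.2):
    """Top-left coordinates of patches along one image axis.

    Adjacent patches share `overlap` of their width; the final patch is
    snapped to the border so that no pixels at the edge are dropped.
    """
    stride = max(1, int(round(patch * (1 - overlap))))
    origins = list(range(0, max(length - patch, 0) + 1, stride))
    if origins[-1] + patch < length:
        origins.append(length - patch)  # cover the remaining border strip
    return origins
```

The same routine with patch=512 and overlap=0.5 reproduces the 512 × 512 crops with 0.5 overlap used for the AIR-SARShip images.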

SAR Ship Detectors.
For the faster R-CNN [35] benchmark, all models are trained with SGD for 14 epochs with 0.0001 weight decay and 0.9 momentum, and the batch size is set to 8. The initial learning rate is 0.02 and is then divided by 10 at the 8th and 12th epochs. For the YOLOv3 [36] benchmark, all models are trained with SGD for 240 epochs with 12 images per minibatch. The initial learning rate is set to 0.001 and is then divided by 10 at the 160th and 200th epochs. The IoU threshold is set to 0.5 during training and testing for rigorous filtering of bounding boxes with low precision. Warm-up [8] is used for the first 500 iterations of the training stage to avoid gradient explosion. The same settings are applied in all experiments for a fair comparison. Table 1 shows the results of optical-SAR image matching using BNN [33] on the QXS-SAROPT dataset, which suggest that BNN achieves outstanding performance with both the ResNet50 [8] and Darknet53 [36] backbones. Specifically, the matching accuracy based on ResNet50 and Darknet53 reaches 82.9% and 82.8%, respectively, demonstrating that BNN can learn useful common representations and well predict the relationship between SAR and optical images. Image pairs with different matching results by BNN are shown in Figure 6.
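The schedule for the faster R-CNN benchmark (warm-up over the first 500 iterations, then 10x drops at the 8th and 12th epochs) can be sketched as a small function. The linear warm-up shape and 0-based epoch indexing are our assumptions, since the text does not specify them:

```python
def detector_lr(iteration, iters_per_epoch, base_lr=0.02,
                warmup_iters=500, milestones=(8, 12), gamma=0.1):
    """Per-iteration learning rate: linear warm-up, then step decay.

    The rate ramps linearly to base_lr over the first warmup_iters
    iterations, then is multiplied by gamma at each milestone epoch.
    """
    if iteration < warmup_iters:
        return base_lr * (iteration + 1) / warmup_iters
    epoch = iteration // iters_per_epoch
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops
```

With base_lr=0.001 and milestones=(160, 200), the same function describes the YOLOv3 schedule.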

Optical-SAR Image Matching.
To explore the relationship between the training set size and the matching results, we randomly select 4,000 and 8,000 optical-SAR image pairs as training sets to train BNN on ResNet50. Table 2 shows the results for three training set sizes, which indicate that BNN can learn a good common representation even with a small number of training image pairs and that more training data leads to better matching results. Besides, we show the accuracy, precision, and recall curves of BNN with 8,000 image pairs as the training set in Figure 7, which illustrate the convergence process of BNN. Table 3 shows the average precision (AP) of SAR ship detection results on four ship detection datasets using the ImageNet pretraining-based SAR ship detector (ImageNet-SSD) and our BNN-based SAR ship detector (BNN-based-SSD) pretrained on QXS-SAROPT. As shown in Table 3, compared with ImageNet-SSD, the AP of the detection results is generally improved by BNN-based-SSD. In particular, on the SAR ship detection dataset AIR-SARShip-1.0 [37], improvements of 1.32% and 1.24% are achieved using the two-stage detection benchmark faster R-CNN [35] and the one-stage detection benchmark YOLOv3 [36], respectively. The average precision of ImageNet-SSD and BNN-based-SSD during training on the test sets of HRSID and AIR-SARShip-2.0 with the two detectors is displayed in Figure 8. Taking YOLOv3 on the AIR-SARShip-2.0 dataset as an example, BNN-based-SSD achieves higher average precision than ImageNet-SSD throughout the whole training process, indicating the significant improvement of our BNN-based-SSD. Similar phenomena are also observed on the other datasets for both benchmarks, demonstrating the superiority of our BNN-based optical-SAR image joint intelligent interpretation framework.
All these improved performances show that our framework can well enhance the feature extraction capability of SAR ship detectors through common representation learning with BNN and thus boost ship detection in SAR images, even with no additional annotation information for ships. To qualitatively compare the two methods, we visualize some detection results in Figure 9, which shows that our BNN-based-SSD clearly outperforms ImageNet-SSD and significantly reduces missed detections and false alarms.

Conclusion and Future Work
In this paper, we propose a bridge neural network- (BNN-) based optical-SAR image joint intelligent interpretation framework, which can effectively improve the generic feature extraction capability for optical and SAR images by mining their feature correlation through matching tasks with BNN, and thereby improve the accuracy and broaden the application scenarios of specific optical-SAR image joint intelligent interpretation tasks. To fully exploit the correlation between optical and SAR images and ensure the strong representation learning ability of BNN, we publish the QXS-SAROPT dataset, containing 20,000 optical-SAR patch pairs from multiple scenes at a high resolution of 1 meter. Experimental results on the optical-to-SAR crossmodal object detection task demonstrate the effectiveness and superiority of our framework. Notably, based on the QXS-SAROPT dataset, our framework achieves up to 96% accuracy in SAR ship detection. This research is in its early stage. In the future, we will consider exploring the performance of the proposed framework on optical-SAR fusion intelligent interpretation tasks, such as classification of land use and land cover and building segmentation. To support research in intelligent interpretation fusing optical-SAR data, we will add label annotations and positions for scenes/objects of interest to every patch pair of the QXS-SAROPT dataset. In addition, to further explore the potential value of the QXS-SAROPT dataset, we are going to release an improved version of the dataset in the future, which will cover more land areas with versatile scenes and differently sized patch pairs suitable for various optical-SAR data fusion tasks.
At a more macroscopic level, there are plentiful aspects that deserve deeper investigation. Currently, our approach to interpreting multimodal remote sensing images is verified by experiments on the ground. However, the onboard processing of remote sensing images will be a trend in the future.
Unfortunately, running deep learning models tends to be a high-power-consumption process, let alone under the tight constraints of onboard memory and computing resources. In this case, deep learning model compression is an effective and necessary technique for achieving onboard processing in our future work. The purpose of model compression is to obtain a model with fewer parameters, less computation, and a smaller memory footprint, without significantly diminished accuracy. Popular model compression methods include pruning [41], quantization [42], low-rank approximation and sparsity [43], and knowledge distillation [44,45]. Furthermore, the formation of SAR images from echoes is currently the first, inevitable step of SAR data processing, based on algorithms such as back projection [46], compressed sensing [47], or classical signal processing. Therefore, the SAR application pipeline consists of multiple operations and a variety of complex calculations. In our future work, we will attempt to develop a deep learning framework that performs an integrated SAR processing workflow end to end, from the reflected echoes to the interpretation results. This will help to reduce the complexity of the onboard processor and further improve processing efficiency.

Conflicts of Interest
The authors declare that there are no conflicts of interest.