Multisensor Remote Sensing Imagery Super-Resolution with Conditional GAN

Despite the promising performance that deep convolutional neural networks have exhibited on benchmark datasets for single image super-resolution (SISR), existing methods suffer from two underlying limitations. First, current supervised learning-based SISR methods for remote sensing satellite imagery do not use paired real sensor data; instead, they operate on simulated high-resolution (HR) and low-resolution (LR) image pairs (typically HR images with their bicubic-degraded LR counterparts), which often yields poor performance on real-world LR images. Second, SISR is an ill-posed problem, and the super-resolved image produced by a discriminatively trained network with an l_p norm loss is an average of the infinitely many possible HR images and thus always has low perceptual quality. Although this issue can be mitigated by a generative adversarial network (GAN), it is still hard to search the whole solution space and find the best solution. In this paper, we focus on real-world applications and introduce a new multisensor dataset for real-world remote sensing satellite imagery super-resolution. In addition, we propose a novel conditional GAN scheme for the SISR task which can further reduce the solution space. Therefore, the super-resolved images have not only high fidelity but also high perceptual quality. Extensive experiments demonstrate that networks trained on the introduced dataset obtain better performance than those trained on simulated data, and that the proposed conditional GAN scheme achieves better perceptual quality while obtaining comparable fidelity over state-of-the-art methods.


Introduction
In recent years, remote sensing satellite imagery has been widely used in various fields [1][2][3][4]. It is an important data source for understanding the earth and has a wide range of applications, which has attracted substantial research attention. Among its applications, high-resolution remote sensing data is in great demand [5,6]. Due to the undersampling effect caused by the size and alignment density of the charge-coupled device (CCD), the sensor confuses high-frequency information with the low-frequency information of the image [7]. However, enhancing resolution by improving the imaging hardware is challenging because of cost and manufacturing limitations. Therefore, software-based image super-resolution (SR) is more attractive in practice [8].
The goal of SR is to exceed the limit of the sensor and improve the resolution of the image, i.e., to increase the number of pixels and provide better spatial details than the original image obtained by the sensor [9,10]. SR methods can generally be divided into two categories: single-image super-resolution (SISR) [11] and multi-image super-resolution (MISR) [12]. MISR usually requires a set of low-resolution (LR) images with subpixel misalignment to reconstruct the high-resolution (HR) image [13][14][15]. Practically, this approach is generally computationally expensive and numerically limited to small increases in resolution (by factors smaller than 2) [16,17]. These limitations led to the development of SISR, which recovers an HR image from only one of its LR counterparts. In the field of remote sensing, SISR is usually adopted because it can super-resolve satellite images without requiring multiple images of the same scene, avoiding the step of accurate registration or the need for a satellite constellation [18,19].
SISR is an ill-posed inverse problem: because the LR image does not contain the complete information of the ground truth (i.e., the HR image), there exist multiple SR solutions for the same LR image instead of a unique one. This can be mitigated by adding reliable prior information as a regularization to constrain the solution space [11]. Over the past decades, numerous SISR methods have been developed. Pioneering works focused on heuristic algorithms and hand-crafted features, leveraging interpolation techniques based on spatial structure information, for example, multisurface fitting [11] and wavelet transformation [20]. However, the results are not satisfactory due to the complexity of the SISR problem. Learning-based SISR methods were then proposed to improve the high-frequency details of image patches, such as sparse coding [21], neighborhood regression methods [22], and mapping functions [23]. However, when the model parameters become complex, SR performance deteriorates significantly.
In recent years, with the development of deep learning (DL) in the field of computer vision [24], many deep neural networks (DNNs) have been proposed for the SR task and have outperformed traditional SR methods in both metrics and visual quality. Among them, Dong et al. [25] first introduced a convolutional neural network (CNN) for the SR task, which tries to learn a nonlinear mapping from LR to HR in an end-to-end manner. The training data are paired LR-HR images, where the LR images are typically bicubic-degraded versions of the HR images. By adopting deeper networks and well-designed architectures, a multitude of top-performing methods were proposed and achieved state-of-the-art performance [26][27][28][29][30][31][32][33][34][35]; this remains the mainstream research direction in SISR. However, despite their promising performance on benchmark datasets, there are two underlying limitations to existing SR methods.
First, because it is difficult to obtain paired high- and low-resolution images in the real world for training, existing methods are usually discriminatively trained on simulated datasets, i.e., HR images with their bicubic-degraded LR counterparts. Consequently, the trained SR networks have poor generalization capacity and often yield poor performance when directly used to super-resolve real-world data, for example, remote sensing satellite imagery. Therefore, it is necessary to consider the specific degradation from HR images to LR images according to the application scenario, so as to train a more robust network with better generalization capacity.
Second, the purpose of SR is typically to restore the high-frequency information and details lost in the LR image so as to achieve both high fidelity and good visual perceptual quality. Fidelity measures the degree of pixel-wise distortion between super-resolved images and HR images, while perceptual quality metrics describe the difference between their distributions. Note that high perceptual quality does not necessarily imply high fidelity. Although most research still targets fidelity, perceptual super-resolution research has begun to flourish; its results are less blurry and more realistic [36]. Theoretically, it can be proved that a super-resolved image from a network trained discriminatively with an l_p norm loss is the average of the solution space corresponding to the LR image [37]. Searching the whole solution space is not trivial. Even worse, the super-resolved images are usually overly blurry and smooth, lacking high-frequency details and textures in high-variance regions. Ledig et al. [38] introduced a generative adversarial network (GAN) for SR to drive the reconstruction towards the natural image manifold and produce perceptually more convincing solutions. However, although the generated images are more realistic, this usually decreases the fidelity of the super-resolved images. Therefore, how to further reduce the solution space and obtain super-resolved images with high perceptual quality while keeping high fidelity remains a problem.
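The averaging effect of l_p losses mentioned above can be illustrated with a small numerical sketch (hypothetical toy data, not from the paper): when several sharp HR images are all consistent with one LR input, the single prediction that minimizes the expected squared error over them is their pixel-wise mean, which is blurrier than any individual solution.

```python
import numpy as np

# Three hypothetical sharp HR "solutions" consistent with the same LR input.
candidates = np.array([
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
])

def expected_l2(pred, targets):
    """Average squared error of one prediction against all plausible targets."""
    return np.mean((targets - pred) ** 2)

# The pixel-wise mean minimizes the expected L2 loss over the solution set...
mean_pred = candidates.mean(axis=0)

# ...but it is "blurry": its values sit between the sharp 0/1 candidates.
for c in candidates:
    assert expected_l2(mean_pred, candidates) <= expected_l2(c, candidates)
print(mean_pred)  # flat intermediate values rather than sharp 0/1 edges
```

This is just the standard fact that the mean minimizes mean squared error; a GAN term is one way to push the prediction back towards a single sharp member of the solution set.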
In this paper, we propose a novel conditional GAN scheme to super-resolve multisensor remote sensing satellite imagery. Many works apply GANs in the conditional setting, such as on discrete labels [39], text [40], and medical images [41]. With the global availability of Multispectral Instrument (MSI, on-board the Sentinel-2 satellite) data and Operational Land Imager (OLI, on-board the Landsat-8 satellite) data, which have 10 m and 30 m ground sample distance (GSD), respectively, we can construct a dataset of LR-HR image pairs with OLI data as the LR images and MSI data as the HR images. This dataset, termed OLI2MSI, is a collection of paired real-world multisensor LR-HR data built by carefully selecting relatively cloud-free MSI and OLI images of the same location acquired within a suitable temporal window. Experiments demonstrate that networks trained on this dataset perform better than those trained on a simulated dataset (e.g., bicubic down-sampling), which properly addresses the first limitation. To address the second limitation, we introduce a conditional GAN to further reduce the solution space. Experiments show the superiority of our method in both fidelity and perceptual quality of the super-resolved images.
Our contributions can be summarized as follows: (i) Compared with the current mainstream SR methods which use simulated LR-HR paired data for training, we focus on real-world applications and use a multisensor dataset, OLI2MSI, to train the SR model, which can be applied to super-resolve Landsat-8 data and achieves better performance than models trained on simulated data. (ii) We introduce a new dataset, OLI2MSI, a real-world multisensor dataset for remote sensing imagery SR with an upscale factor of 3. Images taken from Sentinel-2 MSI serve as ground truth for the LR images, which are taken from Landsat-8 OLI. (iii) We propose a novel conditional GAN scheme for the SR task which can further reduce the solution space. Thus, the super-resolved images not only have high fidelity but are also more realistic and have high perceptual quality.

OLI2MSI Dataset.
The sensor OLI, on-board Landsat-8 [42], collects image data for 9 shortwave spectral bands over a 190 km swath with a 30 meter (m) spatial resolution for all bands except the 15 m panchromatic band [43]. Sentinel-2 is an earth observation mission from the Copernicus program developed by the European Space Agency (ESA). The mission is a constellation of two identical satellites, Sentinel-2A and Sentinel-2B, phased 180 degrees from each other on the same orbit to meet the requirement of frequent revisits. The on-board instrument MSI has 13 spectral channels in the visible/near-infrared and shortwave infrared spectral range: four bands at 10 m, six bands at 20 m, and three bands at 60 m spatial resolution [44]. The orbital swath width is 290 km. Within the 13 bands, the 10 m spatial resolution allows for continued collaboration with the SPOT-5 and Landsat-8 missions. Both OLI and MSI have stringent radiometric performance requirements and provide well-calibrated sensor data. Barsi et al. [45] demonstrated that OLI and MSI show stable radiometric calibration, with consistency between matching spectral bands to approximately 2.5%.
This creates a prerequisite for us to make a paired dataset based on OLI and MSI images.
In order to build the paired dataset for SR, the two sensors must image common ground targets in the same location and in the same spectral bands of the electromagnetic spectrum. For the sensors investigated in this work, i.e., OLI and MSI, there are 6 common bands (bands 2, 3, 4, 5, 6, and 7 for OLI and bands 2, 3, 4, 8a, 11, and 12 for MSI, respectively). Because the spatial resolutions of these 6 bands are not all the same for MSI, we finally choose the three bands of blue, green, and red, resulting in an upscale factor of 3 from OLI to MSI. We select the southwest region of China as the study area, which contains abundant surface and ecological types, such as forest, farmland, lakes, and urban residential areas.
In order to minimize the difference in atmospheric conditions and environmental changes, we search all the OLI level-1 data and MSI level-1c data in the study area acquired within a temporal window of less than 1 hour and then filter them by selecting the data least contaminated by cloud. The final selected scenes (granules, in the case of Sentinel-2) are listed in Table 1, with their footprints shown in Figure 1. All the Landsat-8 OLI data used in this study were downloaded from the United States Geological Survey (USGS) Earth Resources Observation and Science (EROS) Data Center (https://earthexplorer.usgs.gov/), and the Sentinel-2 MSI data can be accessed from the Copernicus Open Access Hub (https://scihub.copernicus.eu/). The digital number (DN) values of the OLI and MSI level-1 data are converted to top-of-atmosphere (TOA) reflectance. Atmospheric correction is not performed, because atmospheric conditions are similar given that the temporal window between matched OLI and MSI scenes is less than one hour. Since Landsat-8 and Sentinel-2 have spatial misalignment that varies regionally depending on ground control point quality [46], all the OLI data are resampled to 10 m resolution using bilinear upsampling, followed by a registration step to the MSI images using the ECC algorithm [47]. All data are then cropped into 480 × 480 pixel images without overlapping. We finally obtain 5325 images and randomly divide them into two parts: 5225 images for the training set and 100 images for the test set. Some LR-HR image-pair samples from the OLI2MSI training set are shown in Figure 2, where the HR images are less blurry and have more details than the LR ones.
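Two of the preprocessing steps above can be sketched concretely. The snippet below converts Landsat-8 DN values to TOA reflectance with the standard USGS rescaling formula (the multiplicative/additive coefficients and sun elevation are per-scene values read from the scene's MTL metadata; the numbers used here are typical placeholders, not values from this dataset), and cuts an array into non-overlapping 480 × 480 tiles:

```python
import numpy as np

def dn_to_toa_reflectance(dn, refl_mult, refl_add, sun_elev_deg):
    """USGS Landsat-8 rescaling: TOA reflectance corrected for sun elevation."""
    rho = refl_mult * dn.astype(np.float64) + refl_add
    return rho / np.sin(np.radians(sun_elev_deg))

def tiles(img, size=480):
    """Yield non-overlapping size x size crops, dropping partial edge tiles."""
    h, w = img.shape[:2]
    for r in range(0, h - size + 1, size):
        for c in range(0, w - size + 1, size):
            yield img[r:r + size, c:c + size]

# Placeholder coefficients (real values come from the per-scene MTL file).
dn = np.full((960, 1440), 10000, dtype=np.uint16)
toa = dn_to_toa_reflectance(dn, refl_mult=2e-5, refl_add=-0.1, sun_elev_deg=90.0)
print(toa[0, 0])              # 0.1 for these placeholder coefficients
print(len(list(tiles(toa))))  # 6 tiles from a 960 x 1440 array
```

The ECC registration step is omitted here; it operates on the bilinearly upsampled OLI band before tiling.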

Methodology.
A generative adversarial network (GAN) is composed of a generator G and a discriminator D, and aims to model a distribution by forcing the samples generated by G from random noise to be indistinguishable from real samples as judged by D. The GAN objective is to find a Nash equilibrium of the following two-player min-max problem, as described in its original form [48]:

\min_G \max_D \mathcal{L}_{GAN}(G, D) = \mathbb{E}_{y \sim q_{data}}[\log D(y)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))],

where z is a random noise vector drawn from distribution p, and y is a real sample drawn from the data distribution q_data. For our task, the objective of the generator G is to map the input LR images to super-resolved images, while the discriminator D aims to distinguish HR images from the super-resolved images. In this context, the input of G is not a random noise vector z but an observed LR image, termed x hereafter, and y is the corresponding real HR image. Several works have used GANs for the SR task [38,49], but only in an unconditional mode. These works usually rely on other terms (e.g., the L2 or L1 norm) to force the output to be conditioned on the input, which causes blurry effects and low perceptual quality. Though the GAN framework has been adopted to reduce the solution space, the coefficient of the GAN loss is usually so small that its capacity to reduce the solution space is limited. In this work, we apply the GAN in the conditional setting, which can further reduce the solution space. Consequently, the super-resolved images not only have high fidelity but are also more realistic and have high perceptual quality.
In contrast to vanilla GANs, conditional GANs [39] learn a mapping from an observed LR image x to the super-resolved one G(x), while the discriminator models a conditional distribution to distinguish the "real" HR image y from the "fake" generated image on the condition of x [50]. The objective of a conditional GAN can be expressed as follows:

\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x}[\log(1 - D(x, G(x)))],

where the purpose of training G is to minimize this objective against that of D to maximize it, i.e., G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D). Practically, we adopt the LSGAN [51] variant of this optimization problem, which replaces the log-likelihood terms with least-squares terms. In order to test the importance of conditioning the D, we compare against SR models trained discriminatively (i.e., without the GAN framework) and against an unconditional variant in which the D does not observe x:

\mathcal{L}_{GAN}(G, D) = \mathbb{E}_{y}[\log D(y)] + \mathbb{E}_{x}[\log(1 - D(G(x)))].

In practice, training GANs using only \mathcal{L}_{cGAN}(G, D) may result in mode collapse. Previous GAN-based SR works mix the GAN objective with a content loss, such as the L2 or L1 loss. For the proposed conditional GAN scheme for the SR task, the job of D remains unchanged, i.e., to distinguish the "real" HR images from the "fake" generated images, except that the conditional GAN models a conditional probability distribution while the vanilla GAN is unconditional. The G aims not only to fool the D but also to generate results that are close to the ground truth, which is what the L2 or L1 loss provides as a fidelity term. We also adopt this strategy, using the L1 loss as the fidelity term since it encourages less blurring [52]:

\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y}[\lVert y - G(x) \rVert_1].

This is the most widely used optimization target for image SR, because it effectively captures low-frequency information. Though it can achieve high peak signal-to-noise ratio (PSNR) [53] and structural similarity (SSIM) [54] scores, the super-resolved images often lack high-frequency content and have low perceptual quality with overly smooth textures. Zhang et al. [55] have demonstrated that PSNR and SSIM are simple, shallow functions and proposed the Learned Perceptual Image Patch Similarity (LPIPS) metric, which better accounts for many nuances of human perception. In this work, we introduce LPIPS as a perceptual loss term, which drives the SR solution space towards the natural image manifold, producing perceptually more convincing solutions:

\mathcal{L}_{LPIPS}(G) = \mathbb{E}_{x,y}[\mathrm{LPIPS}(G(x), y)],

where LPIPS is the pretrained network from [55]. The overall loss function can be formulated as follows:

\mathcal{L} = \mathcal{L}_{L1}(G) + \lambda_1 \mathcal{L}_{cGAN}(G, D) + \lambda_2 \mathcal{L}_{LPIPS}(G),

where \lambda_1 and \lambda_2 are the weight coefficients of \mathcal{L}_{cGAN}(G, D) and \mathcal{L}_{LPIPS}(G), respectively.
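The combined generator objective can be sketched as follows. This is a minimal NumPy mock-up, not the authors' implementation: the adversarial term uses the LSGAN-style least-squares form on the conditional discriminator's patch-score map, the LPIPS term is assumed to be a scalar already computed by the pretrained network, and the weight values are placeholders (the paper does not state them here).

```python
import numpy as np

def l1_loss(sr, hr):
    """Fidelity term: mean absolute error between SR output and ground-truth HR."""
    return np.mean(np.abs(sr - hr))

def lsgan_g_loss(d_out_fake):
    """LSGAN generator term: push D's patch scores on G(x) towards 1 ("real")."""
    return np.mean((d_out_fake - 1.0) ** 2)

def total_g_loss(sr, hr, d_out_fake, lpips_value, lam1, lam2):
    """L = L1 + lam1 * cGAN + lam2 * LPIPS, matching the overall loss above."""
    return l1_loss(sr, hr) + lam1 * lsgan_g_loss(d_out_fake) + lam2 * lpips_value

# Toy example: a 2x2 "image" and a 1x1 patch-discriminator output map.
sr = np.array([[0.5, 0.5], [0.5, 0.5]])
hr = np.array([[1.0, 1.0], [1.0, 1.0]])
d_out_fake = np.array([[0.0]])  # D currently labels the generated patch "fake"
loss = total_g_loss(sr, hr, d_out_fake, lpips_value=0.2, lam1=1.0, lam2=1.0)
print(loss)  # 0.5 (L1) + 1.0 (adversarial) + 0.2 (LPIPS) = 1.7
```

In a real training loop the three terms would be differentiable tensors so gradients flow back into G; the structure of the sum is the point here.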

Network Architectures.
In this work, we focus on the conditional GAN scheme to further reduce the solution space, which allows the generator to learn solutions that not only have high fidelity but are also more realistic and have high perceptual quality. Thus, we directly adopt the state-of-the-art SR network DRN [35] as our generator G. As for the discriminator D, it aims to learn the conditional probability distribution of patches of the input and to discriminate between real HR patches from this distribution and fake super-resolved patches generated by G. Therefore, we design a conditional patch discriminator architecture, termed conditional PatchGAN, to capture the statistics of local patches only. For the SR task, i.e., a low-level computer vision task which mainly concerns the textures of local patches, it is not necessary to capture global contextual information, but only the information of each patch of suitable size within an image. To discriminate patches of an image, we use a fully convolutional patch discriminator as introduced in [50]. We use strided convolutions and no pooling layers to achieve a relatively large receptive field of an N × N patch (where N can be much smaller than the full size of the image). Therefore, the discriminator implicitly classifies each N × N patch separately as real or fake. The output of D is a heat map in which each pixel indicates how likely its surrounding patch is to be drawn from the learned patch distribution. See Figure 3 for architecture details of the discriminator.
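The patch size N is simply the receptive field of one output unit of the discriminator. Assuming a pix2pix-style stack of 4 × 4 strided convolutions (an assumption for illustration; the paper's exact layer configuration is in its Figure 3), N can be computed with the standard receptive-field recurrence:

```python
def receptive_field(layers):
    """layers: list of (kernel, stride) pairs, input to output.
    Returns the receptive field of one output unit in input pixels."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens the field by (k-1) input strides
        jump *= s             # effective stride of the next layer in input pixels
    return rf

# A pix2pix-style discriminator: three stride-2 and two stride-1 4x4 convs.
print(receptive_field([(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]))  # 70
# Removing one strided block shrinks the patch size N.
print(receptive_field([(4, 2), (4, 2), (4, 1), (4, 1)]))  # 34
```

This is how varying the number of blocks in D changes N in the ablation of Table 4: each extra strided block roughly doubles the patch size.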
Such a conditional PatchGAN discriminator, which penalizes structures at the scale of N × N patches, is sufficient to discriminate between HR images and images super-resolved by G on the condition of the input LR images. Compared to discriminators with a fully-connected layer at the end that output a scalar indicating how well the whole image fits the distribution of natural images, the conditional PatchGAN models the image as a Markov random field with independence between pixels separated by more than a patch diameter, which can be regarded as a form of texture loss [56][57][58]; it has fewer parameters and can be applied to images of arbitrary size. In addition, the patch size N can be flexibly adjusted via the stride of the convolution layers or the number of strided convolution layers. During the training process, we randomly crop a 192 × 192 patch from the HR image and the corresponding 64 × 64 patch from the LR image in the OLI2MSI dataset (for simulated data, the LR patch is the down-sampled version of the HR patch). We augment the input data with random flips and rotations before feeding them to the networks with a batch size of 16. For optimization, we use the Adam optimizer with β1 = 0.9 and β2 = 0.99. All the networks are trained for 200,000 iterations (≈900 epochs) with an initial learning rate of 3 × 10^-4, decayed to 1 × 10^-7 by a cosine annealing strategy. It takes about 2 days to train the proposed method with 2 TITAN Xp GPUs. Compared with a non-GAN network, a GAN-based one takes almost twice as long to train.
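The learning-rate schedule described above can be written out directly from the stated numbers (200,000 iterations, 3 × 10⁻⁴ decayed to 1 × 10⁻⁷ by cosine annealing); a minimal sketch, assuming the standard cosine-annealing form (an implementation would typically use a framework's built-in scheduler):

```python
import math

def cosine_annealing_lr(step, total_steps=200_000, lr_max=3e-4, lr_min=1e-7):
    """Cosine decay from lr_max at step 0 down to lr_min at the final step."""
    cos = math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cos)

print(cosine_annealing_lr(0))        # 3e-4 at the start of training
print(cosine_annealing_lr(100_000))  # roughly halfway between the extremes
print(cosine_annealing_lr(200_000))  # 1e-7 at the final iteration
```

The cosine shape keeps the rate near its maximum early on and decays it smoothly, with no restarts over the 200,000 iterations.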

Datasets and Implementation
In the process of testing, the D is discarded. It takes ≈0.5 seconds to super-resolve a 160 × 160 image with a TITAN Xp GPU. More details of testing and evaluation are introduced in the next two sections.

Evaluation Metrics.
We adopt four image quality metrics to quantitatively evaluate the quality of the super-resolved images. Besides the commonly used peak signal-to-noise ratio (PSNR) [53] and structural similarity index (SSIM) [54], which can be regarded as fidelity metrics measuring the difference between ground-truth HR images and super-resolved images, we also adopt two perceptual metrics, i.e., the learned perceptual image patch similarity (LPIPS) [55] metric and the naturalness image quality evaluator (NIQE) [59], which better account for many nuances of human perception. Note that NIQE is a no-reference image quality evaluator.
For fair comparison, all the PSNR and SSIM measurements are calculated on the Y-channel of the image, with a 5-pixel crop on each border. The version of the LPIPS metric we use is 0.1, and the default settings are adopted during training and inference. Before calculating the NIQE score, we train a custom NIQE model using all the HR images in OLI2MSI to model the distribution of remote sensing images. Among the four image quality metrics, higher PSNR and SSIM values indicate higher fidelity to the HR images, while lower LPIPS and NIQE scores mean better visual quality and better agreement with human perception.
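As a sketch of this fidelity protocol (Y-channel comparison with a 5-pixel border crop), a minimal PSNR implementation might look like the following; the peak value of 1.0 is an assumption for images scaled to [0, 1]:

```python
import numpy as np

def psnr_y(sr_y, hr_y, border=5, peak=1.0):
    """PSNR on Y-channel arrays after cropping `border` pixels from each side."""
    sr_c = sr_y[border:-border, border:-border]
    hr_c = hr_y[border:-border, border:-border]
    mse = np.mean((sr_c - hr_c) ** 2)
    if mse == 0:
        return float("inf")  # identical images: PSNR is unbounded
    return 10.0 * np.log10(peak ** 2 / mse)

# A uniform 0.1 error gives MSE 0.01, hence PSNR 20 dB, whatever the crop.
hr = np.zeros((16, 16))
sr = np.full((16, 16), 0.1)
print(psnr_y(sr, hr))  # 20.0
```

The border crop discards boundary pixels where SR networks are least reliable, so all compared methods are scored on the same interior region.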
In order to validate the effectiveness of the introduced OLI2MSI dataset for the remote sensing imagery SR task, we train all the networks on a simulated paired SR dataset in the way proposed in their original papers and then test them on the test set of OLI2MSI. The average values of the 4 quantitative evaluation metrics over the entire test set are given in Table 2. From Table 2, it can be seen that all the models trained on the OLI2MSI dataset perform better than those trained on the simulated dataset in terms of both fidelity and visual quality metrics, which further illustrates the necessity and effectiveness of adopting a multisensor dataset for the remote sensing imagery SR task. The proposed conditional GAN scheme for SR, which adopts the DRN network [35] as the generator (termed cDRSRGAN), is itself a GAN-based SR model. On the one hand, from the results trained on OLI2MSI, it achieves a good balance between fidelity and visual quality. On the other hand, as for the two other GAN-based methods, although they perform better on the visual quality metrics, they often behave poorly in fidelity and may cause artifacts in many cases. The visual comparisons in Figure 4 show that our model produces sharper edges and shapes, while the other baselines may give blurrier ones. In particular, SRGAN and ESRGAN have much lower LPIPS and NIQE scores, which is mainly due to the artifacts produced by the GAN, which are not real textures in the scene.
One point that needs to be emphasized again is that although cDRSRGAN does not have the best performance in either the fidelity terms (PSNR, SSIM) or the visual quality terms (LPIPS, NIQE), it produces super-resolved images that not only have high fidelity but are also more realistic with high perceptual quality, owing to the conditional GAN training scheme that further reduces the solution space. The visual comparison results in Figure 4 demonstrate the effectiveness of the proposed scheme in generating more accurate and visually promising super-resolved images.
In order to demonstrate the generalization capacity of the proposed method, we apply it to other Landsat8-OLI images representing distinct surface types. These images are in neither the training set nor the test set of the OLI2MSI dataset. From Figure 5, we can see that the images super-resolved by our method achieve promising performance, with sharper textures and more details. This is because what the network learns is the mapping from low-resolution to high-resolution images, i.e., the inverse of the degradation from Sentinel2-MSI images to Landsat8-OLI images, which is not tied to any particular land surface type.
Ablation Study on Conditional GAN Scheme.

We conduct an ablation study on the conditional GAN scheme and the introduced LPIPS loss and report the results in Table 3. Compared to the baseline (i.e., DRN), we adopt the vanilla GAN scheme to train the DRN (i.e., set the DRN as the generator, termed DRSRGAN) and obtain poorer (lower) PSNR and SSIM scores but better (lower) LPIPS and NIQE scores, as expected. DRSRGAN adopts the L1 loss as the content loss, which effectively captures low-frequency information, while the high-frequency details and textures in its super-resolved images mainly come from the artifacts caused by the GAN. Once the conditional GAN scheme is adopted (termed cDRSRGAN), the model yields even higher PSNR and SSIM and lower LPIPS and NIQE scores than the baseline, because the conditional GAN scheme differs from the L1 loss and reduces the solution space in another aspect, resulting in better performance in both fidelity and visual quality. It can also be seen that introducing the LPIPS loss as the perceptual loss yields more visually pleasing results with only a small drop in PSNR and SSIM. These results suggest that the conditional GAN scheme can effectively improve the reconstruction of HR images by introducing an additional constraint to reduce the solution space.
To test the effect of the receptive field size in the conditional PatchGAN discriminator, we vary the patch size N by changing the number of blocks in D. Table 4 demonstrates that super-resolution performance does not keep improving as the receptive field size increases. Using a PatchGAN with a small receptive field leads to some artifacts and therefore a low PSNR value. As N increases, the artifacts are alleviated slightly and the scores improve. However, an excessively large N yields considerably lower scores, because a moderately large N is already sufficient. Additionally, the conditional PatchGAN has more parameters as the number of blocks grows and thus becomes harder to train.

Figure 5: Visual results of the proposed method on Landsat8-OLI images with distinct land surface types. Note that these images are in neither the training set nor the test set of the OLI2MSI dataset. The first row shows the Landsat8-OLI LR images, the second row the images super-resolved by our method, and the third row the corresponding HR images from Sentinel2-MSI.

Conclusion
In this paper, we introduced a new multisensor paired super-resolution dataset (i.e., OLI2MSI) and proposed a novel conditional GAN scheme to super-resolve real-world remote sensing satellite imagery. OLI2MSI is a satellite remote sensing imagery dataset composed of Landsat8-OLI and Sentinel2-MSI images, where the OLI images serve as LR images and the MSI images are regarded as ground-truth HR images. Experiments demonstrate that networks trained on this dataset perform better than those trained on a simulated dataset (e.g., bicubic down-sampling). Furthermore, the proposed conditional GAN scheme can further reduce the solution space of SR. Thus, the super-resolved images not only have high fidelity but are also more realistic and have high perceptual quality. Extensive experiments show the superiority of our method in fidelity and perceptual quality over the considered baseline methods.

Data Availability
All the Landsat-8 OLI images used in this study can be downloaded from the United States Geological Survey (USGS) Earth Resources Observation and Science (EROS) Data Center (https://earthexplorer.usgs.gov/), and the Sentinel-2 MSI images can be accessed from the Copernicus Open Access Hub (https://scihub.copernicus.eu/). The OLI2MSI dataset introduced in this study can be downloaded from https://github.com/wjwjww/OLI2MSI.