Feature Enhancement Network for Object Detection in Optical Remote Sensing Images

Automatic and robust object detection in remote sensing images is of vital significance in real-world applications such as land resource management and disaster rescue. However, state-of-the-art natural image detection algorithms perform poorly when applied directly to remote sensing images, largely because of the variations in object scale and aspect ratio, indistinguishable object appearances, and complex background scenarios. In this paper, we propose a novel Feature Enhancement Network (FENet) for object detection in optical remote sensing images, which consists of a Dual Attention Feature Enhancement (DAFE) module and a Context Feature Enhancement (CFE) module. Specifically, the DAFE module is introduced to guide the network to focus on the distinctive features of the objects of interest and suppress useless ones by jointly recalibrating the spatial and channel feature responses. The CFE module is designed to capture global context cues and selectively strengthen class-aware features by leveraging image-level contextual information that indicates the presence or absence of the object classes. To this end, we employ a context encoding loss to regularize the model training, which helps the object detector understand the scene better and narrows the probable object categories in prediction. We build our proposed FENet by unifying DAFE and CFE into the framework of Faster R-CNN. In the experiments, we evaluate our proposed method on two large-scale remote sensing image object detection datasets, DIOR and DOTA, and demonstrate its effectiveness compared with the baseline methods.


Introduction
Object detection has always been a popular and important task in computer vision [1]. In recent years, the volume of remote sensing data has been exploding with the development of earth observation technologies. Faced with the need for automatic and intelligent understanding of remote sensing big data, multiclass object detection is becoming a key issue in remote sensing data analysis [2,3]. More recently, deep learning methods have achieved promising results on natural images owing to their powerful ability to exploit high-level feature representations, thus offering an opportunity for satellite image interpretation applications including urban planning, land resource management, and rescue missions.
However, object detection in optical remote sensing images remains a tough challenge due to the particular characteristics of the data, as shown in Figure 1. Firstly, compared with natural scene images that are usually captured by ground-level cameras with horizontal perspectives, remote sensing images are obtained from a bird's-eye view with a wide imaging area. Secondly, remote sensing images vary largely in object scale and aspect ratio. This is due not only to the difference in the Ground Sampling Distance (GSD) of aerial and satellite sensors but also to intraclass variations. Thirdly, the objects in remote sensing images often present different visual appearances and optical properties due to diverse imaging conditions such as viewpoint, illumination, and occlusion [3,4]. Last but not least, foreground objects are unevenly distributed and the background information is complex, especially in intricate landforms and urban scenarios. All of these issues pose great challenges for current state-of-the-art natural image detection algorithms.
To address these challenges to some extent, we propose a novel Feature Enhancement Network (FENet) for robust object detection in remote sensing images. Figure 2 shows the overview of our proposed network. On the one hand, remote sensing images often contain rich spatial and texture cues as well as complex background information, which is a collection of both useful and useless information. Therefore, there is a need to guide the network to focus on the features that are more distinguishable for the current object detection task. To this end, we design a Dual Attention Feature Enhancement (DAFE) module to explore discriminative feature representations in both spatial and channel dimensions. On the other hand, remote sensing images usually contain a rich variety of ground object categories, and a dataset cannot cover all the appearances of the objects of interest, which makes it hard for the object detector to infer the object categories we are concerned with. However, this trait has both advantages and disadvantages. The exposure of the ground objects or spatial patterns of the scenes provides useful context clues [5][6][7] for object classification and localization to some extent. For each object in the training procedure, an object label determines what category the object belongs to and a ground-truth box describes where the object is located, so the contextual information is not fully utilized. It is noticeable that scene-level context information such as correlated objects and surroundings plays a nonnegligible role in object category and location reasoning. Our inspiration is based on the observation that the contextual information in remote sensing images strongly complements object classification and localization. For example, airplanes often appear in airports rather than lakes or residential areas, and cars are more likely to appear on bridges, on overpasses, or in expressway service areas than in rivers or harbors. This motivates us to design a Context Feature Enhancement (CFE) module to leverage global contextual information to extract more semantic features.
Figure 1: Some example images of the DIOR dataset [3] used in our experiments, where the numbers above the bounding boxes indicate the object classes as follows: 1, airplane; 2, airport; 3, baseball field; 4, basketball court; 5, bridge; 6, chimney; 7, dam; 8, expressway service area; 9, expressway toll station; 10, golf field; 11, ground track field; 12, harbor; 13
Figure 2: Framework of our proposed Feature Enhancement Network (FENet) for object detection in remote sensing images. Building on the popular Faster R-CNN with FPN and adopting the backbone of ResNet-101, our FENet mainly consists of a Dual Attention Feature Enhancement (DAFE) module and a Context Feature Enhancement (CFE) module. The DAFE module is used to strengthen the feature representations of FPN by using the Dual Attention Fusion (DAF) of spatial attention and channel attention. The CFE module is used for capturing global semantic information for better classification and bounding box regression by using a context encoding loss.
In summary, the main contributions of our work are as follows. First, we present a Dual Attention Feature Enhancement (DAFE) module to guide the network to focus on the distinctive features of the objects of interest and suppress useless ones by reweighting the spatial and channel feature responses. Second, we design a Context Feature Enhancement (CFE) module to exploit global context cues and selectively strengthen class-aware features by leveraging image-level contextual information that indicates the presence or absence of an object class. Besides, we employ a context encoding loss to regularize the model training, which helps the object detector understand the scene better and narrows the probable object categories in prediction. Third, our DAFE and CFE modules are generic and thus can be easily applied to existing object detection methods. In this work, we propose a new Feature Enhancement Network (FENet) by unifying DAFE and CFE into the well-known object detection framework of Faster R-CNN. Fourth, we comprehensively evaluate our proposed method on two large-scale remote sensing image object detection datasets, namely, DIOR [3] and DOTA [8], and demonstrate its effectiveness compared with the baseline methods.

Related Work
2.1. Object Detection in Natural Images. Feature extraction plays an important role in object detection since it maps raw input data to high-level feature representations. Traditional methods like Histogram of Oriented Gradients (HOG) [9,10] and Scale Invariant Feature Transform (SIFT) [11] require careful manual engineering and a large amount of time when faced with considerable data examples.
On the contrary, deep learning-based methods can learn powerful feature representations directly and automatically from the raw input data. Therefore, deep learning architecture releases the heavy burden of traditional feature modeling and engineering and thus achieves superior results over traditional feature extraction methods. In recent years, the milestone frameworks of generic object detection can be broadly organized into two mainstream approaches: two-stage detection framework and one-stage detection framework [1,3].
Two-stage methods refer to region proposal generation at the first stage and the following evaluation of the region proposals. R-CNN [12] generates candidate proposals by selective search and becomes one of the pioneers in generic object detection. Fast R-CNN [13] outperforms R-CNN in both detection speed and accuracy with the idea of sharing feature extraction network for all region proposals. Then, an internal region proposal generation framework based on shared deep CNN arises, which shares the convolutional feature maps for region proposal generator and object detector. Typically, Faster R-CNN [14] proposed by Ren et al. designs a Region Proposal Network (RPN) for region proposal generation, encapsulating the task of proposal generation and the detection task in a single network with many shared convolution layers.
The one-stage framework directly predicts class probabilities and bounding box offsets in a unified manner. For example, YOLO [15] integrates category classification and bounding box regression into a unified network, which achieves faster detection speed but usually lower detection accuracy, especially when faced with clusters of small object instances. SSD [16] detects multiscale bounding boxes from multilevel feature maps with fully convolutional neural networks. RetinaNet [17] downweights the loss of numerous well-classified examples by reshaping the cross-entropy loss and surpasses two-stage methods without compromising detection speed.
Besides, detecting objects with multiscale CNN layers also promotes detection accuracy, since it is clear that the prediction of objects of different scales is suboptimal with the features from a single layer. An alternative way is to use feature pyramids [18]. FPN [19] achieves a top-down architecture to learn features with hierarchical convolution layers and variant scales, which has shown remarkable improvement as a generic feature extractor in several computer vision tasks including object detection.
Since remote sensing images can be obtained with a wide range of ground sample distance, the object size can be varied from tens to thousands of pixels with dramatic aspect ratios [3,20]. Compared with one-stage detection methods, most two-stage methods build proposal generation network firstly, which eliminates most of the easy negative examples and reaches a balanced trade-off in the training procedure. Consequently, we adopt the widely used two-stage detector Faster R-CNN with FPN [19] as our backbone in this paper for accurate detection performance.
Journal of Remote Sensing
In recent years, comprehensive studies have been made to exploit different solutions to the problems of object detection in remote sensing images. For example, for the problem of rotation variations of objects in remote sensing images, [20] designed a rotation-invariant layer to extract robust feature representations. References [24,32] proposed an effective rotation-invariant and Fisher discriminative CNN (RIFD-CNN) model to improve detection accuracy. Reference [25] presented a rotation-insensitive and context-augmented object detection method. Aiming at the multiscale object detection problem, [26] introduced a cross-scale feature fusion (CSFF) framework. Reference [27] developed an object detection method for remote sensing images by combining multilevel feature fusion and an improved bounding box regression scheme. Reference [33] designed a multiscale object proposal network (MS-OPN) for proposal generation and an accurate object detection network (AODN) for detecting objects of interest in remote sensing images with large-scale variability.
More recently, several studies have focused on oriented object detection in remote sensing images [34][35][36][37][38][39][40][41][42][43][44]. For example, [34] presented a region of interest (RoI) transformer that applies spatial transformations on RoIs and learns the parameters of transformation under the supervision of oriented annotations. Reference [35] proposed to describe an oriented object by gliding the vertexes of each horizontal bounding box on their corresponding sides, and an obliquity factor based on area ratio was further introduced to remedy the confusion issue. R3Det [37] encodes center and corner information in the features to obtain a more accurate location. Reference [41] presented a dynamic refinement network which enabled neurons to adjust receptive fields according to the shapes and orientations of target objects and refined the prediction dynamically in an object-aware manner. Reference [36] proposed a new rotation detector, named SCRDet, for detecting small, cluttered, and rotated objects in remote sensing images, which alleviated the influence of angle periodicity by designing a novel IoU-Smooth L1 Loss. Reference [39] used image cascade and feature pyramid jointly with multisize convolution kernels to extract multiscale strong and weak semantic features for oriented object detection. Yao et al. [44] proposed a Single-shot Alignment Network (S2A-Net) to alleviate the inconsistency between classification score and localization accuracy, which achieved state-of-the-art performance on two aerial object datasets. To achieve better detection speed, [42] used a set of default boxes with various scales like SSD to predict oriented bounding boxes. Reference [43] defined a rotatable bounding box to predict the exact shape of objects for detecting vehicles, ships, and airplanes, showing superior capability of locating multiangle objects.
Also, some methods were proposed for weakly supervised object detection (WSOD) in remote sensing images [21,23,[45][46][47][48]. For instance, [21] proposed a coupled weakly supervised learning framework for aircraft detection. Reference [45] proposed a WSOD framework based on dynamic curriculum learning to progressively train object detectors by feeding training images with ascending difficulty. Reference [46] proposed a new progressive contextual instance refinement (PCIR) method to perform WSOD in remote sensing images.

Attention Mechanism.
Feature-based attention has proved its effectiveness in many computer vision tasks as a perception-adapted mechanism [49]. For instance, the Squeeze-and-Excitation network (SENet) [50] proposed by Hu et al. adaptively recalibrates channel relationships by global information embedding and fully connected (FC) layers. Reference [51] computed weights from nonlocal and local pixels/features as the spatially refined representations. Reference [52] achieved domain attention by a series of universal adaptation layers, following the principle of squeeze and excitation. For the task of object detection in remote sensing images, [22] put forward an inception fusion strategy as well as pixel-wise and channel-wise attention for small object detection in aerial images. Reference [26] inserted a SENet block into the top layer of FPN to model the relationship of different feature channels. Inspired by Mask R-CNN, [40] proposed a refined FPN and multilayer attention network for oriented object detection in remote sensing images.

Review of Faster R-CNN.
Faster R-CNN, proposed by Ren et al. [14], is an efficient two-stage detection algorithm, which consists of two main branches, namely, RPN and Fast R-CNN. In the first stage, RPN generates a set of anchor boxes with predefined scales and aspect ratios at each feature map location, followed by two sibling fully connected layers, one for object classification and one for bounding box regression. In the second stage, a ROI pooling layer is employed to obtain fixed-size outputs for each region proposal before classification and bounding box refinement. The two stages are integrated by several shared convolution layers and can be trained and tested end to end.
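As a minimal sketch of how anchor boxes with predefined scales and aspect ratios can be laid out at a single feature map location, consider the following plain-Python example. The function name `make_anchors`, the scale values, and the ratio values are illustrative assumptions and are not taken from the configuration used in this paper; here each anchor is assumed to have area `scale**2` and height-to-width ratio `ratio`.

```python
import math

def make_anchors(cx, cy, scales, ratios):
    """Return (x1, y1, x2, y2) anchors centred at (cx, cy).

    Assumption: each anchor has area scale**2 and height / width == ratio.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)   # wider boxes for ratio < 1
            h = s * math.sqrt(r)   # taller boxes for ratio > 1
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

# Hypothetical values: two scales and three aspect ratios give six anchors.
anchors = make_anchors(cx=100.0, cy=100.0, scales=(32, 64), ratios=(0.5, 1.0, 2.0))
```

In a full RPN the same set of anchors is repeated at every location of the feature map, and the two sibling layers score and refine each of them.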

3.2. Overview of Feature Enhancement Network (FENet). The architecture of our proposed Feature Enhancement Network (FENet) for object detection in remote sensing images is illustrated in Figure 2. Building on the popular Faster R-CNN with FPN and adopting the backbone of ResNet-101, our FENet mainly consists of a Dual Attention Feature Enhancement (DAFE) module and a Context Feature Enhancement (CFE) module. The DAFE module is used to guide the FPN to focus on the distinctive features of the objects of interest and suppress useless ones by using the Dual Attention Fusion (DAF) to jointly reweight the spatial and channel feature responses. The CFE module is used to selectively strengthen class-aware features by leveraging image-level contextual information that indicates the presence or absence of the object classes. The feature representations of the CFE module are concatenated with each ROI feature to make per-proposal predictions. To this end, we employ a context encoding loss to regularize the model training, which forces the network to learn the global semantic information by predicting the presence of the object classes in the images, thus helping the object detector better understand the images for classification and bounding box regression.

Dual Attention Feature Enhancement (DAFE).
The CNN has shown powerful ability in feature extraction and representation with a large number of parameters. Low-level layers in the CNN architecture contain a large amount of detailed information such as edges and boundaries. As the network goes deeper, the high-level feature representations carry less location information and become specialized in semantic information. How to obtain and choose more discriminative features determines the detection performance. To this end, a Dual Attention Feature Enhancement (DAFE) module is constructed to prompt the network to focus on the distinctive features and suppress the redundant ones that are not useful for the current task by jointly recalibrating the spatial and channel feature responses, as shown in Figure 3. Specifically, in the spatial dimension, we use the nonlocal building block [51] to acquire spatial dependencies in the whole feature map. As for the channel dimension, we select the SE block [50], which models the channel relationship explicitly from inherent feature maps and so can be directly applied to existing state-of-the-art CNN architectures. These two kinds of attention are carried out in parallel and then fused. Next, we briefly introduce the nonlocal block [51] and the SE block [50]. The nonlocal block was designed to capture long-range dependencies through a nonlocal operation which calculates the new feature response of each position as a weighted sum of the original features of all positions [51].
Specifically, given an input feature $x$, the output feature $z$ of a nonlocal block is computed as follows:

$$z_i = W_z y_i + x_i, \tag{1}$$

where $W_z$ is the weight matrix that is implemented as a 1×1 convolution, "$+x_i$" represents a residual connection which makes it possible to insert a new nonlocal block into any pretrained CNN model without breaking its initial behavior, and $y_i$ is the output of the nonlocal operation, of the same size as $x$, which is defined in the following equation:

$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j), \tag{2}$$

where $C(x)$ is the normalization factor set as $C(x) = \sum_{\forall j} f(x_i, x_j)$, $i$ is the index of an output position of the features, and $j$ is the index enumerating all possible positions. $f$ is a pair-wise function used to calculate a scalar that represents the relationship between $x_i$ and each $x_j$. The function $g$ is used to compute the embedding of the input signal at position $j$ by using $g(x_j) = W_g x_j$, with $W_g$ being a 1×1 convolutional operation. In this paper, we use the embedded Gaussian function as the pair-wise function for the computation of the relationship scalar, as defined in Equation (3):

$$f(x_i, x_j) = e^{\theta(x_i)^{T} \phi(x_j)}, \tag{3}$$

where $\theta(x_i) = W_\theta x_i$ and $\phi(x_j) = W_\phi x_j$ are two embeddings computed through the 1×1 convolutional filters $W_\theta$ and $W_\phi$. The nonlocal module is inserted at the end of a convolutional stage of ResNet-101 in our experiments, and we investigate the results of using the nonlocal block in different combinations of stages.
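The nonlocal operation described above can be traced on a toy input with a few lines of plain Python. This is only an illustrative sketch and not the implementation used in the experiments: the feature map is assumed to be flattened into a list of positions, each 1×1 convolution is represented by a small weight matrix applied per position, and the function names `matvec` and `nonlocal_block` as well as the toy weights are hypothetical.

```python
import math

def matvec(W, v):
    """Apply a 1x1 convolution at one position, i.e. a matrix-vector product."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def nonlocal_block(x, W_theta, W_phi, W_g, W_z):
    """x is a list of positions; each position is a list of channel values."""
    theta = [matvec(W_theta, xi) for xi in x]
    phi = [matvec(W_phi, xj) for xj in x]
    g = [matvec(W_g, xj) for xj in x]
    z = []
    for i, xi in enumerate(x):
        # Embedded Gaussian pair-wise function between position i and every j.
        f = [math.exp(sum(a * b for a, b in zip(theta[i], phi_j))) for phi_j in phi]
        c = sum(f)  # normalization factor C(x)
        # Weighted sum of the embeddings g(x_j) over all positions j.
        y_i = [sum(f[j] * g[j][k] for j in range(len(x))) / c
               for k in range(len(g[0]))]
        # 1x1 convolution W_z followed by the residual connection "+ x_i".
        z.append([a + b for a, b in zip(matvec(W_z, y_i), xi)])
    return z

identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [[1.0, 0.0], [0.0, 1.0]]  # two positions, two channels

# Zero theta/phi weights give uniform attention: y_i is the mean of g(x_j).
z_uniform = nonlocal_block(x, zeros, zeros, identity, identity)
# A zero W_z leaves the input unchanged, which is why the block can be
# inserted into a pretrained model without breaking its initial behavior.
z_residual = nonlocal_block(x, identity, identity, identity, zeros)
```

A practical implementation would operate on batched tensors and typically subtract the row maximum before exponentiation for numerical stability; both are omitted here for brevity.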
The SE block can be embedded into any regular CNN architecture with the operations of embedding global information and recalibrating channel-wise dependencies. First, global average pooling is applied over the spatial dimensions to generate a $K \times 1 \times 1$ vector $z$, in which the $k$th element of $z$ is defined as

$$z_k = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_k(i, j), \tag{4}$$

where $K$ is the depth of the feature map, $H \times W$ is its spatial size, and $x_k(i, j)$ is the value of the $k$th channel at position $(i, j)$ of the input feature map. Then, two FC layers follow to recalibrate the channel dependencies, and a sigmoid activation function is employed to learn nonlinear relationships:

$$s = \mathrm{sigmoid}\left(W_2\, \sigma(W_1 z)\right), \tag{5}$$

where $W_1$ and $W_2$ are the weights of the two FC layers and $\sigma(\cdot)$ denotes the ReLU function. Finally, the output feature map is obtained by channel-wise multiplication of the input feature map with $s$. Similar to the nonlocal module, the SE block is also added at the end of a convolutional stage to capture channel-wise responses and highlight discriminative features.
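The squeeze, excitation, and rescaling steps of the SE block can be sketched as follows. This is an illustrative plain-Python example under simplifying assumptions (no bias terms, a hidden layer with a single unit, hand-picked weights); the function name `se_block` and all numeric values are hypothetical and do not reflect the parameter setting of [50].

```python
import math

def se_block(x, W1, W2):
    """x holds K channels, each an H x W grid given as a list of rows."""
    # Squeeze: global average pooling over the spatial dimensions.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation: first FC layer followed by ReLU ...
    hidden = [max(0.0, sum(w * a for w, a in zip(row, z))) for row in W1]
    # ... then second FC layer followed by a sigmoid, one weight per channel.
    s = [1.0 / (1.0 + math.exp(-sum(w * a for w, a in zip(row, hidden))))
         for row in W2]
    # Rescale: channel-wise multiplication of the input with the weights s.
    out = [[[s[k] * v for v in row] for row in ch] for k, ch in enumerate(x)]
    return z, s, out

x = [[[1.0, 3.0], [5.0, 7.0]],   # channel 0, mean 4.0
     [[2.0, 2.0], [2.0, 2.0]]]   # channel 1, mean 2.0
W1 = [[1.0, -1.0]]               # K = 2 channels reduced to 1 hidden unit
W2 = [[0.0], [1.0]]              # 1 hidden unit expanded back to K = 2
z, s, out = se_block(x, W1, W2)
```

With these toy weights the first channel is scaled by 0.5 and the second by roughly 0.88, which illustrates how the block reweights channels independently of their spatial layout.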
Nevertheless, what is the best arrangement for these two blocks in the network? Reference [50] also suggests that the importance of feature channels tends to share a similar weight when using SE block in low-level features, while in high-level features, the importance of each channel becomes more class-specific. To thoroughly investigate this problem, we deployed these two blocks in different residual stages of ResNet [53], respectively, and evaluated their performances by using different combinations, and the results of various combinations can be found in Section 4. Although there are small gaps between different results, we observe that the

Context Feature Enhancement (CFE).
Finally, we propose a novel Context Feature Enhancement (CFE) module that utilizes task-specific features and scene semantics generated from hierarchical feature layers. Since high-level features have more semantic information while low-level features contain specific geometric information such as context and edges, they complement each other well in the object detection task. In this model, we integrate the multilevel feature maps to obtain both high-level semantic features and low-level detailed features, which can guide the object category classification and location reasoning in a global manner.
More specifically, we empirically set the downsample rate to 16 to preserve some localization information. Max pooling is used for P2 and P3, and a nearest upsampling operation is used for P5, ensuring the consistency of spatial scale. With the above approach, diverse feature representations from different levels can be aggregated. Then, two additional fully connected convolutional layers with a sigmoid activation function are added on top of the fused features to predict the confidence of object categories in the remote sensing scene, and the binary cross-entropy loss is adopted for training. This auxiliary branch handles the multilabel classification task through an intermediate feature map, thus providing the basic classifiers with global and local knowledge of contextual clues that are correlated with the region of interest. Object category prediction is typically achieved by computing softmax probabilities, which is not suitable for this task because several object classes can be present in the same image. As a consequence, we adopt the sigmoid cross-entropy loss to measure the probability error, in which each class is independent and not mutually exclusive. Specifically, given an input image $X \in \mathbb{R}^{3 \times H \times W}$, the ground-truth label can be denoted as a vector $y = [y_1, \cdots, y_C]^{T}$, where $C$ is the total number of object categories. $y_i$ is set to 1 if objects in image $X$ correspond to class $i$; otherwise it is set to 0, where $i \in \{1, \cdots, C\}$. We represent the predicted class score vector of image $X$ as $p = [p_1, \cdots, p_C]^{T}$, and over all the training images, indexed by $j$, the multilabel classification loss is calculated by

$$L_{CFE} = -\sum_{j} \sum_{i=1}^{C} \left[ y_i^{j} \log p_i^{j} + \left(1 - y_i^{j}\right) \log\left(1 - p_i^{j}\right) \right], \tag{6}$$

where $y_i^{j}$ and $p_i^{j}$ are the label and the predicted score of class $i$ for the $j$th image. What is more, [54] has demonstrated that the multilabel classification task based on CNN features retains coarse localization information of objects without using any bounding box annotations.
Inspired by this, we aggregate the features obtained by the CFE module with the box prediction head, which provides not only global and local context information for object category reasoning but also localization information for bounding box regression. In our method, the context feature maps are downsampled to 7×7 to match the resolution of region proposals after ROI pooling. Then, we concatenate the context features with the ROI features and apply a 1×1 convolution operation to reduce the channel dimensions while strengthening the informative representations, which can be seen as a complement to the region proposal detection task. Let $L_{cls}$ denote the object category classification loss and $L_{reg}$ denote the bounding box regression loss. Finally, the loss function can be defined as

$$L = L_{cls} + L_{reg} + \lambda L_{CFE}, \tag{7}$$

where $\lambda$ is a hyperparameter that controls the weight of $L_{CFE}$. In Section 4, we discuss the choice of $\lambda$ in detail.
To sum up, the DAFE and CFE modules complement each other well to some extent. The features with poor positioning ability and poor discrimination can be enhanced by contextual information, while features with good discrimination are guaranteed not to be significantly weakened.
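To make the role of the context encoding loss concrete, the following sketch computes a sigmoid (binary) cross-entropy over image-level class labels and combines it with the detection losses through the weight λ. It is a hedged illustration only: the per-image loss is summed over classes (whether the original implementation sums or averages is an assumption), the clamping constant `eps` is added for numerical safety, and the function names and all numeric values, including `lam=0.5`, are hypothetical.

```python
import math

def context_encoding_loss(p, y, eps=1e-12):
    """Binary cross-entropy for one image, summed over the object classes.

    p: predicted presence probabilities (sigmoid outputs), one per class.
    y: image-level labels, 1 if the class appears in the image, else 0.
    """
    loss = 0.0
    for pi, yi in zip(p, y):
        pi = min(max(pi, eps), 1.0 - eps)  # clamp to avoid log(0)
        loss -= yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
    return loss

def total_loss(l_cls, l_reg, l_cfe, lam):
    """Detection losses plus the context encoding loss weighted by lam."""
    return l_cls + l_reg + lam * l_cfe

# Toy image with three classes: classes 1 and 3 are present, class 2 is not.
l_cfe = context_encoding_loss(p=[0.5, 0.5, 0.5], y=[1, 0, 1])
loss = total_loss(l_cls=1.0, l_reg=2.0, l_cfe=4.0, lam=0.5)
```

Because each class is scored by an independent sigmoid, several classes can be predicted as present at once, which a softmax over classes would not allow.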

Experiments
In the following section, we first present the implementation details of DAFE and CFE and conduct an ablation study on the newly published datasets DIOR [3] and DOTA [8].

Datasets and Evaluation Metrics.
In this paper, we perform our experiments on two large-scale remote sensing datasets, DIOR [3] and DOTA [8]. The former consists of 23463 optical remote sensing images and covers 20 categories. It involves 192472 manually labelled instances with axis-aligned boxes, following an annotation format similar to that of PASCAL VOC. The images in the DIOR dataset have a size of 800 × 800 and vary in spatial resolution from 0.5 m to 30 m. We take 11725 images from the train and validation splits for training and the remaining 11738 images for testing. The latter contains 2806 aerial images from various sensors and 15 common object categories. The fully annotated DOTA images contain 188282 instances labeled by arbitrary quadrilaterals, and the image size of the DOTA dataset is large, ranging from 800 × 800 to 4000 × 4000 pixels. We use the training and validation sets for training and the rest for testing. The detection accuracy is obtained by submitting testing results to DOTA's evaluation server. All the object categories of these two datasets are reported in Table 1.
We adopt the mean Average Precision (mAP) as the evaluation metric for our experiments, and the computation of mAP follows the metric definition of the PASCAL VOC 2007 object detection challenge.
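For reference, the PASCAL VOC 2007 protocol computes the Average Precision of a class with 11-point interpolation over the precision-recall curve, and mAP is the mean of the per-class values. The sketch below illustrates this computation in plain Python; the function names and the toy precision-recall points are hypothetical, and the matching of detections to ground-truth boxes that produces the curve is assumed to have been done beforehand.

```python
def voc07_ap(recalls, precisions):
    """11-point interpolated Average Precision (PASCAL VOC 2007 style).

    recalls and precisions are paired points of one class's PR curve.
    """
    ap = 0.0
    for k in range(11):
        t = k / 10.0  # recall thresholds 0.0, 0.1, ..., 1.0
        # Interpolated precision: best precision at any recall >= t.
        candidates = [p for r, p in zip(recalls, precisions) if r >= t]
        ap += (max(candidates) if candidates else 0.0) / 11.0
    return ap

def mean_ap(per_class_ap):
    """mAP is the unweighted mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)

# Toy curve: precision 1.0 up to recall 0.5, then 0.5 at full recall.
ap_toy = voc07_ap(recalls=[0.5, 1.0], precisions=[1.0, 0.5])
map_toy = mean_ap([ap_toy, 1.0])
```

In the toy curve, six of the eleven thresholds see precision 1.0 and five see 0.5, so the AP is 8.5/11.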

Implementation Details.
Our experiments are performed under the framework of PyTorch and based on the Faster R-CNN with FPN [55]. ResNet-101 is adopted as the backbone network. We train for 12 epochs on an NVIDIA Titan Xp GPU with a batch size of 2. The initial learning rate is set to 0.0025 with a learning rate decay of 0.1 at the end of epoch 8 and epoch 11. The momentum is 0.9, and the weight decay is set to 0.0005. During the training process, horizontal flip data augmentation is used in the end-to-end procedure with the stochastic gradient descent (SGD) optimizer. The parameter setting of the SE block is the same as in [50].
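The step schedule described above can be written down as a small function. This is a minimal sketch assuming 1-indexed epochs and no warm-up phase (the text does not state whether warm-up is used); the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=0.0025, milestones=(8, 11), gamma=0.1):
    """Learning rate in effect during a 1-indexed training epoch.

    The rate is multiplied by gamma at the end of each milestone epoch,
    so the decay takes effect from the following epoch onward.
    """
    drops = sum(1 for m in milestones if epoch > m)
    return base_lr * gamma ** drops

# Schedule over the 12 training epochs.
schedule = [learning_rate(e) for e in range(1, 13)]
```

Under these assumptions, epochs 1 to 8 use 0.0025, epochs 9 to 11 use 0.00025, and epoch 12 uses 0.000025.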
For the images in the DIOR dataset, we keep the original size of 800 × 800 for training and testing. With regard to the DOTA dataset, we crop the original images in the DOTA dataset into 1024 × 1024 patches. The stride of cropping is set to 824; that is, the pixel overlap between two adjacent patches is 200. As commonly used in object detection, ResNet-101 network is pretrained on the ImageNet [56] and fine-tuned on the aforementioned training set.
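The cropping scheme for DOTA can be illustrated by computing the top-left offsets of the patches along one image axis. The sketch below is an assumption-laden example rather than the preprocessing script used in the experiments: in particular, shifting the last patch back so that it ends at the image border is one common convention that the text does not specify, and the function name `patch_origins` is hypothetical.

```python
def patch_origins(length, patch=1024, stride=824):
    """Start offsets of sliding-window patches along one image axis."""
    if length <= patch:
        return [0]  # the image is no larger than one patch
    origins = list(range(0, length - patch + 1, stride))
    if origins[-1] + patch < length:
        # Assumption: add a final patch aligned with the image border
        # so that no pixels are left uncovered.
        origins.append(length - patch)
    return origins

overlap = 1024 - 824              # pixel overlap between adjacent patches
origins_4000 = patch_origins(4000)
origins_800 = patch_origins(800)
```

Applying the same offsets along both axes yields the grid of 1024 × 1024 patches; detections on the patches then need to be mapped back to the original image coordinates before evaluation.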

Experimental Results.
We evaluate our model on the test sets of the DIOR and DOTA datasets and compare it with the state-of-the-art methods. The experiments are implemented on mmdetection [57] to make a fair comparison, except for CornerNet [58]. As shown in Table 2, on the DIOR dataset, our method achieves 68.3% mAP and outperforms the baseline Faster R-CNN with FPN by 3.2%, which demonstrates its effectiveness for object detection in remote sensing images. Our method shows competitive performance compared to state-of-the-art methods like Libra R-CNN and CornerNet. Moreover, CornerNet performs better on large objects such as airport, expressway service area, and overpass, while it struggles with small and crowded objects including ships and vehicles. As for individual class predictions, we notice that the AP values of the classes airplane, basketball court, ship, tennis court, vehicle, and windmill show only little improvement. We analyze the reasons as follows. For the ship and vehicle categories, although many instances are available, they occupy a relatively small proportion of the entire image, so serious information loss occurs after downsampling by the backbone network. This makes feature extraction and further enhancement difficult, so the improvement is not obvious. In contrast, for the golf field and ground track field categories with large object sizes, Besides, for the classes of airplane, basketball court, tennis court, and windmill, the experimental results are closely related to their characteristics. Specifically, the airplane category exhibits large differences in scale, the appearances of tennis courts and basketball courts are similar and easily confused, and the windmill category suffers from shadow interference. These factors undoubtedly increase the difficulty of detection.
As for the detection results on the DOTA dataset (see Table 3), our proposed FENet once again achieves the highest accuracy, namely, 74.89% mAP, which outperforms the baseline FPN by 2.89%. These promising results are closely related to the proposed DAFE and CFE modules, which enhance the capability of capturing task-related features and balancing global and local information and thus perform well in most object categories. Furthermore, Table 4 presents the running time of FENet on different datasets for a test image of given size. The running time is tested on an NVIDIA Titan Xp GPU with a batch size of 1. It can be seen that the proposed approach maintains a fast inference speed while achieving high detection accuracy.
Figure 4 illustrates some test samples and the corresponding detection results on the DIOR dataset. As can be seen, our proposed method is suitable for some small-sized and medium-sized objects, such as vehicles, ships, and storage tanks, indicating the helpful guidance provided by contextual information. More specifically, these objects are usually crowded together and cannot be easily distinguished. The low-level detailed features can provide some localization information, while the high-level semantics facilitate the object reasoning. In addition, compared with the state-of-the-art methods, the proposed FENet also achieves robust detection performance in the object categories with large scale variation, such as baseball field, ground track field, harbor, and stadium. Although the objects in each of these classes present different visual appearances, they may share some common contextual clues to some extent, resulting in relatively stable detection performance.

Ablation Study.
To evaluate the effectiveness of the proposed DAFE and CFE modules, we conduct a series of experiments on the DIOR dataset in this section. The impact of different components on detection performance is presented in Table 5. As can be seen from the 3rd to 7th rows, using the nonlocal module makes no apparent difference whether it is applied in a single stage or in combined stages. The model with the spatial attention mechanism achieves the highest accuracy in two cases: when all stages are used and when only stage 4 is used. According to the "channel" row of Table 5, the results fluctuate very little across the different groups of stages that use SE blocks. Adding an SE block to all the convolutional stages yields a 1% improvement. In contrast, applying an SE block only to convolutional stage 5 achieves the highest performance, a 1.7% increase over the baseline result. It is worth noting that the table does not show the results of further combinations of attention blocks at different stages (i.e., two stages and three stages), because these do not lead to significant performance improvement and sometimes even perform worse. One possible reason is that the emphasized features from different levels are not properly refined. Consequently, the Context Feature Enhancement module is designed, in which we associate multilevel features to accomplish this goal.
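For readers unfamiliar with the SE block used as the channel branch above, the following is a minimal sketch of channel recalibration in plain Python. It is not the authors' implementation: biases are omitted, the weights are hypothetical, and a real detector would apply this to tensors produced by a backbone stage.

```python
import math


def se_recalibrate(feature_map, w_reduce, w_expand):
    """Channel recalibration in the style of a squeeze-and-excitation block.

    feature_map: list of C channels, each an H x W list of lists.
    w_reduce:    (C // r) x C weight matrix of the bottleneck layer.
    w_expand:    C x (C // r) weight matrix restoring the channel count.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation: FC -> ReLU -> FC -> sigmoid yields a gate in (0, 1) per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w_reduce]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w_expand
    ]
    # Scale: every spatial position of a channel is multiplied by its gate.
    scaled = [
        [[value * gate for value in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]
    return scaled, gates


# Toy input: 2 channels of size 2 x 2, reduction ratio r = 2 (hypothetical weights).
features = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[4.0, 4.0], [4.0, 4.0]],
]
w_reduce = [[0.5, 0.5]]     # 1 x 2
w_expand = [[1.0], [-1.0]]  # 2 x 1
scaled, gates = se_recalibrate(features, w_reduce, w_expand)
```

With these toy weights the first channel receives a gate close to 1 and the second a gate close to 0, which is the "emphasize or suppress a channel" behaviour that the ablation toggles per stage.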
To further investigate how different combinations of the spatial and channel methods affect the final results, we compare the use of a single stage with the use of multiple stages in the last two rows of Table 5. The results reveal that DAFE achieves the best performance of 67.7% mAP when we use the nonlocal module in stage 4 and the SE block in stage 5. However, the model with the nonlocal module and SE block applied to all stages only achieves 67.0% mAP. Figure 5 gives several visualization results of DAFE. The first column shows the original images; the second column shows the feature enhancement in the spatial dimension; the third column shows the feature enhancement in the channel dimension; the fourth column corresponds to the total feature enhancement of DAFE; the last column shows the corresponding detection results.
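The spatial branch referred to above as the nonlocal module can be illustrated with a stripped-down sketch. It assumes the embedded-Gaussian form of the non-local operation and, for brevity, replaces the learned projections of a real non-local block with identity mappings, so it should be read as an illustration of the idea rather than as the module used in FENet. The function name is hypothetical.

```python
import math


def nonlocal_attention(positions):
    """Non-local operation with identity projections and a residual connection.

    positions: list of N feature vectors, one per spatial location (H*W flattened).
    Returns, for each location i, x_i + sum_j softmax_j(x_i . x_j) * x_j.
    """
    outputs = []
    for x_i in positions:
        # Pairwise similarity between location i and every location j.
        scores = [sum(a * b for a, b in zip(x_i, x_j)) for x_j in positions]
        # Numerically stable softmax over all locations.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the features of all locations.
        aggregated = [
            sum(w * x_j[k] for w, x_j in zip(weights, positions))
            for k in range(len(x_i))
        ]
        # Residual connection keeps the original response.
        outputs.append([a + b for a, b in zip(x_i, aggregated)])
    return outputs


# Two identical locations attend to each other equally, so each output doubles.
same = nonlocal_attention([[1.0, 2.0], [1.0, 2.0]])

# A location that differs from the others mostly attends to itself.
mixed = nonlocal_attention([[3.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
```

Because every location attends to every other location, the cost grows quadratically with the number of spatial positions, which is one practical reason for restricting such a module to a single, lower-resolution stage.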
Furthermore, we also examine how the choice of λ contributes to the detection results. The ablation study is mainly conducted from the following aspects: (1) The individual impact of CFE on the baseline. In Table 6, we compare values of λ ranging from 1 to 10. The results suggest that the performance improves by approximately 1% on average when the contextual information provided by the CFE module is included. When λ takes the value of 5, we obtain the highest performance, up to a 1.5% improvement compared to the baseline. (2) The interaction between CFE and DAFE. From Table 6, we notice that the overall method shows no improvement when λ = 1. Then, we change the choice of λ and find better results at 5 and 10. The method also shows little improvement when we further enlarge the hyperparameter λ. This indicates that there is an imbalance between the losses. When λ is too small, CFE hardly contributes contextual information to the network. When λ is too large, L_cls and L_reg can be overwhelmed. We also find that CFE has a significant effect on classes 5, 6, 9, 12, 15, and 16 of the DIOR dataset, which are typical objects with great scale variations. This indicates that CFE can learn common contextual clues of certain object categories and guide the network to infer reliable possibilities. Besides, the experiments also demonstrate that the two components of the proposed network are complementary to each other.
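The balance between losses discussed above can be made concrete with a small sketch. It assumes that the total loss has the additive form L_cls + L_reg + λ · L_context and that the context encoding loss is a binary cross-entropy over image-level class-presence predictions. Both are assumptions made for illustration, and all loss values and logits below are hypothetical.

```python
import math


def context_encoding_loss(presence_logits, presence_targets):
    """Mean binary cross-entropy over image-level class-presence predictions."""
    losses = []
    for logit, target in zip(presence_logits, presence_targets):
        prob = 1.0 / (1.0 + math.exp(-logit))
        prob = min(max(prob, 1e-12), 1.0 - 1e-12)  # clamp to avoid log(0)
        losses.append(-(target * math.log(prob) + (1 - target) * math.log(1.0 - prob)))
    return sum(losses) / len(losses)


def total_loss(l_cls, l_reg, l_context, lam):
    """Assumed additive combination of detection losses and the context loss."""
    return l_cls + l_reg + lam * l_context


def context_share(l_cls, l_reg, l_context, lam):
    """Fraction of the total loss contributed by the weighted context term."""
    return lam * l_context / total_loss(l_cls, l_reg, l_context, lam)


# Hypothetical values for one training step (not taken from the paper).
l_cls, l_reg = 0.4, 0.3
l_context = context_encoding_loss([2.0, -1.0, 0.5], [1, 0, 1])

# Share of the context term for the values of lambda mentioned in the text.
shares = {lam: context_share(l_cls, l_reg, l_context, lam) for lam in (1, 5, 10)}
for lam, share in shares.items():
    print(f"lambda={lam}: context term is {share:.0%} of the total loss")
```

The share of the context term grows monotonically with λ, which mirrors the argument in the text: a small λ leaves the context signal negligible, while a large λ lets it dominate L_cls and L_reg.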

Conclusion
In this paper, we present a novel approach, FENet, for multiclass object detection in optical remote sensing images, which aims to address the problems of complex background scenarios and sparse object distribution. First, the framework uses a Dual Attention Feature Enhancement module to selectively emphasize informative features from multiple resolutions, thus guiding the network toward robust object detection. Next, a Context Feature Enhancement module is introduced to fully leverage the abundant information present in remote sensing objects. This branch explores both global and local contextual information such as semantics and textures, which bridges the gap between multiscale feature maps. The experiments on the DIOR and DOTA datasets verify its effectiveness and show that our proposed method achieves remarkable performance compared with the state-of-the-art algorithms. In future work, we plan to extend our approach to oriented bounding box detection and to focus on objects with unusual appearances, such as exceptional aspect ratios and scales.

Data Availability
The data of DIOR and DOTA used to support this study are publicly available. The DIOR data can be downloaded from the website https://gcheng-nwpu.github.io/datasets while the DOTA data can be downloaded from the website https://captain-whu.github.io/DOTA/index.html. The code is freely available upon request.