Shearlet-Based Structure-Aware Filtering for Hyperspectral and LiDAR Data Classification

The joint interpretation of hyperspectral images (HSIs) and light detection and ranging (LiDAR) data has developed rapidly in recent years due to continuously evolving image processing technology. Nowadays, most feature extraction methods convolve the raw data with fixed-size filters, so the structural and texture information of objects at multiple scales cannot be sufficiently exploited. In this article, a shearlet-based structure-aware filtering approach, abbreviated as ShearSAF, is proposed for HSI and LiDAR feature extraction and classification. Specifically, superpixel-guided kernel principal component analysis (KPCA) is first adopted on the raw HSIs to reduce the dimensionality. Then, the KPCA-reduced HSI and LiDAR data are converted to the shearlet domain for texture and area feature extraction. In parallel, a superpixel segmentation algorithm is applied to the raw HSI data to obtain an initial oversegmentation map. Subsequently, by utilizing a well-designed minimum merging cost that fully considers spectral (HSI and LiDAR data), texture, and area features, a region merging procedure is gradually conducted to produce a final merging map. Further, a scale map that locally indicates the filter size is obtained by calculating the edge distance. Finally, the KPCA-reduced HSI and LiDAR data are convolved with the locally adaptive filters for feature extraction, and a random forest (RF) classifier is adopted for classification. The effectiveness of our ShearSAF approach is verified on three real-world datasets, and the results show that ShearSAF achieves higher accuracy than the comparison methods, especially under small-size training sample conditions. The codes of this work will be available at http://jiasen.tech/papers/ for the sake of reproducibility.


Introduction
Recently, continuously evolving remote sensing sensor technologies have enabled the capture of multisource data over the same area [1,2]. Among these numerous remote sensing data, hyperspectral images (HSIs) contain joint spectral and spatial information, providing a distinctive discriminating ability for Earth's surface objects. HSIs have hundreds or thousands of narrow spectral bands, covering the spectral region from the visible to the infrared field [3]. In particular, HSIs have both spatial and spectral smoothness, which not only produces detailed and accurate descriptions of objects but also results in a high correlation between adjacent bands [4][5][6]. For these reasons, some obstacles and challenges exist regarding the interpretation of HSI information. Specifically, HSIs are prone to information redundancy as a result of high correlation, and to the Hughes phenomenon caused by a high spectral dimension [7,8]. In addition, environmental factors, such as clouds and noise, also cause information confusion when the remote sensor captures scene data [9]. Compared with HSI, LiDAR integrates a laser ranging system, a global positioning system, and an inertial navigation system, so that it can collect the position and intensity information of objects in three-dimensional space [10][11][12]. However, LiDAR works in a single band and lacks semantic information; thus, it has poor discriminative ability in distinguishing targets with similar heights but different spectra [13].
Many works in the literature have proven the effectiveness of combined HSI and LiDAR interpretation, indicating that the intensity information provided by LiDAR can supplement the HSI deficiencies regarding target height and shape information [5,[14][15][16][17]. In 2013, the HSI and LiDAR data fusion competition, organized by the IEEE Geoscience and Remote Sensing Society, greatly promoted research on HSI and LiDAR data fusion methods for classification [18]. In general, these fusion methods can be roughly divided into three categories: pixel-level fusion, feature-level fusion, and decision-level fusion. Pixel-level fusion relies on concatenating multisource data directly at the raw-data level, which requires geometric registration. Feature-level fusion is considered a better approach in that it achieves better classification performance in most cases [19][20][21][22]; it conducts feature extraction for each source individually and then combines the results. Decision-level fusion methods aim at integrating several rough classification results of multisource data into the final classification [23,24]. Although their computational complexity is relatively low, they rely heavily on the original classification results and integration strategies. In fact, due to the inherent shortcomings of single-level fusion methods, many articles explore classification frameworks that combine feature-level fusion and decision-level fusion [25][26][27].
In addition, it must be mentioned that deep learning-based approaches have also been widely favored in recent years for hyperspectral feature extraction and classification. The pioneering work is the stacked autoencoder proposed by Chen et al., which has been used for hyperspectral high-level feature extraction [28]. Subsequently, convolutional neural networks (CNNs) [29][30][31] and recurrent neural networks (RNNs) [32,33] have been widely developed. Among them, the representative 3D-CNN can effectively extract joint spatial-spectral information and has shown good performance [34]. Furthermore, deep learning-based frameworks that combine traditional features and network structures have also been explored [35,36]. However, the model parameter optimization of these deep learning-based approaches generally relies on a large number of training samples, which greatly limits their applicability due to the difficulty of sample labeling in the remote sensing field. Motivated by the small-sample-set circumstance, some new strategies have been developed. For example, Yu et al. proposed a novel CNN model by combining data augmentation and 1 × 1 convolutional kernels [37]. More recently, semisupervised CNN and PCANet-derivative methods have also received constant attention because of their performance with limited training samples [38][39][40][41].
Alternatively, wavelet analysis is an important mathematical tool because of its near-optimal approximation properties in signal processing. However, in the case of multidimensional images with discontinuity curves, the traditional wavelet loses its sparsity in the edge response [42,43]. Thus, multidimensional directional wavelets are required. Gabor, which is widely used in texture analysis and feature extraction, can be considered an early directional wavelet [44,45]. Its inherent drawback is that the directions on each scale are fixed once sampled. A series of directional wavelets has emerged in the past few decades, such as contourlets, bandlets, curvelets, and shearlets [46][47][48][49]. They all provide a flexible framework in mathematical theory while capturing geometric features in applications. Among these methods, the shearlet possesses remarkable properties: it accurately captures the edge direction, has an optimal sparse representation for multidimensional data, uses a well-organized multiscale structure, and exhibits fast algorithm implementation and efficient calculation [50][51][52]. In its simplest form, the shearlet starts with the construction of a so-called mother function and then adopts three basic operations (scaling, translation, and shearing) to derive functions with more shapes and directions. Two well-known properties of shearlets are highlighted as follows: (1) if a point is far away from an edge, its shearlet coefficients decay rapidly as the scale decreases; (2) if a point is an edge point or a corner point, its shearlet coefficients decay slowly in the normal direction, while they decay rapidly in other directions. Therefore, shearlets have been widely used for edge and corner detection [53][54][55]. In addition, due to its frequency domain division, there are also some pioneering works using it for denoising, feature extraction, and data fusion [56][57][58][59][60].
During the past two decades in the computer vision field, another emerging and rapidly spreading concept is superpixel segmentation. Specifically, a superpixel is considered a homogeneous area containing pixels with similar texture or structure [61][62][63]. The edge of a superpixel is a continuous closed curve, which differs from edge extraction algorithms in which scattered, discontinuous points may exist. Moreover, a superpixel should also possess region compactness, shape regularity, and boundary smoothness [64]. Currently, superpixel algorithms are roughly divided into three categories: cluster-based approaches represented by simple linear iterative clustering (SLIC) and simple noniterative clustering (SNIC) [65,66], graph-based approaches represented by entropy rate superpixel segmentation (ERS) and normalized cuts (NCut) [67,68], and gradient-based methods represented by spatial-constrained watershed (SCoW) and superpixels using the shortest gradient distance (SSGD) [69,70]. Notably, fuzzy superpixels were proposed in the past two years and have further enriched the content of superpixels, especially in cases of low spatial resolution such as in remote sensing images [71,72]. However, regardless of the kind of superpixel algorithm, direct or indirect spatial constraints exist for compactness requirements. For example, SLIC directly adds spatial distance to the clustering metric, and ERS requires each superpixel to be as close to the same size as possible. This spatial constraint inevitably leads to conflicts between oversegmentation and undersegmentation. Moreover, the situation worsens since the resistance of this constraint grows as a power series. In other words, due to larger spatial constraints, a large homogeneous region has to be divided into several small superpixels, while a small area may contain several objects because there is almost no spatial restriction.
Nevertheless, although some heuristic solutions have been proposed for superpixel number selection [73,74], this inherent property of superpixels has not been effectively relaxed.
In particular, filters play an extremely important role in image processing. Many filters exist for different applications, such as the mean filter, Gaussian filter, and Gabor filter for feature extraction [75,76], and the Laplacian, Sobel, and Laplacian of Gaussian (LoG) for edge capture. However, a fixed-size filter cannot obtain the best description of surface objects at various scales, and thus it is undeniably difficult for a uniform-size filter to achieve globally satisfactory results. In other words, near an edge, a larger filter will introduce more confusing information, whereas at the center of a region, a smaller filter hardly works well when abnormal points exist. Some researchers have made attempts in this area, such as the multiscale spectral-spatial classification method with adaptive filtering [77][78][79] and the spatial adaptive multiscale filtering technique [80,81]. However, the features or classification results at multiple filtering scales were simply concatenated or combined; it is more desirable to take full advantage of the internal structure of objects and achieve local structure-aware filtering (i.e., automatically adjusting the size of the filter kernel according to the local position).
In this article, we innovatively propose a shearlet-based structure-aware filtering framework, abbreviated as ShearSAF, for HSI and LiDAR data classification with the help of the above tools. First, superpixel-guided kernel principal component analysis (KPCA) is adopted on the raw HSIs for dimension reduction and information concentration, which is greatly helpful for subsequent calculation and processing. Then, the shearlet transform is applied to the KPCA-reduced HSI and LiDAR data, and a structural description in the frequency domain is achieved; i.e., the low-frequency and high-frequency parts contain the region information and texture information, respectively. They are further processed by energy superposition and time-frequency conversion to attain region features and texture features. Second, a gradual region merging procedure is developed to alleviate the superpixel spatial constraints and enhance the robustness of the proposed ShearSAF method. Specifically, the SNIC superpixel algorithm acts on the raw HSI to produce a deliberately severe oversegmentation and ensure the homogeneity of each small region; then, the superpixels are progressively combined according to a well-designed merging criterion that takes the spectral information, region information, and texture information into account, eventually yielding the final merging map. Third, by locating the edges in the final merging map and calculating the shortest distance between each pixel and the edge, we obtain a distance map that reflects the relationship between points and edges. Through geometric optimization and threshold processing, a scale map is then extracted to control the filter size, in which each point value indicates the size of the convolution kernel. Thus, an adaptive-size filter for each spatial pixel is obtained. Finally, the KPCA-reduced HSI and LiDAR data are convolved with these structure-aware mean filters for feature extraction and classification to verify the effectiveness of the proposed ShearSAF approach.
For a better description, the detailed process of our ShearSAF is shown in Figure 1, and the main contributions of this article are summarized as follows: (i) First of all, we design a structure-aware filtering scheme for HSI and LiDAR feature extraction. These locally adaptive-size filters have a small kernel near an edge to protect the information from being disturbed by nearby objects, while the kernel size is larger at the center of an area to filter out noise and abnormal points. Since the structure-aware filter size reflects the spatial structure of objects more precisely, the convolutional procedure can be more elegant and the discriminability of the extracted features can be promoted. To the best of our knowledge, this is the first time a method of extracting structure-aware features for HSI and LiDAR data processing is discussed. (ii) Second, the shearlet transform is employed for structure description on both HSI and LiDAR data. After dividing the shearlet features into low-frequency and high-frequency energy parts, they are converted into the area feature and texture feature, respectively. Since the local region and edge structure of objects can be well characterized by the extracted area and texture features, both features are taken into account (conventional methods use only one of them, for edge detection or noise reduction), which provides valuable guidance for the subsequent superpixel fusion. (iii) Third, we propose an elaborate design that effectively combines HSI and LiDAR information in a distance measurement for gradual region merging.
The well-designed evaluation criterion integrates the spectral (from a Euclidean distance viewpoint), area, and texture (from a statistical distance viewpoint) features, which provides a more comprehensive description of the real scene and could greatly increase the structural representation ability of objects. (iv) Finally, all the parameters can be either preset and kept unchanged for different HSI and LiDAR datasets or heuristically determined; therefore, the robustness and generalization ability of the proposed ShearSAF method can be ensured. Meanwhile, since the structure-aware feature extraction procedure is unrelated to the training samples and only needs to be calculated once, our ShearSAF also has high efficiency. The source codes of this work will be available at http://jiasen.tech/papers/ for the sake of reproducibility. We would like to point out that the structure-aware filtering design presented here is essentially very general and can easily be utilized for other features (such as morphological and attribute features). Experimental results compared with several state-of-the-art methods on three real datasets demonstrate the effectiveness of the proposed ShearSAF approach.
The organization of this paper is as follows. Section 2 introduces related works about shearlet transform, SNIC superpixel algorithm, and KPCA. Section 3 describes the process of designing the shearlet-based structure-aware filtering in detail. Section 4 presents the experimental data and two ablation experiments. The experimental results of the proposed ShearSAF method with a number of alternatives are given in Section 5. Finally, Section 6 provides the conclusions of this paper.

Related Works
This section introduces the theory of shearlet transform, SNIC superpixel algorithm, and KPCA.

Shearlet Transform.
Shearlets have received great attention for their optimal approximation properties in representing images and were first introduced in [43,82]. The shearlet is regarded as a multiscale representation system and possesses the ability to capture directional and geometric features [50,53]. Suppose there exist a dilation matrix A_a and a shear matrix S_s,

A_a = [ a, 0 ; 0, √a ],  S_s = [ 1, s ; 0, 1 ],

where a ∈ ℝ₊ and s ∈ ℝ are called the dilation factor and shear factor, respectively. The shearlet mother function is expressed as

ψ_{a,s,t}(x) = a^{−3/4} ψ( A_a^{−1} S_s^{−1} (x − t) ),

where t ∈ ℝ² is a translation factor and x ∈ ℝ² is the coordinate in the spatial domain; a and s, respectively, control the scale and orientation of the shearlet. In the frequency domain, the function ψ is written as ψ̂ and can be factorized as

ψ̂(ω₁, ω₂) = ψ̂₁(ω₁) ψ̂₂(ω₂/ω₁),

where ω₁ and ω₂ are the two coordinates in the frequency domain. Consequently, ψ_{a,s,t}(x) in the frequency domain can be expressed as follows:

ψ̂_{a,s,t}(ω) = a^{3/4} e^{−2πi⟨ω,t⟩} ψ̂₁(aω₁) ψ̂₂( a^{−1/2}(ω₂/ω₁ − s) ),

where ω ∈ ℝ² is the coordinate in the frequency domain, which is just the concatenation of ω₁ and ω₂. ψ̂₁ and ψ̂₂ are the continuous wavelet function and bump function, respectively, each meeting certain support-domain conditions.
In fact, the shearlet compact framework in the frequency domain can be divided into three parts: the low-frequency region, the horizontal cone, and the vertical cone. Notably, the above factorization (4) and equation (5) are used for the horizontal cone (in the following, we rename ψ̂_{a,s,t}(ω) in equation (5) as ψ̂^h_{a,s,t}(ω)). Alternatively, in the vertical cone, the roles of ω₁ and ω₂ are exchanged, and the shearlet is denoted as ψ̂^v_{a,s,t}(ω). For the low-frequency domain, the shearlet has neither dilation nor shear, so it can be written as

ψ̂_t(ω) = e^{−2πi⟨ω,t⟩} φ̂(ω),

where φ̂ is called the scaling function. The frequency supports of the low-frequency region, a representative horizontal cone, and a representative vertical cone are shown in Figure 2.
In practice, the above continuous shearlet needs to be discretized for digital image processing. Let us consider a single-band digital image I ∈ ℝ^{X×Y} mapped onto the two-dimensional grid {(x/X, y/Y) | 1 ≤ x ≤ X, 1 ≤ y ≤ Y}; the related parameters a, s, and t can then be discretized on dyadic scales and integer shears (9), where j₀ is the number of scales. By substituting these discretized parameters (9) into equations (5), (7), and (8), the three shearlets in the frequency domain can be rewritten (up to a fixed normalization factor), where ω₁ = −⌊X/2⌋, ⋯, ⌊X/2⌋ − 1 and ω₂ = −⌊Y/2⌋, ⋯, ⌊Y/2⌋ − 1. To avoid overly complicated mathematical formulas and theoretical derivations, we directly give the final expression of the shearlet transform:

SH_{a,s,t}(I) = ifft2( Î(ω₁, ω₂) · ψ̂^d_{a,s,t}(ω₁, ω₂)* ),  d ∈ {low, h, v},

where ifft2 represents the two-dimensional inverse Fourier transform, * denotes complex conjugation, and Î(ω₁, ω₂) is the response of I in the frequency domain. In this equation, the support domains of the first, second, and third parts (d = low, h, v) are the low-frequency region, horizontal cone, and vertical cone, respectively. Additional details of shearlet construction and derivation can be found in [83].

Multichannel SNIC.
SNIC stands for the simple noniterative clustering superpixel algorithm, which has both low computational complexity and good segmentation results [66]. Through parameter control in SNIC, the number of superpixels and the weight of spectral-spatial information can be set manually. In this article, multichannel SNIC is adopted, which is more suitable for multichannel hyperspectral image segmentation. Specifically, SNIC starts from the initialization of the centroids and adds the elements into a priority queue. Next, when an element is taken from the priority queue, the surrounding pixels are labeled and added to the queue. At the same time, the coordinates of the centroid are updated accordingly. This process continues until the queue is empty. The comparison criterion of the priority queue is the distance between element i and centroid j, which is defined as follows:

d_{i,j} = √( γ₁ ‖r_i − r_j‖² + γ₂ ‖x_i − x_j‖² ),

where r and x represent the spectral vector and spatial coordinates, respectively, and γ₁ and γ₂ are their corresponding weights.
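As a minimal illustration, the priority criterion above can be sketched as a single function; the weighted form of d_{i,j} follows the equation in the text, while the function and argument names are ours (the original SNIC paper normalizes the two terms slightly differently).

```python
import numpy as np

def snic_distance(spec_i, pos_i, spec_j, pos_j, gamma1, gamma2):
    """Priority-queue distance between a candidate pixel i and centroid j.

    gamma1 weights the squared spectral gap and gamma2 the squared spatial
    gap, as in the multichannel SNIC criterion sketched in the text.
    """
    d_spec = np.sum((np.asarray(spec_i, float) - np.asarray(spec_j, float)) ** 2)
    d_pos = np.sum((np.asarray(pos_i, float) - np.asarray(pos_j, float)) ** 2)
    return np.sqrt(gamma1 * d_spec + gamma2 * d_pos)
```

In the full algorithm, this value is the sort key of the priority queue, so the pixel most similar to any current centroid (spectrally and spatially) is always labeled next.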
2.3. Superpixel-Guided KPCA. As a generalization of principal component analysis (PCA), KPCA maps the input data into a high-dimensional (Hilbert) space via a mapping function and can well reflect the complex structures of the data in that space [84,85]. However, in the sample selection strategy of the original KPCA, random sampling or using all samples (possibly on the order of millions) usually causes feature degradation or computational explosion, respectively. Therefore, a superpixel-based KPCA scheme that takes advantage of superpixel homogeneity is applied for dimension reduction [86]. Specifically, for a raw HSI R ∈ ℝ^{X×Y×B} (X, Y, and B are the two spatial dimensions and the spectral dimension, respectively), SNIC is applied (here, the number of superpixels is simply set as X + Y) and a superpixel map is obtained. Then, the mean vector of each region is calculated, forming the input sample set r = (r₁, r₂, ⋯, r_{X+Y}). Subsequently, the mapping function Φ converts the low-dimensional input samples into high-dimensional features Φ(r) = (Φ(r₁), Φ(r₂), ⋯, Φ(r_{X+Y})). Let us consider the following covariance matrix:

C = (1/(X+Y)) Φ(r) Φ(r)^T.

The characteristic equation can then be denoted as

C β = β λ,

where λ = diag(λ₁, λ₂, ⋯, λ_{X+Y}) is a diagonal matrix of eigenvalues arranged from large to small and β = (β₁, β₂, ⋯, β_{X+Y}) is the (X+Y) × (X+Y) matrix of corresponding eigenvectors. For convenience of calculation, an ingenious substitution is made for β:

β = Φ(r) α,

where α = (α₁, α₂, ⋯, α_{X+Y}) is the (X+Y) × (X+Y) coefficient matrix, which explains the relationship between β and Φ(r). Next, by multiplying both sides of equation (16) on the left by Φ(r)^T, we obtain

λ K α = (1/(X+Y)) K² α,

where K = Φ(r)^T Φ(r) is known as the kernel matrix. This can be simplified into (X+Y) λ α = K α, which is regarded as a new characteristic equation.
Finally, for each spectral vector R_i of R, the dimensionality reduction process can be expressed as

y_k(R_i) = Σ_{j=1}^{X+Y} α_{j,k} κ(r_j, R_i),  k = 1, 2, ⋯, K,

where κ(·,·) is the kernel function, α_{·,k} is the kth eigenvector of the new characteristic equation, and K represents the reserved dimension.
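A compact sketch of this pipeline under the equations above follows; the RBF bandwidth `sigma` and the omission of test-kernel centering are simplifying assumptions, and the function name is ours, not the paper's implementation.

```python
import numpy as np

def superpixel_guided_kpca(region_means, pixels, n_keep, sigma=1.0):
    """Superpixel-guided KPCA sketch.

    region_means : (n, B) mean spectra of the n superpixels (n = X + Y in the text)
    pixels       : (m, B) spectral vectors to project
    n_keep       : number of reserved dimensions K
    """
    r = np.asarray(region_means, float)
    n = r.shape[0]
    # RBF kernel matrix K[i, j] = exp(-||r_i - r_j||^2 / (2 sigma^2))
    sq = np.sum(r ** 2, axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2 * r @ r.T) / (2 * sigma ** 2))
    # Center the kernel matrix in feature space
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J
    # Solve the characteristic equation (X + Y) * lambda * alpha = K * alpha
    vals, vecs = np.linalg.eigh(Kc)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    # Scale eigenvectors so that projections are consistently normalized
    alpha = vecs[:, :n_keep] / np.sqrt(np.maximum(vals[:n_keep], 1e-12))
    # Project each pixel via its kernel values against every region mean
    p = np.asarray(pixels, float)
    cross = np.sum(p ** 2, 1)[:, None] + sq[None, :] - 2 * p @ r.T
    Kx = np.exp(-cross / (2 * sigma ** 2))
    return Kx @ alpha  # (m, n_keep) reduced features
```

Because the eigenproblem is solved on the (X + Y) region means rather than all pixels, the kernel matrix stays small even for large scenes, which is the point of the superpixel guidance.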

Shearlet-Based Structure-Aware Filtering
We now consider the proposed shearlet-based structure-aware filtering design. Briefly, in order to take full advantage of the structural information of objects, the following idea is followed: when a point approaches an edge, the size of the filter shrinks, while when the point is at the center of a region, the size of the filter enlarges. Our ShearSAF approach obtains this adaptive-size filter in four steps: preprocessing, shearlet-based feature extraction, gradual region merging, and structure-aware filter design. Table 1 summarizes some important mathematical symbols used in this paper for additional clarification.
3.1. Preprocessing. This part mainly includes two aspects: superpixel-guided KPCA for HSI dimension reduction and multichannel SNIC for superpixel oversegmentation.

Superpixel-Guided KPCA for HSI Dimension Reduction.
For high-dimensional HSI data with complex structures, KPCA has superior capabilities for dimensionality reduction. In our superpixel-guided KPCA, the radial basis function κ(r_i, r_j) = exp(−‖r_i − r_j‖² / (2σ²)) is adopted, and 99.5% of the energy is maintained in the principal components. Afterward, the information-focused hyperspectral data H ∈ ℝ^{X×Y×K} is attained.

Multichannel SNIC for Superpixel Oversegmentation.
SNIC is an emerging superpixel algorithm with both low computational complexity and good segmentation results. The multichannel SNIC is applied to the raw HSI data R, and an initial oversegmentation map S_N is obtained, in which the homogeneity of each superpixel can be largely ensured. Instead of directly providing the number of superpixels, the number of pixels inside each superpixel is set as N_p (the value of this parameter will be discussed in the experimental section), and the weight parameters γ₁ and γ₂ are set as 1/B and 0.5 by default, respectively.

3.2. Shearlet-Based Feature Extraction.
In the high-frequency part of the shearlet domain, the transform can effectively capture texture information and is thus used for edge detection and corner detection [55]. Alternatively, in the low-frequency part, it can effectively capture area information and is thus used for denoising [87]. Let us start with the single-band LiDAR data L ∈ ℝ^{X×Y}. The number of scales j₀ in equation (9) is set as 3 by default in our shearlet transform, while the construction of ψ̂₁ (i.e., the Meyer wavelet function) and ψ̂₂ (i.e., the bump function) is the same as in [83].
In the shearlet compact frame, when the scale is 0, 1, and 2, there are 4, 8, and 16 support cones with different directions, respectively (including the horizontal and vertical cones). Among them, the 16 highest-frequency parts are related to the texture information, while the remaining parts can well characterize the area information; therefore, the shearlet-based frequency features are divided into two parts as follows:

SH(L) = { SH_H(L), SH_R(L) },

where SH_H(L) and SH_R(L), respectively, represent the highest-frequency and the remaining frequency information of the LiDAR data L. Furthermore, for the highest-frequency parts SH_H(L), the sum of the absolute coefficients over the 16 directions is used as the texture feature L^E. For the other 13 remaining frequency parts SH_R(L), including the low-frequency region and the remaining high-frequency cones at the two coarser scales, the inversion of the shearlet transform is applied to acquire the area information L^A. They can be computed as follows:

L^E = Σ_{d=1}^{16} | SH_H^d(L) |,  L^A = inv( SH_R(L) ),

where inv is the inversion operator and |·| is the absolute value operator. Correspondingly, for the KPCA-reduced HSI data H = (H₁, H₂, ⋯, H_K), each component undergoes the above frequency separation process, and the results are then concatenated. Therefore, the texture information H^E and area information H^A of the HSI data can be expressed as

H^E = (H^E_1, H^E_2, ⋯, H^E_K),  H^A = (H^A_1, H^A_2, ⋯, H^A_K),

where H^E_i and H^A_i contain the texture and area information of the ith component H_i, respectively. The detailed procedure of shearlet-based feature extraction is displayed in Figure 3.

3.3. Gradual Region Merging.
The oversegmentation map S_N is mapped onto an undirected graph: each superpixel is regarded as a node, and an edge exists only when two superpixels are adjacent. To make the description of the proposed progressive region merging process clearer and more intuitive, it is divided into two parts: merging cost definition and region merging procedure. Figure 4 illustrates the gradual region merging procedure.
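As a rough illustration of this texture/area separation, the sketch below uses a plain radial FFT split in place of the directional shearlet decomposition; `split_texture_area` and the `cutoff` radius are our simplifications, not the paper's actual transform, but the two outputs are formed exactly as L^E (magnitude of the high-frequency content) and L^A (inversion of the remaining low-frequency content).

```python
import numpy as np

def split_texture_area(img, cutoff=0.25):
    """Simplified stand-in for the shearlet frequency separation.

    A real shearlet transform splits the spectrum into directional cones;
    here a radial low/high split illustrates how the texture feature (the
    magnitude of the highest-frequency responses) and the area feature
    (the inverse transform of the remaining low-frequency content) arise.
    """
    F = np.fft.fftshift(np.fft.fft2(img))
    X, Y = img.shape
    fy = np.fft.fftshift(np.fft.fftfreq(X))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(Y))[None, :]
    high = np.sqrt(fx ** 2 + fy ** 2) > cutoff  # assumed high-frequency mask
    # Texture feature: magnitude of the high-frequency content
    texture = np.abs(np.fft.ifft2(np.fft.ifftshift(F * high)))
    # Area feature: inversion of the remaining (low-frequency) part
    area = np.real(np.fft.ifft2(np.fft.ifftshift(F * ~high)))
    return texture, area
```

For the multichannel case, this separation would simply be run per KPCA component and the results concatenated, mirroring H^E and H^A above.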
3.3.1. Merging Cost Definition. It can easily be found that the merging cost between two adjacent regions is related not only to the sizes of the regions and the length of their shared edge but also to the similarity among three different kinds of features: spectral, area, and texture. Suppose there are two adjacent regions m and n in the oversegmentation map S_N; the distance between the two regions in the spectral domain is calculated from the mean gaps and can be defined as follows:

D^S_{m,n} = √( Σ_{i=1}^{K} ( H_i(m) − H_i(n) )² + ( L(m) − L(n) )² ),

where H_i(m) and L(m) represent the mean value of region m in the ith band of the KPCA-reduced HSI data H and in the LiDAR data L, respectively.
On the other hand, for the area information (H^A, L^A) and texture information (H^E, L^E), it is necessary to adopt a statistical manner to measure the region distance, since all the area and texture features are extracted in the frequency domain. Taking the LiDAR texture information L^E as an example, it is first normalized into the interval [0, 256] for convenience. Then, we select r interval endpoints (x₁ < x₂ < ⋯ < x_r, including 0 and 256) to divide the whole interval into r − 1 parts of the same length. These r endpoints are considered as the r bins of a histogram, and the count of the ith bin in region m is denoted as follows (this process can be seen in Figure 5):

n_i(m) = | { p ∈ m : x_i is the endpoint nearest to L^E(p) } |,

where L^E(p) represents the LiDAR texture information value at pixel p of region m. Thus, the frequency histogram of region m is calculated by

f_i(m) = n_i(m) / Σ_{k=1}^{r} n_k(m).

After obtaining the frequency distribution in each region, the G-statistic distance measurement is applied to two adjacent regions m and n:

G^E_{m,n} = 2 Σ_{i=1}^{r} [ n_i(m) ln( n_i(m)/ê_i(m) ) + n_i(n) ln( n_i(n)/ê_i(n) ) ],

where ê_i(·) is the expected count of bin i under the pooled distribution of the two regions. The statistical distances on H^E, H^A, and L^A can be obtained correspondingly.
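The histogram-plus-G-statistic comparison can be sketched as follows; the bin count, value range, and the pooled-expectation form of the G-statistic are illustrative assumptions, and the function name is ours.

```python
import numpy as np

def g_statistic_distance(vals_m, vals_n, n_bins=16, lo=0.0, hi=256.0):
    """Two-sample G-statistic between the feature histograms of two regions.

    vals_m / vals_n are the (normalized) texture or area feature values of
    regions m and n; equal-width bins over [lo, hi] follow the histogram
    scheme in the text.
    """
    edges = np.linspace(lo, hi, n_bins + 1)
    cm, _ = np.histogram(vals_m, bins=edges)
    cn, _ = np.histogram(vals_n, bins=edges)
    Nm, Nn = cm.sum(), cn.sum()
    g = 0.0
    for i in range(n_bins):
        pooled = cm[i] + cn[i]
        if pooled == 0:
            continue
        for obs, N in ((cm[i], Nm), (cn[i], Nn)):
            if obs > 0:
                # Expected count if the two regions shared one distribution
                expected = pooled * N / (Nm + Nn)
                g += 2.0 * obs * np.log(obs / expected)
    return g
```

Identical histograms give a distance of zero, and the statistic grows as the two regions' feature distributions diverge, which is exactly the behavior a merging cost needs.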
Since the information contained in each spectral band of H and the LiDAR data L is inconsistent, it is necessary to fuse these distance measures in a weighted manner, where the weights are computed based on the homogeneity of each segmented area [88]. Specifically, if a segmented area has good homogeneity, it should have a high weight; conversely, if the segmented area is heterogeneous, the weight value should be small. In our framework, a locally adaptive approach is implemented:

G^A_{m,n} = Σ_i w_i G^A_{i;m,n},  with w_i ∝ max(f_i),

where max(f_i) is the maximal frequency in the corresponding band (a homogeneous region concentrates its histogram mass in few bins), and G^A_{m,n} and G^E_{m,n} represent the fused area distance and texture distance, respectively.
As indicated so far, the dissimilarity of two adjacent regions can be defined by

D_{m,n} = D^S_{m,n} + δ ( G^A_{m,n} + G^E_{m,n} ),

where δ is a balance factor ensuring that the spectral distance and the statistical distance are of the same order of magnitude. In our experiments, δ is set as 0.001 and the number of interval endpoints r is set as 17.
As mentioned before, the merging cost of regions m and n is related not only to the dissimilarity D_{m,n} but also to the sizes of the regions and the length of their shared edge. The smaller the regions and the longer the shared boundary between them, the more easily the two regions merge. Based on this point of view, the merging cost C_{m,n} of regions m and n is defined as

C_{m,n} = min(S_m, S_n) · D_{m,n} / L_{m,n},

where L_{m,n} is the length of the shared boundary of regions m and n, while S_m and S_n represent the numbers of pixels in regions m and n, respectively.

Region Merging Procedure.
A progressive region merging technique is introduced to effectively alleviate the conflict between oversegmentation and undersegmentation of superpixels and largely guarantee the homogeneity of the final merging map. Oversegmented superpixels ensure the homogeneity of each region, while region merging, which gradually combines two adjacent similar regions, does not introduce an undersegmentation problem thanks to the shearlet-extracted features. Specifically, for the initial oversegmentation map S_N, a data structure is utilized to record each pair of adjacent nodes with their merging cost, and a priority queue (denoted as Q) is built to store all these structures. Based on the queue, the structure with the smallest cost is chosen, and the corresponding two regions (called m and n for simplicity) are obtained as well. Subsequently, all structures related to m and n in the priority queue are removed. By adding all points of region n into region m, some new structures are created to record the reconstructed region m and its neighborhood, and these are then put into the priority queue. This progressive region merging procedure is carried out until the number of regions reaches a predefined value N. At last, the oversegmentation map S_N is gradually transformed into a final merging map M.
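The loop above can be sketched with a lazy-deletion priority queue; this is a minimal illustration, assuming scalar region features and the cost C = min(S_a, S_b)·D/L as a stand-in for the full spectral/area/texture dissimilarity, with all names ours.

```python
import heapq

def merge_regions(means, sizes, adjacency, n_target):
    """Progressive region merging sketch.

    means     : {region_id: scalar feature mean}
    sizes     : {region_id: pixel count}
    adjacency : {(a, b): shared boundary length} with a < b
    Merges the cheapest adjacent pair until n_target regions remain.
    """
    def cost(a, b, length):
        return min(sizes[a], sizes[b]) * abs(means[a] - means[b]) / length

    alive = set(means)
    edge_len = dict(adjacency)
    heap = [(cost(a, b, L), a, b) for (a, b), L in edge_len.items()]
    heapq.heapify(heap)
    while len(alive) > n_target and heap:
        c, a, b = heapq.heappop(heap)
        if a not in alive or b not in alive or (a, b) not in edge_len:
            continue  # stale entry (lazy deletion)
        if c != cost(a, b, edge_len[(a, b)]):
            continue  # cost was updated after this entry was pushed
        # Merge b into a: update the mean, the size, and the neighborhood
        means[a] = (means[a] * sizes[a] + means[b] * sizes[b]) / (sizes[a] + sizes[b])
        sizes[a] += sizes[b]
        alive.discard(b)
        for (u, v) in list(edge_len):
            if b in (u, v):
                length = edge_len.pop((u, v))
                other = v if u == b else u
                if other == a:
                    continue  # the (a, b) edge itself disappears
                key = (min(a, other), max(a, other))
                edge_len[key] = edge_len.get(key, 0) + length
        # Re-push up-to-date costs for every edge still touching a
        for (u, v), length in edge_len.items():
            if a in (u, v):
                heapq.heappush(heap, (cost(u, v, length), u, v))
    return alive
```

Stale heap entries are skipped rather than deleted, which keeps each merge step logarithmic in the queue size, matching the "remove, rebuild, re-insert" description in the text.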
3.4. Structure-Aware Filter Designing. For a point close to an edge, the surrounding labels are more likely to differ between object classes, indicating that the neighboring spatial relationship should be given less consideration. However, when the point is located at the center of a local region, the surrounding objects tend to be the same; thus, the neighboring spatial relationship should be paid more attention. This perspective motivates us to design an adaptive structure-aware filter whose kernel size changes with the distance from a point to the edge.
In fact, it is difficult to obtain accurate edges between objects in HSIs due to the inherently low spatial resolution of remote sensing images. Fortunately, by applying the well-designed shearlet-based gradual region merging scheme to the SNIC oversegmentation map S_N, a final merging map M with fewer space constraint conflicts is achieved, in which the homogeneity of local regions is largely ensured. Meanwhile, the junctions between regions are regarded as edges. In particular, for each point p in M, its region boundary must be a continuous closed curve, which means the number of edge points is limited. Therefore, all spatial distances between this point and its region boundary can be calculated, and the smallest value is selected to form the distance map D. This process can be expressed by the following formula:

D_p = min_{p′∈L} ((p_x − p′_x)² + (p_y − p′_y)²)^{1/2}, (34)

where P is the spatial position matrix of M, p′ represents a point of the region boundary L, and (p_x, p_y) and (p′_x, p′_y) represent the two-dimensional spatial coordinates of points p and p′, respectively. However, the direction from different points p to their nearest boundary point is not fixed, implying that directly using D_p as the filter size may cause the filter to be oversized and introduce disturbing information from other ground objects. As we know, the diagonal of a square is longer than any other inner straight line. In other words, as long as the diagonal length of the adaptive-size filter is less than D_p, the filter centered at point p will not exceed the boundary. Therefore, we convert the distance map D into the so-called scale map S:

S_p = 2⌊D_p/√2⌋ + 1. (35)

In addition, when the point p is at the center of the region, an overly large filter size may contain more outside-region points, which could degrade the feature representation ability.
Thus, a threshold-truncated method is introduced:

S_p = min(S_p, T). (36)

In our experiments, the threshold T is simply set as 55. Figure 6 illustrates three circumstances of the filter size determination procedure. Concretely, the dotted frame centered on p1 is the filter of size 2⌊D_{p1}⌋ + 1, while the solid frame centered on p1 is the filter of size S_{p1}. For the point p2, the dotted frame and solid frame represent the filters without and with the threshold process, respectively. Clearly, the solid frames of p1 and p2 give more precise filter sizes than the two dotted frames. Besides, for a region edge point such as p3, the filter size is only 1 × 1, which also obeys our filter size calculation process.
A final note is that all the points in the scale map S are assigned an odd value ranging from 1 to 55, indicating the filter size at each pixel. For each spatial pixel p, the corresponding structure-aware filter F_p ∈ ℝ^{S_p×S_p} is formulated as

F_p = (1/S_p²) A, (37)

where A represents an S_p × S_p matrix in which all element values are 1.
Obviously, F_p can be considered a mean filter with an adaptive size for each spatial pixel, as visually shown in Figure 1. Hence, the obtained adaptive-size filter achieves structure awareness based on the geometric position of the convolution center. This flexible filter can well protect the differences between objects at the edges while suppressing abnormal points in the center area.
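Equations (34)-(37) can be sketched end to end as below. This is a brute-force illustration, not the paper's implementation: the region boundary is approximated as the pixels whose 4-neighbourhood crosses a region label, the distance and scale maps follow the formulas above, and the structure-aware filter reduces to an adaptive-size local mean.

```python
import numpy as np

def scale_map(region_map, T=55):
    """Distance map (eq. 34) -> scale map (eqs. 35-36) for a merging map.

    region_map: 2-D integer array of region labels (the merging map M).
    Boundary pixels are approximated as pixels with a 4-neighbour of a
    different label. Brute force O((XY)^2), for clarity only.
    """
    X, Y = region_map.shape
    bmask = np.zeros((X, Y), dtype=bool)
    bmask[:-1] |= region_map[:-1] != region_map[1:]
    bmask[1:] |= region_map[1:] != region_map[:-1]
    bmask[:, :-1] |= region_map[:, :-1] != region_map[:, 1:]
    bmask[:, 1:] |= region_map[:, 1:] != region_map[:, :-1]
    bi, bj = np.nonzero(bmask)
    S = np.ones((X, Y), dtype=int)
    for i in range(X):
        for j in range(Y):
            D = np.sqrt((bi - i) ** 2 + (bj - j) ** 2).min()  # eq. (34)
            s = 2 * int(D / np.sqrt(2)) + 1                   # eq. (35)
            S[i, j] = min(s, T)                               # eq. (36)
    return S

def mean_filter_at(image, i, j, s):
    """F_p = A / s^2 applied at (i, j): the mean of the s x s window."""
    h = s // 2
    return image[max(i - h, 0):i + h + 1, max(j - h, 0):j + h + 1].mean()
```

For a boundary pixel, D_p = 0 and thus S_p = 1, matching the 1 × 1 filter at edge points.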

3.5. Feature Extraction and Classification.
Since the edges in the final merging map M may not be exact object edges, classification errors can occur more frequently near the edges. Hence, the formulated structure-aware filter F is solely used for feature extraction rather than for regularization of the classification results. Taking the LiDAR data L as an example, the filtering process at each spatial pixel p can be expressed as

Z_L(p) = (F_p ⊗ L)(p), (38)

where ⊗ is the convolution operator. After applying the convolution procedure to each pixel in P, the feature Z_L ∈ ℝ^{X×Y} can be extracted. Similarly, by applying the convolution procedure to each band of H, the corresponding feature cube Z_H ∈ ℝ^{X×Y×K} can be obtained. By concatenating the features Z_L and Z_H along the spectral direction, the final feature Z ∈ ℝ^{X×Y×(K+1)} is achieved. For classification, the random forest (RF) classifier is chosen, which not only achieves high classification accuracy but also possesses fast computation speed. Meanwhile, RF is robust to overfitting and noise. Notably, RF consists of two steps: randomly selecting repeatable training subsets and building multiple decision trees, which involves bagging sampling techniques. In the experiments, the default subspace size of RF is the floor of the logarithm of the number of features, and the number of trees in the forest is set as 500. Finally, by employing the RF classifier on the extracted feature Z, the classification map C is obtained. The pseudocode of the proposed ShearSAF approach for HSI and LiDAR feature extraction and classification is outlined in Algorithm 1.
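The feature extraction and classification steps above can be sketched as follows, assuming the scale map S has already been computed. The loop applies the per-pixel adaptive mean filter of equation (38) to the LiDAR band and each KPCA band, and scikit-learn's RandomForestClassifier stands in for the RF classifier; its max_features='log2' option approximates the floor-of-logarithm subspace rule, though the paper's exact RF implementation may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def extract_features(H, L, S):
    """Adaptive mean filtering of H (X, Y, K) and L (X, Y) -> Z (X, Y, K+1)."""
    X, Y, _ = H.shape
    stacked = np.concatenate([H, L[..., None]], axis=2)
    Z = np.empty_like(stacked, dtype=float)
    for i in range(X):
        for j in range(Y):
            h = S[i, j] // 2                       # filter half-width
            patch = stacked[max(i - h, 0):i + h + 1,
                            max(j - h, 0):j + h + 1]
            Z[i, j] = patch.mean(axis=(0, 1))      # eq. (38), band-wise mean
    return Z

# 500 trees, as in the experiments; 'log2' approximates the subspace rule
rf = RandomForestClassifier(n_estimators=500, max_features="log2")
```

To classify, Z is reshaped to (X·Y, K+1), rf.fit is called on the labeled pixels, and rf.predict on the remaining ones, yielding the classification map C.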
The computational complexity of our proposed ShearSAF can be divided into three parts. First, the complexities of SNIC and SNIC-guided KPCA are O(XY) and O(XY + (X + Y)²), respectively, while the computational complexity of the shearlet transform is O(XYK log(XY)). Second, since the number of adjacent nodes of a region is limited, the computational complexity of the priority queue is O((XY/N_p) log(XY/N_p)) (N_p = 50 in our experiments). Thus, the complexity of the region merging process is O(((XY/N_p) − N)(XY/N_p) log(XY/N_p)). Finally, the computational complexities of the convolution process and RF classification are O(XYK) and O(XY log(K)), respectively.

Experimental Data and Ablation Analysis
In this section, three real HSI and LiDAR datasets covering diverse areas are used to evaluate the effectiveness of the proposed ShearSAF framework. First, the three HSI and LiDAR datasets are presented. Second, the parameters contained in ShearSAF are analyzed. Third, two ablation experiments are carried out to validate the advantage of the well-designed structure-aware filtering scheme and the superiority of the proposed ShearSAF method over other related filters.

4.1. Datasets
4.1.1. Houston Dataset. The first dataset was captured over the University of Houston campus [18]; the Houston HSI contains 144 spectral bands ranging from 380 to 1050 nm. Each band contains 349 × 1905 pixels with a spatial resolution of 2.5 m. The corresponding LiDAR data has the same spatial size and records the height information of surface materials. Fifteen land-cover classes and 15,029 labeled samples are given in the ground-truth image, as shown in Table 2 and Figure 7.

1: INPUT: raw HSI data R ∈ ℝ^{X×Y×B}, LiDAR data L ∈ ℝ^{X×Y};
2: OUTPUT: the classification map C ∈ ℝ^{X×Y};
3: BEGIN
4: N_p = 50, j_0 = 3, δ = 0.001, r = 17, T = 55;
5: using SNIC on R to obtain the oversegmentation superpixel map S_N with XY/N_p regions;
6: using KPCA on R to obtain the information-focused HSI data H ∈ ℝ^{X×Y×K};
7: using equations (22) and (24) to obtain the texture and area information H_E, L_E, H_A, and L_A;
8: using equation (33) to calculate the merging cost C;
9: conducting the progressive region merging procedure on S_N to obtain the final merging map M ∈ ℝ^{X×Y}, where the number of regions N in M is calculated by equation (39);
10: using equation (34) to obtain the distance map D;
11: using equations (35) and (36) to obtain the scale map S;
12: using equation (37) to obtain the structure-aware filter F_p for each spatial pixel p;
13: using equation (38) to obtain the convolution feature Z ∈ ℝ^{X×Y×(K+1)} from H and L;
14: using the RF classifier on Z to achieve the classification map C;
15: END
Algorithm 1: ShearSAF for HSI and LiDAR feature extraction and classification.

4.1.2. Trento Dataset. The second dataset was acquired over a rural area near the city of Trento, Italy [89]. Each band is 600 × 166 pixels with a spatial resolution of 1 m. Likewise, the LiDAR data has only one band of the same spatial size. The six land-cover classes and 30,414 labeled pixels are listed in Table 3 and Figure 8.

4.1.3. MUUFL Gulfport Dataset. The third dataset was collected over the University of Southern Mississippi Gulf Park campus [90,91]. The spatial size of both the HSI and LiDAR data is 325 × 220 with a spatial resolution of 1 m. After removing eight noisy bands from the original 72 bands of the HSI data, 64 spectral bands are employed in the experiments. The details are given in Table 4 and Figure 9.

4.2. Parameter Setting. In our proposed ShearSAF framework, several parameters should be carefully specified. The scale parameter j_0 for the shearlet transform is set as 3 according to the original paper. The weight parameters γ_1 and γ_2 for SNIC correspond to the spectral and spatial dimensions, respectively, and are thus set as 1/B and 0.5. Meanwhile, the number of interval endpoints r is set as 17 to facilitate the subsequent G-statistic distance computation. The dimension K in KPCA is chosen so that 99.5% of the energy is preserved.
Two parameters in the gradual region merging procedure remain to be determined: the initial number of pixels inside a superpixel block, N_p, in the oversegmentation map S_N, and the number of regions N in the final merging map M. In fact, it is difficult to fix the final number of homogeneous regions N as a single value for different datasets because of the impacts of object distributions, spatial complexity, and so on. Here, we propose a heuristic way to calculate N, given in equation (39), which involves the class number (C), the spatial complexity (σ), and the spatial size (X and Y).
In equation (39), ⌊·⌋ is the floor operator, and σ is defined as follows: the Sobel operator is applied to the three normalized principal components of the HSI and to the normalized LiDAR data to calculate their gradients, and the sum of the absolute gradient values divided by 10^5 is used as the spatial complexity. By this heuristic method, σ is 7.63, 1.44, and 1.25, and N is 3375, 112, and 180 for the Houston, Trento, and MUUFL Gulfport datasets, respectively. To prove the effectiveness of this strategy, we conduct a series of experiments to track the process of gradual region merging and record the overall accuracy (OA, computed by dividing the number of correctly predicted samples by the number of testing ones) as N_p and N vary. Figure 10 shows how the OA varies with the parameters N_p and N for the Houston, Trento, and MUUFL Gulfport datasets. Here, the parameter N_p ranges from 20 to 100 with a step of 10, and the parameter N ranges from 50 to 600 with a step of 50 for the Trento and MUUFL Gulfport datasets, while it ranges from 1000 to 6000 with a step of 500 for the Houston dataset. For the small sample set scenario, only 3, 5, 10, and 15 samples per class are randomly chosen from the labeled set, and the remaining labeled samples are used for testing. Each experiment is executed 20 times to obtain the mean value. It can be seen from Figure 10 that the OA is best when N is, respectively, 3500, 100, and 200 for the three datasets, which are close to the values heuristically calculated by equation (39). Two more observations can be made from Figure 10. First, the OA increases first and then decreases with decreasing N. This is reasonable because adjacent regions with similar objects are merged to improve feature performance at the beginning, while adjacent areas with different objects are merged after N passes the critical value, leading to a decline in classification performance.
The second is that the OA increases slightly as N_p decreases for the three datasets. In fact, the parameter N_p is used to ensure the homogeneity of each oversegmented region, so too many pixels inside a superpixel region would decrease the homogeneity. In the experiments, N_p is set as 50 for the three datasets, which not only keeps the region homogeneity but also speeds up the region merging computation. To be clearer, Figure 11 illustrates the result of the gradual region merging procedure on the three datasets. It can be easily observed that the structure information of various materials is well represented.
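The spatial complexity σ used in the heuristic for N can be sketched as below. The hand-rolled Sobel correlation and the min-max normalization to [0, 1] are our assumptions about details the text leaves open; the division by 10^5 follows the description above.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def _grad(img, kernel):
    """Valid-mode 2-D correlation with a 3x3 kernel (border pixels dropped)."""
    X, Y = img.shape
    out = np.zeros((X - 2, Y - 2))
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * img[di:di + X - 2, dj:dj + Y - 2]
    return out

def spatial_complexity(bands):
    """sigma: summed absolute Sobel gradients of the normalized bands / 10^5.

    bands: list of 2-D arrays, e.g. the three principal components of the
    HSI plus the LiDAR band (min-max normalization is assumed here).
    """
    total = 0.0
    for b in bands:
        b = (b - b.min()) / (b.max() - b.min() + 1e-12)
        total += np.abs(_grad(b, SOBEL_X)).sum() + np.abs(_grad(b, SOBEL_Y)).sum()
    return total / 1e5
```

A flat band contributes nothing to σ, while a scene with many edges drives σ, and hence N, upward.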
Finally, the parameter settings of the proposed ShearSAF approach are summarized in Table 5. All the included parameters can either be preset and kept unchanged across the experimental datasets (such as N_p, j_0, and r) or be heuristically computed (such as N and K); hence, the robustness and generalization ability of ShearSAF are guaranteed, which is a distinct advantage of the proposed approach.

4.3. Ablation Analysis.
In this part, two ablation experiments are carried out to validate the effectiveness and superiority of the proposed structure-aware filtering scheme. On the one hand, our ShearSAF is compared with fixed-size mean filters whose kernel sizes range from 1 to 55 with a step of 2. The resulting OA on the Houston, Trento, and MUUFL Gulfport datasets is illustrated in Figure 12. The four curves from bottom to top (blue, red, green, and black) indicate the performance of the fixed-size mean filters under 3, 5, 10, and 15 training samples per class, respectively; correspondingly, the horizontal dotted lines from bottom to top (blue, red, green, and black) represent the performance of ShearSAF with the same training sets. It can be easily observed from Figure 12 that the OA curves rise while the filter size is relatively small. Analytically, the ability to filter noise and abnormal points improves as the kernel size increases because more neighborhood relations are considered. The OA then drops as the filter size continues to increase, because an overly large filter damages the feature representation at the junctions of objects. Moreover, our ShearSAF approach always shows the best performance, implying that our structure-aware filter design does protect the edges while filtering the noise in the center regions. Besides, it is worth mentioning that the optimal kernel size is inconsistent across datasets: as illustrated in Figure 12, the optimal filter size is 9, 7, and 3 for Houston, Trento, and MUUFL Gulfport, respectively, and thus the filter size is hard to determine in advance in practice.
In contrast, our structure-aware filter design can automatically adjust the filter size according to the well-designed scale map and achieve higher accuracy, indicating the advantage and feasibility of the proposed ShearSAF approach.
On the other hand, our structure-aware design is also examined with other filters, as illustrated in Figure 13. Here, both Gaussian and Gabor filters are taken into consideration. Specifically, ShearSAF-Gaussian means that a two-dimensional (2D) Gaussian filter with structure-aware size is applied to the stacked HSI and LiDAR data. In other words, we obtain the scale map in the same way as in ShearSAF, and each point in the obtained scale map defines the corresponding Gaussian filter size. Then, the structure-aware Gaussian filters are convolved with the stacked HSI and LiDAR data to obtain the related features. Finally, the RF classifier is utilized for classification. Similarly, a series of 2D Gabor filters (four scales and six orientations) with adaptive spatial size is applied to the stacked HSI and LiDAR data for feature extraction, called ShearSAF-Gabor. It can be seen from Figure 13 that ShearSAF-Gabor performs better than ShearSAF-Gaussian on the Trento dataset, while the opposite is observed on the Houston and MUUFL Gulfport datasets. This is reasonable: since the spatial distribution of objects in the Trento dataset (as shown in Figure 8) is more regular than in the other two datasets (as shown in Figures 7 and 9), the features obtained by the 2D Gabor filters with various orientations and scales can be more specific than those extracted by the Gaussian filter.

Table 5: Parameter settings of ShearSAF (e.g., shearlet scale parameter j_0 = 3; number of interval endpoints r in equation (27) = 17).

Experimental Results
In this section, a number of state-of-the-art feature extraction and fusion algorithms are compared with the proposed ShearSAF approach. First, the two simplest methods, the RF classifier on the raw HSI data (denoted Raw-H) and on the concatenation of both HSI and LiDAR data (denoted Raw), are used as benchmarks. Second, three deep learning-based methods are considered for HSI and LiDAR data classification: 3D-CNN (3D convolutional neural network [34], a classic deep learning-based method that simultaneously captures joint spatial-spectral information), miniGCN (minibatch graph convolutional network [92], an emerging method that allows large-scale GCNs to be trained in a minibatch fashion), and SAE-LR (stacked autoencoder with logistic regression [28], an autoencoder-based method that preserves abstract and invariant information in deeper features). Third, five widely used feature extraction and fusion algorithms are also applied to both HSI and LiDAR data: NMFL (nonlinear multiple feature learning-based classification [93], which explores different types of available features in a collaborative and flexible way), EMAP (extended morphological attribute profile [94]), GGF (generalized graph-based fusion [22]), EPCA (a novel ensemble classifier [95]), and OTVCA (orthogonal total variation component analysis [96], which obtains the best low-rank representation and shows strong antinoise ability). For classification, 3 to 15 samples per class are randomly selected from the labeled dataset to form the training set, while the rest are used as the testing set. Each experiment is run twenty times to reduce the effects of random factors, and both the mean values and standard deviations are reported. Besides the OA measure, the kappa coefficient (κ), which reflects the impact of classes, is also adopted to evaluate the classification performance.
Figures 14-16 show the OA and κ of the eleven compared methods, including Raw-H, Raw, 3D-CNN, miniGCN, SAE-LR, NMFL, EPCA, GGF, EMAP, OTVCA, and our ShearSAF, when the number of training samples per class ranges from 3 to 15. It should be noted that the OA obtained with LiDAR data alone is much smaller than that of the other methods; thus, it is not included in the comparison. Generally, classification performance improves as the number of training samples grows for all three datasets. Compared with the Raw method, Raw-H, which uses only HSI data, shows lower classification accuracies, confirming that supplementing LiDAR information can improve HSI classification. Specifically, HSI data provides abundant spectral information for distinguishing materials with different physical properties, while LiDAR provides shape and height information that can distinguish different targets made of the same material. Among all the compared methods, the proposed ShearSAF consistently yields the highest results, which is reasonable since the designed structure-aware filters reduce their size near edges to avoid interclass interference and introduce more neighborhood information at region centers to reduce environmental impacts.
In addition, it should be noted that the deep learning-based methods perform poorly on the three datasets with limited training samples. Specifically, SAE-LR gives the worst performance on the Houston and MUUFL Gulfport datasets, while 3D-CNN performs the worst on the Trento dataset and the second worst on the Houston dataset. As for miniGCN, it is also inferior to the traditional classification methods in most cases. Analytically, deep learning-based methods usually need a great quantity of training samples to iteratively adjust the massive number of parameters during model training, and the small sample sets in our experiments significantly limit their performance. Meanwhile, the training process of deep learning methods also requires considerable time. Furthermore, when there are only five training samples per class, the classification performances, including each class accuracy, OA, and κ of the eleven methods, are summarized in Tables 6-8 for the Houston, Trento, and MUUFL Gulfport datasets, respectively. It can be seen that ShearSAF delivers the best performance in most cases, which favors the superiority of our ShearSAF method. In more detail, considering the C5 class (vineyard) of the Trento dataset, it can be found from the ground-truth map (Figure 8) that the spatial distribution of C5 is very regular, and ShearSAF effectively filters the noise in the area and protects the edges; thus, the accuracy increases from 72.58% for the Raw method to 98.23% for our approach, as illustrated in Table 7. Alternatively, consider the C10 class (yellow curbs) of the MUUFL Gulfport dataset in Table 8: it is scattered across the scene and is even hard to see in Figure 9. Although our method is not optimal for the C10 class, the structure-aware filter does work by reducing its own size and keeping the target information from being interfered with by neighboring objects. To illustrate, the ground truth and the complete classification maps of the eleven compared methods for all three datasets are shown in Figures 17-19. It can be easily observed that our ShearSAF approach stands out among the others, demonstrating the effectiveness of the proposed method. Finally, when there are five training samples per class, the computation time is given in Table 9, recorded on a workstation with a 24-core Intel processor at 2.20 GHz and 128 GB of RAM. As expected, the deep learning-based methods (3D-CNN, miniGCN, and SAE-LR) take more time than the others because model training and parameter optimization require considerable time. It can also be observed that the time cost of our ShearSAF method is less than that of the other methods, mainly because the structure-aware feature extraction procedure is independent of the training set.
That is to say, the feature extraction procedure in the ShearSAF method is executed only once, while the RF classifier has low computational cost; therefore, the proposed ShearSAF method is computationally efficient and applicable to remote sensing images with large spatial sizes, which proves the superiority of our method once again.

Conclusions
In this paper, a newly designed shearlet-based structure-aware filtering approach has been proposed for HSI and LiDAR feature extraction. Specifically, the shearlet transform is applied to the KPCA-reduced HSI and LiDAR data for area and texture feature extraction. Then, the spectral, area, and texture features are used to guide the gradual region merging procedure, which converts the initial oversegmentation map into a final merging map in which the spatial structure of objects is well characterized. By calculating the edge distance in the final merging map, the scale map is acquired and utilized to adaptively select the filter size for convolution. Finally, the RF classifier is used for classification.
In summary, the most important contribution of this article is the structure-aware filtering design. In this process, we innovatively proposed a shearlet-based area and texture feature representation that can effectively measure the dissimilarity between two adjacent regions.