Mapping Tree Species Using Advanced Remote Sensing Technologies: A State-of-the-Art Review and Perspective

Timely and accurate information on tree species (TS) is crucial for developing strategies for the sustainable management and conservation of artificial and natural forests. Over the last four decades, advances in remote sensing technologies have made TS classification possible. Since many studies on the topic have been conducted and their comprehensive results and novel findings have been published, an updated review of the status, trends, potentials, and challenges, together with recommendations for future directions, is needed. This review provides an overview of various optical and light detection and ranging (LiDAR) sensors; presents and assesses various current techniques/methods for TS classification and the general trend in method development; and identifies limitations and recommends future directions. Several concluding remarks were made, including the following: (1) A large group of studies on the topic used high-resolution satellite imagery, airborne multi-/hyperspectral imagery, and airborne LiDAR data. (2) A trend of "multiple"-method development for the topic was observed. (3) Machine learning methods, including deep learning models, were demonstrated to significantly improve TS classification accuracy. (4) Recently, unmanned aerial vehicle- (UAV-) based sensors have caught the interest of researchers and practitioners for topic-related research and applications. In addition, three future directions are recommended: refining the three categories of "multiple" methods, developing novel data fusion algorithms or processing chains, and exploring new spectral unmixing algorithms to automatically extract and map TS spectral information from satellite hyperspectral data.


Introduction
1.1. Significance of Tree Species Information. Over the last four decades, advances in remote sensing technologies have made tree species (TS) classification possible from various remote sensing sensors' data. Timely and accurate information on the status and structure of TS and forest species composition is crucial for developing strategies for sustainable management and conservation of artificial and natural resources. Such TS information is needed for many application purposes. They may include, but are not limited to, aiding forest resource inventories [1], assessing urban living environments (health, wellbeing, and aesthetics) [2], assessing and monitoring biodiversity [3], estimating urban vegetation biomass and carbon storage [4], and monitoring invasive plant species [5]. In this review, in general, TS is at the species level in a general plant classification system (i.e., variant, species, genus, and family); however, due to the different purposes of the reviewed studies using the same or similar remote sensing data and TS classification techniques/methods, studies on classifying tree variants, species groups, and even a few of forest (stand) species and types or urban TS were also reviewed. In practice, to collect the tree canopy structure and species information, there are two traditional ways, including field measurement and aerial photograph interpretation. However, obtaining the information through the traditional methods is usually time consuming and costly/expensive, especially over large areas. Remote sensing technologies, especially satellite remote sensing techniques, have the advantages of overcoming the shortcomings of the traditional methods to rapidly obtain the TS information at a local, regional, or even global scale. However, previous research has proved that accurately classifying individual TS and mapping tree canopy and structure using moderate-resolution satellite data are difficult or even impossible [6][7][8][9][10][11].
During the last couple of decades, remote sensing technologies have advanced in improving spatial and spectral resolutions (e.g., IKONOS with four multispectral (MS) bands at 4 m resolution and one panchromatic (pan) band at 1 m resolution and Hyperion hyperspectral sensor with more than 200 bands at a 10 nm spectral resolution). Very high spatial resolution (VHR) satellite images have demonstrated to be a cost-effective alternative to aerial photography for creating digital maps [12] and mapping TS [13]. Meanwhile, various hyperspectral (HS) data (e.g., Airborne Visible Infrared Imaging Spectrometer (AVIRIS) and Hyperion) have been used to classify and map TS, presenting a certain degree of success (e.g., [10,[14][15][16][17]). More recently, various light detection and ranging (LiDAR) techniques and unmanned aerial vehicle-(UAV-) based sensor techniques have been developed. Such advanced remote sensing techniques, especially VHR satellite remote sensing, have provided opportunities to identify TS and map individual trees (e.g., [18][19][20]).

Review
Objectives. There are six review papers in the existing literature closely related to this review. Table 1 outlines the focuses and objectives of the six review papers. Table 1 shows that, except for the work by Fassnacht et al. [21], all other review papers cover fewer components than this paper, either assessing only the accuracy of individual tree-based forest inventory techniques and LiDAR data alone or reviewing studies focusing on the urban environment only. This paper provides an overview of applications of remote sensing sensors' data to TS mapping, especially the use of VHR satellite MS images, airborne and UAV-based MS and HS images, and airborne LiDAR data. The main objectives of this paper are as follows:
(i) Review high spatial/spectral resolution optical remote sensing sensors/systems and LiDAR sensors used for TS classification
(ii) Review and evaluate suitable data fusion methods and feature characteristics and selection methods for mapping TS
(iii) Provide an overview of, and assess, various techniques/methods for classifying TS
(iv) Present a general trend in method development in TS classification
(v) Summarize limitations of, and offer future directions for, classifying and mapping TS using advanced remote sensing technologies
Table 1: Summary of focuses of existing major review papers related to this review for tree species (TS) classification.
Fassnacht et al. (2016) [21]. Title: Review of studies on tree species classification from remotely sensed data. Focus/objective: quantify general trends in TS classification in remote sensing studies; provide a detailed overview of current methods for TS classification with typical sensor types; identify gaps and future trends for TS classification using modern remote sensing data.
Yin and Wang (2016) [22]. Title: How to assess the accuracy of the individual tree-based forest inventory derived from remotely sensed data: a review. Focus/objective: provide a review of techniques and methods for individual tree study using remote sensing data; summarize key factors that need to be considered to evaluate individual tree level forest inventory products; discuss existing problems and possible solutions in individual tree studies.
Koenig and Höfle (2016) [23]. Title: Full-waveform airborne laser scanning in vegetation studies: a review of point cloud and waveform features for tree species classification. Focus/objective: identify frequently used full-waveform airborne laser scanning-based point cloud and waveform features for TS classification; compare and analyze features and their characteristics for specific tree species detection; discuss limiting and influencing factors on feature characteristics and TS classification.
[24]. Title: Remote sensing in urban forestry: recent applications and future directions. Focus/objective: summarize recent remote sensing applications in urban forestry from the perspective of three distinctive themes: multisource, multitemporal, and multiscale inputs; discuss the potential of remote sensing to improve the reliability and accuracy of mapping urban forests.

1.3. Review Approach.
Although a total of 231 peer-reviewed journal papers in English related to TS classification and mapping with various remote sensing sensors' data were reviewed in this study, only 153 papers are directly cited in the paper (the remaining 78 papers reported studies with the same sensors' data and the same or similar methods and techniques for TS classification as those directly cited). Given that fewer qualifying papers were published in the time span of 1980-2000, this review focuses more on papers published after 2000, especially after 2015. The ISI Web of Science and Google Scholar databases were searched for relevant papers published during the last four decades based on the following terms in the title (of a paper): (remote sensing OR LiDAR OR UAV) AND (vegetation OR tree OR plant OR forest) AND (classifi* OR map* OR identi* OR discriminat*). About 600 studies met these conditions. The studies were then further filtered to form the final list of 231 papers for review, based on the following criteria, all of which had to be met:
(i) A study must discriminate at least two TS (or species groups)
(ii) A study must create and/or present TS classification and mapping results
(iii) A study must investigate the effect(s) of spatial and/or spectral resolution, temporal coverage, data preprocessing, feature extraction methods, and/or TS classification techniques/methods on TS classification
(iv) A study must report detailed accuracy assessment results for TS classification
This review begins by introducing the significance and importance of TS classification and the review objectives and approach. Then, advanced remote sensing sensors/systems suitable for TS classification and mapping are reviewed. Next, techniques and methods of TS classification are reviewed and assessed.
As a trend of method development in TS mapping and classification, three categories of "multiple" methods are presented and evaluated. Finally, after limitations and constraints of current techniques and methods are identified and discussed, three future directions for improving TS classification are recommended and discussed.

Advanced Remote Sensing Sensors Suitable for Tree Species Classification

NIR radiation is reflected through multiple scattering within the internal leaf cellular structure, and SWIR energy is absorbed by water and other biochemical constituents [26]. Plant foliar and canopy spectral variabilities among different species, or even within single tree crowns, are due not only to differences in internal leaf structure/morphology and biochemicals (e.g., thickness of cell walls, water, and photosynthetic pigments) [27][28][29][30] but also to variation in canopy structure/morphology (e.g., leaf and branch density and clumping) [31,32]. In addition, the spectral variabilities are also attributed to differences in the phenology/physiology of plant species and in background signals associated with bare soil, litter, herbaceous vegetation, epiphyll cover, and herbivory [30,33,34]. The varying biochemical contents and structural properties among different TS are also dependent upon the measured wavelength, pixel size, and ecosystem type [28]. Since modern remote sensing technologies, such as HS sensors, allow the identification of plant absorption features that may be associated with different plant species or varieties [35,36], it is critical to find the best wavelengths for species identification in HS remote sensing. For example, in identifying invasive species in Hawaiian forests from native and other introduced species by remote sensing, Asner et al. [37] demonstrated that the observed differences in canopy spectral features among the different plant species are related to relative differences in measured biochemicals, structural properties, and canopy leaf area index (LAI). Crown texture information at the pixel level or the single tree crown level has also been explored to improve TS classification; it is mainly related to crown-internal shadows and structure, foliage properties, and branching.
Such texture information, at relatively coarser scales, is also associated with crown size, crown closure, crown shape, forest type, and canopy structure and morphology, which are the main drivers producing texture information in passive optical sensors' data. Studies combining spectral with texture features often improve the accuracy of TS classification (e.g., [34,38]). The phenology trait is also useful for plant species identification. Phenology includes very obvious processes with seasonal changes, such as autumn leaf color changes due to leaf senescence (mainly related to changes in various leaf pigments) in deciduous and evergreen coniferous forests, and leaf-on and leaf-off changes in deciduous forests. Since phenology varies with plant species, it is ideal to use a multitemporal optical remote sensing technique to align the image acquisition time with the phenological period of the tree species under investigation [39,40]. For example, in discriminating TS at different taxonomic levels using multitemporal WorldView-3 (WV3) imagery in Washington D.C., USA, Fang et al. [1] observed that fall senescence is the most valuable phenological stage for TS classification. Therefore, selecting an ideal time-point for acquiring an image to capture phenology-related information helps increase the accuracy of TS mapping.
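Crown texture is typically quantified with gray-level co-occurrence matrix (GLCM) statistics (e.g., contrast, homogeneity, entropy) computed over a window or crown object and stacked with the spectral bands. The sketch below is a minimal pure-NumPy illustration of one such feature (GLCM contrast for horizontal neighbor pairs only); operational studies would normally use a library such as scikit-image with multiple offsets and angles.

```python
import numpy as np

def glcm_contrast(img, levels=4):
    """Gray-level co-occurrence matrix (GLCM) contrast for horizontal
    pixel pairs at distance 1 -- one of the Haralick texture features
    often stacked with spectral bands for tree species classification."""
    glcm = np.zeros((levels, levels), dtype=float)
    # Count horizontally adjacent gray-level pairs (symmetric matrix).
    for i, j in zip(img[:, :-1].ravel(), img[:, 1:].ravel()):
        glcm[i, j] += 1
        glcm[j, i] += 1
    glcm /= glcm.sum()                      # normalize to joint probabilities
    r, c = np.indices(glcm.shape)
    return float(np.sum(glcm * (r - c) ** 2))

# Toy 4-level "crown" patches: a smooth crown has lower GLCM contrast
# than a speckled one (e.g., strong crown-internal shadowing).
smooth  = np.zeros((8, 8), dtype=int)
speckle = np.indices((8, 8)).sum(axis=0) % 4
print(glcm_contrast(smooth), glcm_contrast(speckle))
```

A per-crown feature vector would then concatenate such texture statistics with the mean band reflectances of the crown object.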
2.1.2. Active LiDAR Data. LiDAR sensors may be used to measure structural properties and biophysical parameters of forests at the single-crown and canopy levels from either the ranges or the intensities recorded by the sensors [41,42]. Such attributes and parameters may include plant height, forest density and basal area, aboveground biomass, and LAI at both the single-tree and stand levels [43][44][45][46]. These attributes and parameters can vary within and between tree species and are at least partly complementary to passive optical sensors' data for TS classification (e.g., [41]). LiDAR-derived tree height information alone may be of limited value for TS discrimination [21]. Combining optical VHR image data with LiDAR data has been demonstrated to be a very effective strategy for monitoring forest stands, identifying individual tree crowns, and mapping TS (e.g., [41,42,[47][48][49]). LiDAR intensity differences among TS are mainly caused by differences in leaf structure, with broadleaf trees featuring larger individual leaves that form a more continuous surface than that of needle-leaf coniferous trees [50].

Optical Remote Sensing (Multi-/Hyperspectral) Sensors/Systems. Currently, the set of optical sensors/systems suitable for remote sensing of TS includes moderate- and high-spatial-resolution and high-spectral-resolution airborne and satellite sensors/systems. Table 2 summarizes frequently used optical moderate- to high-spatial-/high-spectral-resolution sensors/systems by name and characteristics, including band setting and platform.

Moderate Spatial Resolution Sensors.
Typical moderate-resolution satellite MS sensors include the Landsat series, the Satellite Pour l'Observation de la Terre (SPOT) High-Resolution Visible (HRV) series, the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER), and, more recently, Sentinel-2 (Table 2), which can potentially be used for forest inventories. Initial attempts with such sensors' images for forest inventories were limited to mapping coarse stand condition and stand density, such as with SPOT HRV and Landsat TM/ETM+ data [6][7][8][9][51]. Such satellite imagery could also be used to map forest habitats or ecosystems, such as with ASTER, Landsat TM/ETM+, or Sentinel-2 [52][53][54]. However, it is difficult or even impossible to accurately map individual TS with imagery at this resolution, and a few studies reported TS mapping results with moderate-resolution data that illustrate this point (e.g., [11,55]). For example, in mapping forest stand species (11 species) using Sentinel-2 imagery and environmental data in the Polish Carpathians, Grabska et al. [56] demonstrated the potential of Sentinel-2 image data with terrain information for mapping stand species over large mountainous areas with a high accuracy of 85%. Gillespie et al. [11] showed that Landsat OLI imagery could map species richness per ha with an accuracy of 42% (based on extracted NDVI images). However, the contribution of moderate-resolution imagery to mapping TS is mainly made by coupling it with other sensors' data and/or nonsensor data; that is, moderate-resolution imagery can aid other sensors' data to improve TS mapping accuracy, such as Landsat TM combined with HS imagery [10], Landsat OLI combined with aerial images [55], and Sentinel-2 MSI combined with VHR satellite images (GeoEye-1 and WV3) [57].
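The NDVI used in studies such as Gillespie et al. [11] is a simple normalized band ratio of near-infrared and red reflectance. A minimal sketch with hypothetical reflectance values (the numbers are illustrative only):

```python
import numpy as np

def ndvi(nir, red, eps=1e-9):
    """Normalized Difference Vegetation Index from NIR and red reflectance.
    Values near +1 indicate dense green canopy; bare soil is near 0."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    return (nir - red) / (nir + red + eps)   # eps guards against divide-by-zero

# Toy reflectance values: a broadleaf canopy pixel vs. a bare-soil pixel.
canopy = ndvi(0.45, 0.05)   # high NDVI
soil   = ndvi(0.25, 0.20)   # low NDVI
print(round(float(canopy), 3), round(float(soil), 3))
```

The same function applies unchanged to whole band arrays (e.g., Landsat OLI bands 5 and 4), yielding an NDVI image.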
Compared with moderate-resolution satellite data used to classify forest types and map tree species composition, VHR satellite data have proven their potential to improve TS classification. For instance, using IKONOS imagery acquired over the Iwamizawa region in central Hokkaido, northern Japan, Katoh [58] mapped 14 tree species/groups and obtained an average accuracy (AA) of 52%. Using QuickBird data to map four leading tree species, Mora et al. [60] obtained an overall accuracy (OA) of 73% for the TS classification. WV2/3 images have shown a greater potential for mapping TS than other VHR satellite sensors. Fang et al. [1] used multitemporal WV3 imagery to classify TS and obtained an OA of 61.3% for classifying the 19 most abundant tree species and 73.7% for mapping the ten most abundant genera. Many studies have also mapped TS with high accuracy using other VHR satellite sensors' imagery, such as Pléiades [34,65], GeoEye-1 [57,66], RapidEye [5,62], and Gaofen-2 [64,67].

High (Hyper-) Spectral Resolution Sensors.
During the last three decades, various hyperspectral remote sensing (HRS) sensors/systems onboard aircraft and satellites have provided new remote sensing data for classifying TS, with a focus on utilizing subtle spectral information. The most widely used HRS sensors for TS classification are summarized in Table 2; they mainly include airborne sensors (AVIRIS, CASI, HYDICE, and HyMap) and satellite sensors (Hyperion and CHRIS). Researchers have demonstrated the capability of HRS sensors' data for identifying and mapping forest TS and species composition and have achieved great success.
Compared to satellite HRS sensors' data, various airborne HRS data have so far been more useful and important for TS classification [10,33]. Buddenbaum et al. [14] mapped coniferous TS with HyMap data using geostatistical methods and achieved a classification accuracy of OA = 78%. In mapping urban forest species, Xiao et al. [68] utilized AVIRIS image data to successfully discriminate three forest types with an OA of 94% and to identify 16 TS with an OA of 70%. Alonzo et al. [41] used AVIRIS image data coupled with LiDAR data to map 29 common tree species in Santa Barbara, California, USA, producing species-level and leaf-type-level maps with OAs of 83.4% and 93.5%, respectively. Liu et al. [42] also used combined CASI HS data and LiDAR data to map 15 common urban tree species in the City of Surrey, British Columbia, Canada; their results indicate OAs of 51.1%, 61.0%, and 70.0% using CASI, LiDAR, and the combined data, respectively. There are only a few studies that directly use satellite HS sensors' data for mapping TS, due to the small number of satellite HS sensors in operation (e.g., [16,69,70]). Using Hyperion imagery to classify tropical emergent trees in the Amazon Basin, Papeş et al. [71] concluded that, when using 25 selected narrow bands and considering only pixels that represented >40% of tree crowns, the classification was 100% successful for the five taxa. Dyk et al. [72] tested multitemporal CHRIS image data for mapping forest species (three dominant species) and stand densities (five density classes); the mapping accuracy (OA) reached about 90%. Similar work using airborne and satellite HS sensors' data (Table 2) for TS classification has been reported in [15,17,36,[73][74][75][76][77][78][79][80][81][82].
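One common way to exploit the subtle spectral information of HS data at the mixed-pixel scale, and the basis of the spectral unmixing direction recommended in this review, is linear spectral unmixing: each pixel spectrum is modeled as a linear combination of endmember spectra. Below is a minimal unconstrained least-squares sketch with hypothetical endmember spectra (operational work would use fully constrained or multiple endmember unmixing).

```python
import numpy as np

def unmix(pixel, endmembers):
    """Linear spectral unmixing: solve pixel ~= E @ a by least squares,
    then clip negatives and renormalize abundances to sum to 1
    (a simple stand-in for fully constrained least squares)."""
    a, *_ = np.linalg.lstsq(endmembers, pixel, rcond=None)
    a = np.clip(a, 0.0, None)
    return a / a.sum()

# Hypothetical 5-band endmember spectra (columns): two tree species and soil.
E = np.array([[0.10, 0.12, 0.05, 0.60, 0.30],    # species A
              [0.08, 0.10, 0.04, 0.45, 0.25],    # species B
              [0.20, 0.25, 0.30, 0.35, 0.40]]).T # soil background
true_a = np.array([0.5, 0.3, 0.2])               # true fractional abundances
pixel = E @ true_a                               # noise-free mixed pixel
print(np.round(unmix(pixel, E), 3))
```

With real HS imagery the endmembers would come from image-derived pure pixels or a spectral library, and noise would make the recovery approximate rather than exact.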

LiDAR Sensors.
The most basic quantity measured by LiDAR sensors is the distance between the sensor and targets. There are two kinds of LiDAR systems: if the system records the reflected energy at every distance, it is called full-waveform LiDAR; if the system only records the XYZ coordinates of the first and last energy peaks, it is a discrete-return LiDAR. Most commercial LiDAR systems are discrete-return systems. There are also recently developed dual- and triple-wavelength LiDAR systems [25]. In forest inventory and research, LiDAR data are commonly used to generate forest structure parameters and analyze vertical structure properties. In addition, LiDAR sensors can also measure reflected energy from target surfaces and record features of the return signal such as amplitude, frequency, and phase [83].
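The forest structure parameters mentioned above are typically computed as simple statistics over a height-normalized point cloud, per crown or per plot. A minimal sketch with synthetic return heights (the metric names and the 2 m cover threshold are illustrative conventions, not a standard):

```python
import numpy as np

def canopy_metrics(z, ground=0.0):
    """Per-crown LiDAR height metrics commonly fed to TS classifiers:
    height percentiles, max/mean height, and canopy cover above 2 m."""
    h = np.asarray(z, dtype=float) - ground          # normalize to height above ground
    p25, p50, p95 = np.percentile(h, [25, 50, 95])
    return {"h_max": float(h.max()), "h_mean": float(h.mean()),
            "p25": float(p25), "p50": float(p50), "p95": float(p95),
            "cover_gt2m": float(np.mean(h > 2.0))}   # fraction of returns above 2 m

# Synthetic discrete-return heights for one crown (metres).
rng = np.random.default_rng(0)
z = np.concatenate([rng.uniform(8, 15, 80),   # canopy returns
                    rng.uniform(0, 1, 20)])   # ground/understorey returns
m = canopy_metrics(z)
print({k: round(v, 2) for k, v in m.items()})
```

In practice `z` would come from a classified point cloud (e.g., a LAS file) after ground normalization with a digital terrain model.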
Although LiDAR-derived height information alone may be used to discriminate TS, most successful applications of LiDAR data for TS classification combine it with optical VHR image data, as demonstrated in many studies (e.g., [42,47,48]). Researchers have reported TS mapping studies integrating VHR satellite images with airborne LiDAR data to demonstrate the additional ability of LiDAR data to improve TS mapping accuracy. For example, in mapping 15 common urban tree species in the City of Surrey, British Columbia, Canada, Liu et al. [42] evaluated the potential of LiDAR and CASI data and obtained OAs of 51.1%, 61.0%, and 70.0% using CASI, LiDAR, and the combined data, respectively, suggesting that LiDAR alone yields limited TS mapping accuracy, but combining LiDAR with other optical sensors' data can significantly improve it. By combining high spectral and spatial resolution optical data with LiDAR-derived data (tree height and the standard deviation of tree height within tree crowns), Voss and Sugumaran [84] and Alonzo et al. [41] concluded that the accuracy of mapping urban TS increased significantly (by 19% and 4.2%, respectively). The combination of VHR optical sensors' data with LiDAR data improves TS identification because of the synergy of VHR data offering sufficient spectral and spatial/textural information and LiDAR data providing vertical profile/structural information, which together are helpful for TS classification.
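The optical-plus-LiDAR synergy described above is usually realized by stacking per-crown spectral and structural features and training a supervised classifier. A hedged sketch on synthetic two-species data (the band means, class separability, and random forest settings are all assumptions chosen for illustration, not values from any cited study):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 200
# Hypothetical per-crown features: 4 mean spectral band reflectances (optical) ...
spectral_a = rng.normal([0.10, 0.30, 0.20, 0.60], 0.03, (n, 4))  # species A
spectral_b = rng.normal([0.10, 0.25, 0.20, 0.45], 0.03, (n, 4))  # species B
# ... plus a LiDAR-derived crown height (species A taller on average).
height_a = rng.normal(22, 2, (n, 1))
height_b = rng.normal(14, 2, (n, 1))

# Stack optical and LiDAR features into one feature matrix per crown.
X = np.vstack([np.hstack([spectral_a, height_a]),
               np.hstack([spectral_b, height_b])])
y = np.array([0] * n + [1] * n)

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3,
                                      random_state=0, stratify=y)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xtr, ytr)
print("test accuracy:", clf.score(Xte, yte))
```

Dropping the height column from `X` and retraining would mimic the optical-only case, typically lowering accuracy when species overlap spectrally, which is the pattern reported by Liu et al. [42].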

Unmanned Aerial Vehicle- (UAV-) Based Sensors.
Recently, unmanned aerial vehicle- (UAV-) based remote sensing systems have come to represent a low-cost, flexible, and autonomous option, and thus they may be an alternative platform to satellites and aircraft for forest inventory and research [85][86][87][88][89]. Specifically, compared with satellite and airborne remote sensing data acquisition, the UAV-based techniques have many advantages, including (1) the possibility of collecting remote sensing data under undesirable imaging conditions, e.g., under cloud cover; (2) cost-efficient data collection with the desired spatial, spectral, and temporal resolutions; and (3) the unrestricted operational areas from which UAV systems can take off [90]. Usually, given a lower flight altitude than conventional aerial platforms, UAV-based systems offer a finer spatial resolution [86]. UAV-based sensors/systems can not only provide two-dimensional (2D) image data at high spatial/spectral resolutions but also offer 3D data created from overlapping UAV-based photogrammetric point clouds and the digital surface model (DSM). For mapping TS from UAV-based data, different image features might be used to quantify the characteristics of different TS [91].
The capability of VHR images (both 2D and 3D) that UAV-based systems can offer provides an opportunity and potential for improving individual TS mapping. For example, in mapping ten urban tree species using UAV-based RGB optical images and deep learning methods, Zhang et al. [92] achieved an OA of 92.6%. Schiefer et al. [20] also used UAV-based RGB imagery (<2 m resolution) to assess the potential of VHR RGB imagery and convolutional neural networks (CNNs) for mapping TS in temperate forests. Given that air-/space-borne remote sensing sensors currently cannot provide comparable spatial resolutions, these research results highlight the key role that UAV systems can play in accurately mapping forest TS. It has been demonstrated that UAV-LiDAR data and 3D UAV-based VHR images can be used to extract accurate tree height information at both the single-tree and forest-stand levels (e.g., [93,94]) and thus improve TS classification. Cao et al. [95] separated different mangrove species using a combination of UAV-based HS imagery and height data from a UAV-derived DSM; they indicated that the tree height information extracted from the DSM is effective for identifying mangrove species. In mapping 12 major tree species in a subtropical forest area in Southern Brazil, Sothe et al. [90] tested the potential of combining UAV-based photogrammetric point cloud data with HS data and achieved an OA of 72.4% with an optimal feature combination as input. Similar studies with UAV-based VHR 2D and 3D images to identify or map TS were conducted by many other researchers, such as in [91,[96][97][98][99][100]. However, when considering the advantages of UAV-based systems over

Techniques and Methods of Mapping Tree Species

3.1. Reference Data Collection. Reference data, commonly called training and test/validation data, are necessary for reliable and accurate tree species (TS) classification and mapping projects. To select and collect effective reference data, certain criteria have to be followed to ensure that the data can achieve the objectives, answer the questions, or test the hypotheses under investigation. Based on statistical requirements and on Fassnacht et al.'s review [21], there are five criteria to follow:
(1) Based on the TS classification unit (pixel based or object based) and the variation within individual TS, the sample size of the reference data has to meet the requirements of a minimum number per species and a balanced number of samples across individual TS.
(2) The selected samples have to be fully representative of their populations and the spatial extent under investigation.
(3) The distributions of the selected reference data should match the underlying assumptions of the applied methodology.
(4) The observation errors of the reference data should be known, and their potential influence on the final TS classification results should be discussed.
(5) Samples in the reference data should not spatially overlap.
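Criterion (1), balanced sample sizes across species, can be enforced by drawing a fixed number of samples per class from a labeled reference pool. A minimal sketch (the species names and counts are hypothetical):

```python
import numpy as np

def balanced_sample(species, n_per_class, seed=0):
    """Draw an equal number of reference samples per species
    (balanced sample sizes across TS). Returns selected indices."""
    rng = np.random.default_rng(seed)
    idx = []
    for s in np.unique(species):
        pool = np.flatnonzero(species == s)          # indices of this species
        idx.append(rng.choice(pool, size=n_per_class, replace=False))
    return np.concatenate(idx)

# Toy reference labels with a strong class imbalance
# (e.g., photo-interpreted crowns dominated by one species).
labels = np.array(["oak"] * 120 + ["pine"] * 40 + ["maple"] * 40)
sel = balanced_sample(labels, n_per_class=30)
print({s: int(np.sum(labels[sel] == s)) for s in np.unique(labels)})
```

Criterion (2) would additionally require stratifying the draw over the spatial extent, which this sketch does not attempt.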
According to the five basic criteria, there are two approaches to collect the reference data: a field approach and an interpretation/delineation approach based on high-resolution images.
3.1.1. Field Approach. By referring to the TS classification unit and a selected sampling method (random, systematic, or typical sampling), samples (plots) are first deployed in the study area; then, a set of tree parameters is measured and recorded with the necessary instruments (e.g., GPS) and tools (e.g., altimeter) in each plot. If the samples are collected basically following the five criteria above, the reference data selected and collected using this approach are "real ground truth" data and should support a sound accuracy assessment of the estimated parameters (e.g., TS classification). However, due to time limitations, labor, and study area accessibility, often only a small sample size can be guaranteed [22]. In most published studies on TS mapping with remote sensing image data, the reference data collected in the field may include tree (species) position, tree height, crown size, and DBH [34,48,[103][104][105][106][107]. With the field approach, the tree position is usually measured at the center of the tree stem using a GPS with relatively high spatial accuracy (e.g., [108,109]) or directly marked on a high-resolution hard-copy image (e.g., [19]). In addition, recent terrestrial laser scanning and UAV-based remote sensing techniques may be used for collecting reference data, such as tree species and their 3D properties.

3.1.2. High-Resolution Digital/Imaging Interpretation Approach. If a high-resolution image (either digital or hard copy) is available and a visual interpreter is able to identify TS on the image and is familiar with the study area, the reference data can be directly selected and delineated from the image by the image interpretation approach. The reference data collected with this approach may be called "pseudo-ground truth" data but can function as real ground truth data. Reference data for tree position can be directly delineated based on either a VHR optical image [110][111][112][113] or a LiDAR-derived CHM image [114,115]. Compared to the field approach, the image interpretation approach is relatively convenient to perform, and a large sample size for training and validation is usually guaranteed. However, the quality of reference data collected with this approach depends on the quality of the high-resolution image (spatial resolution better than 1 m) and the interpreter's experience in identifying individual TS in the study area. Thus, it is necessary and better to collect some field data for assessing the reliability and accuracy of the reference data delineated/interpreted from an image (optical or CHM). Therefore, in many studies, researchers adopted a so-called hybrid approach to collect a qualified reference data set, which first uses the image interpretation approach to collect a set of reference data and then brings the delineated reference data to the field for checking/verification using the field approach (e.g., [77,109]). With the hybrid approach, errors from the intrinsic georeferencing of the high-resolution images and from the interpreter's ability can be avoided or alleviated to a certain level after field checking and verification.

Data Fusion.
Data fusion in the remote sensing community is defined as a process that combines a single sensor's data at different levels, or multiple sensors' images, to form a new image with refined/improved information for decision-making [18,116]. For TS classification, data fusion is commonly performed at two processing levels: pixel and feature (image object (IO)) [117]. The literature review shows that a large number of image fusion examples are at the pixel level. Image fusion at the feature level requires first generating IOs through an image segmentation process, which can be performed in the different sensors' images; features corresponding to the characteristics reflected in the different sensors' images are then extracted for each IO. According to the literature reviewed for this paper, the most popular data/image fusion methods for improving TS mapping fall into three categories: (1) spatial sharpening of a single sensor's or multiple sensors' images, (2) fusion of different sensors' images or multisource data, and (3) spatiotemporal data fusion methods. Table 3 summarizes the characteristics, advantages/limitations, and major factors of each category of methods. The first two categories of data fusion methods have been demonstrated to be effective in improving TS mapping by researchers, such as in [19,112,118,119].

Table 3 (excerpt) — data/image fusion methods for improving TS mapping:

(1) Spatial sharpening:
- Sharpening with a single sensor's data: one high-resolution pan band and several low-resolution MS bands of the same sensor [59,112].
- Sharpening with two different sensors' data: two sensors' images cover the same spatial area at high and low resolutions; the low-resolution image must be resampled to the higher resolution so that the two images have the same size, and the two images must be spatially registered, before running a sharpening algorithm (e.g., PCS or GSS); the sharpened image has a nominal high resolution but retains its multispectral property [120].

(2) Fusion with different sensors or multisource data:
- Optical and optical sensors' data: different optical sensors provide band images at different spatial resolutions and cover different spectral regions, but both images cover the same spatial area; the complementary data sets improve classification; the low-resolution image must be resampled to the higher resolution, and good spatial registration between the two data sets is needed [10,55].
- Optical sensor and LiDAR data: the optical sensor's data offer sufficient spectral and spatial/textural information while the LiDAR data provide vertical profile/structural information; the complementary data sets improve classification; good spatial registration between the two sensors' data is needed [19,47,49].
- Optical sensor, LiDAR, and ancillary data: optical or LiDAR data offer spectral, spatial/textural, or vertical profile/structural information, while ancillary data provide information related to TS spatial distribution; the ancillary data must be digitized, and good spatial registration between the sources is needed [56,57,59].

(3) Spatiotemporal data fusion (STDF) methods with different spatial and temporal resolutions in sensor data:
- STARFM-based methods: based on the assumption that the ratio of a coarse pixel's reflectance to that of neighboring similar pixels does not change over time; widely used STDF models that preserve spatial detail, but they fail in heterogeneous landscapes and cannot predict short-term change events; a major factor is determining whether the area is dominated by temporal or spatial variance [121][122][123].
- Spectral unmixing-based methods: based on the assumption that the reflectance of each coarse spatial-resolution pixel is a linear combination of the responses of all endmembers within the coarse pixel; endmembers are easy to obtain by grouping similar pixels, but short temporal-interval events may be missed because only one or two high-resolution images are used to cluster similar pixels; a major factor is clustering high-quality similar pixels from high-resolution images for extracting endmembers [124,125].

Pansharpening with a single sensor's data is mostly done at the pixel level. Such a sharpened image has a nominal pan high resolution, but its MS properties may differ slightly from the original MS properties. In segmenting images to create optimal tree crown IOs, the use of the sharpened image can improve the quality of IOs matching individual tree crowns compared with directly using the low-resolution MS image [118]. With multisensor images (Landsat TM, ALOS AVNIR-2, WV2, and LiDAR), Kamal et al. [49] compared and contrasted the ability of these sensors' images to map five levels of mangrove features, including species communities, and concluded that the pansharpened WV2 and LiDAR images could be used to map detailed individual mangrove tree crowns and mangrove species communities.
Using pansharpened WV3 images from the dry and wet seasons, Ferreira et al. [112] conducted a comprehensive assessment of the two seasonal images for identifying eight tree species in tropical semideciduous forests. They concluded that the combined use of texture analysis, pansharpening, and individual tree crown (ITC) delineation is a promising approach for TS mapping in tropical forests with WV3 data. Korznikov et al. [66] utilized a GeoEye-1 pansharpened RGB image at a spatial resolution of 0.46 m to accurately discriminate TS. With two different optical sensors' data (usually one with high spatial resolution and the other with MS bands at low spatial resolution) and a spatial-sharpening algorithm (e.g., PCS or GSS), a spatially sharpened image may be created. To perform the process, the low-resolution MS band imagery first needs to be resampled to the higher resolution so that the two sensors' images have the same size; the same pansharpening procedure described above is then applied to create the spatially sharpened image. For example, when assessing the MS spectral properties of spatially sharpened images created with different sharpening algorithms (Hue-Intensity-Saturation, PCS, and High-Pass Filter), Chavez et al. [120] fused Landsat TM MS bands with the SPOT Pan band to create a 10 m resolution sharpened image. Although no studies directly using the spatial-sharpening method to improve TS mapping were found in the existing literature, given the potential of improved spatial resolution and retained MS spectral properties, the sharpening technique should be useful for improving TS mapping in practice.
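The principal component sharpening (PCS) idea discussed above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the reviewed studies; the function name and array layout are our assumptions. The pan band is histogram-matched to the first principal component of the resampled MS image and substituted for it before inverting the transform.

```python
import numpy as np

def pca_sharpen(ms, pan):
    """Principal component sharpening (PCS) sketch.

    ms : (bands, H, W) multispectral image already resampled to the pan grid
    pan: (H, W) high-resolution panchromatic band
    Returns a (bands, H, W) sharpened image.
    """
    b, h, w = ms.shape
    X = ms.reshape(b, -1).astype(float)      # bands x pixels
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu
    # PCA via eigendecomposition of the band covariance matrix
    cov = Xc @ Xc.T / (Xc.shape[1] - 1)
    vals, vecs = np.linalg.eigh(cov)
    order = np.argsort(vals)[::-1]           # sort by descending variance
    vecs = vecs[:, order]
    pcs = vecs.T @ Xc                        # principal components
    # Match the pan band's mean/std to PC1, then substitute it for PC1
    p = pan.reshape(-1).astype(float)
    p = (p - p.mean()) / p.std() * pcs[0].std() + pcs[0].mean()
    pcs[0] = p
    # Invert the orthogonal transform to recover sharpened bands
    Xs = vecs @ pcs + mu
    return Xs.reshape(b, h, w)
```

In operational work, the PCS/GSS tools in standard image processing software would be used rather than this sketch.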

Fusion with Different Sensors or Multisource Data.
In general, three types of methods may be used for data fusion: with different optical sensors' images, with optical and LiDAR sensors' images, and with optical and/or LiDAR sensors' and ancillary data, respectively. In TS mapping practice, optical sensors' images at different resolutions (mostly spatial but also spectral) are often combined to produce TS classification results that are often much better than those created with any individual sensor's data. For example, Hauglin and Ørka [55] tested the ability to identify two tree species of the same genus by using Landsat-8 OLI images, aerial images, and LiDAR data. Their results indicate that the overall accuracy (OA) varied from 0.53 to 0.79, with the highest accuracy achieved using a logistic regression model with combined data derived from Landsat OLI imagery and aerial photos. In mapping forest species composition and tree plantations in Northeastern Costa Rica, Fagan et al. [10] demonstrated the potential of combining hyperspectral (HS) HyMap with multitemporal Landsat TM imagery to accurately classify (1) forest types and (2) tree plantations by species composition. Their results indicate that the combination of occasionally acquired HS data with widely available multitemporal Landsat imagery enhanced the mapping and monitoring of reforestation in tropical landscapes.
In this literature review on TS classification with remote sensing images, many studies combining optical sensors' images with LiDAR data to improve the accuracy of TS classification were found. Usually, in this case, the optical sensor's data offer sufficient spectral and spatial/textural information while the LiDAR data provide vertical profile/structural information, a combination that should be beneficial for TS classification. LiDAR point cloud data are widely used to produce digital surface models (DSM) and digital terrain models (DTM) with high accuracy and then, by subtracting the DTM from the DSM, to create absolute height data (e.g., a canopy height model (CHM)) for segmentation (e.g., of individual tree crowns) and classification (e.g., of TS). In practice, such CHM data, coupled with spectral/textural features derived from optical sensors' data, are very helpful for improving TS mapping. For instance, to map 29 common urban tree species in Santa Barbara, California, USA, Alonzo et al. [41] fused high spatial resolution (3.7 m) HS AVIRIS imagery with LiDAR data at the ITC object level. The additional LiDAR information led to an increase in TS classification accuracy of 4.2% over spectral data alone. When Zhang et al. investigated the effectiveness of TS classification of urban forests by integrating airborne HyMap HS imagery and LiDAR data, their results proved that individual TS discrimination in urban forests can be achieved by combining object-based LiDAR segmentation of tree crowns with HS spectral signatures. More such studies can be found in the literature, including [61,82,119,126].
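The DSM/DTM differencing step described above can be sketched as follows, assuming the two models are already co-registered 2-D arrays; the helper name and the clipping of small negative artifacts to zero are illustrative assumptions.

```python
import numpy as np

def canopy_height_model(dsm, dtm, min_height=0.0):
    """Derive a canopy height model (CHM) from LiDAR-based surface models.

    dsm, dtm : 2-D arrays of the digital surface and terrain models on the
    same grid. Heights below min_height (small negatives from interpolation
    noise) are set to zero, i.e., treated as ground.
    """
    chm = dsm.astype(float) - dtm.astype(float)
    chm[chm < min_height] = 0.0
    return chm
```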
With the aid of ancillary data, such as DEM-derived topographic information, the accuracy of TS classification using an optical sensor's and/or a LiDAR sensor's data can be improved. This may be due to a synergy between the sensors' data, which provide spectral and spatial/textural information and/or vertical profile/structural information, and the ancillary data, which offer information possibly related to the geographic distribution of TS; together, these should be beneficial for TS mapping. For example, Lim et al. [57] developed a model using multiple optical sensors (Sentinel-2, GeoEye-1, and WV2/3) and GIS data (topographic information) to map the five dominant tree species in North Korea. The results show high TS classification accuracies of 91% (in the model calibration area) and 90% (in the test area), respectively, suggesting the model's potential applicability throughout the Korean Peninsula and a broader region. To map 11 forest stand species at a regional scale with machine learning algorithms by integrating multitemporal Sentinel-2 products and DEM data, Grabska et al. [56] extracted spectral features from the Sentinel-2 products and four topographic variables (elevation, slope, aspect, and distance to water bodies) from the DEM data. Their results demonstrate the potential of Sentinel-2 imagery and topographic data for mapping forest tree species in a large mountainous area with high accuracy (OA > 85%). Ke and Quackenbush [59] also proved that ancillary topographic data were very helpful in improving forest TS mapping and tree crown delineation using Quickbird imagery.

Spatiotemporal Data Fusion Methods.
It would be ideal for mapping and monitoring surface biophysical parameters (e.g., evapotranspiration, NDVI, vegetation, and crops) to have remote sensing data at both high spatial and high temporal resolution. However, a single satellite sensor cannot provide such data due to the trade-off between spatial and temporal resolutions. During the last two decades, several spatiotemporal data fusion (STDF) methods have been developed and used to create high spatiotemporal resolution remote sensing data, which potentially address this challenge. Usually, STDF methods fuse data of relatively high temporal but low spatial resolution (e.g., MODIS and AVHRR) with data of relatively low temporal but high spatial resolution (e.g., Landsat and Sentinel) to obtain a new data set at both high spatial and (continuous) temporal resolution. There are two general categories of STDF methods: (1) methods based on the spatial and temporal adaptive reflectance fusion model (STARFM) and (2) methods based on unmixing theory [127]. The first category assumes that the ratio of a coarse pixel's value (radiance or reflectance) to that of neighboring similar pixels does not change over time, while the second assumes that the value of each coarse spatial-resolution pixel is a linear combination of the responses of all endmembers within the coarse pixel. The STARFM [121] is one of the most widely used STDF models; it fuses MODIS daily surface reflectance data with 16-day Landsat surface reflectance data to generate synthetic daily Landsat-like surface reflectance data. Since the STARFM might not be very sensitive to heterogeneous landscapes, many improved models were developed, such as ESTARFM [122] and FSDAF [123]. ESTARFM improves data fusion accuracy in heterogeneous landscapes, while FSDAF is suitable for heterogeneous landscapes and can maintain spatial detail within an image scene [128].
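The temporal-change idea underlying STARFM can be shown with a deliberately simplified toy sketch. The full STARFM additionally weights spectrally similar neighboring pixels within a moving window, which is omitted here, and all inputs are assumed to be resampled to the fine grid; the function name is illustrative.

```python
import numpy as np

def stdf_predict(fine_t1, coarse_t1, coarse_t2):
    """Toy spatiotemporal fusion based on the core STARFM assumption.

    Predicts a fine-resolution image at t2 from a fine image at t1 and
    coarse images at t1 and t2 (all arrays on the same fine grid): the
    temporal change observed in the coarse series is transferred to the
    fine image. The similar-pixel weighting of the full algorithm is
    intentionally omitted.
    """
    return fine_t1 + (coarse_t2 - coarse_t1)
```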
The unmixing-based STDF methods may include ISTBDF (improved spatiotemporal Bayesian data fusion) [125] and STDAF (spatial temporal data fusion approach) [124]. The second category of methods can obtain an accurate prediction of surface reflectance by grouping similar pixels and calculating the average reflectance.
Based on the literature review of applications of STDF methods, most studies were found not to target TS classification directly but rather the mapping and monitoring of other surface biophysical parameters with synthetic data, mostly created by fusing MODIS-Landsat data into high temporal resolution Landsat-like (30 m) data series. However, given the potential of STDF methods to create high-resolution data series able to characterize an ITC's size (spatial scale) and phenology (temporal dynamics), STDF methods could further improve TS identification and mapping if the synthetic data series were available at a spatial resolution better than 5 m.

Feature Types and Selection
3.3.1. Characteristics and Types of Features. In classifying and mapping TS with various remote sensing sensors' data, a set of input features or variables (also called explanatory, independent, or predictor variables) for calibrating/fitting a variety of models first needs to be determined and extracted from the sensors' data. Generally, five types of features can be determined and extracted from different optical and LiDAR sensors' data for developing models to classify and map TS (Table 4): (1) spectral bands, (2) spectral/vegetation indices, (3) transformed components or vectors, (4) spatial/textural features, and (5) LiDAR-based vertical/geometric and intensity features. The characteristics, sources, and typical references of the five feature types are summarized in Table 4. Spectral bands in digital number or reflectance can be directly selected from different optical sensors' images and used in different TS classification models. Many studies on TS classification and mapping directly use original spectral band features, such as original band spectra selected from multispectral (MS) sensors' data [13,129,130] and from hyperspectral (HS) sensors' images [42,95].
The second type of feature is spectral vegetation indices (VIs), which can be constructed from two or more spectral bands through an arithmetic operation such as a ratio, a normalized difference ratio, or other forms. Depending on the application, the advantages of VIs include ease of use and the normalization of background effects on target (e.g., tree canopy) spectra, such as reducing atmospheric and soil background effects on tree canopy spectra [26]; hence, VIs usually perform much better than their individual spectral bands in remote sensing applications. VIs may be divided into two groups, one constructed from MS sensors' data and the other from HS sensors' images, and both groups have been used in most studies on mapping TS. For example, Liu et al. [42] demonstrated that the anthocyanin content index and the photochemical reflectance index were the most important HS features for identifying TS in the spring season when they mapped urban tree species by integrating airborne HS imagery and LiDAR data. Xu et al. [131] reported that integrating time-series GF-1 NDVI images, vegetation growth status, and canopy information simultaneously is beneficial to mapping TS at a landscape scale.
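As a concrete example of a normalized difference ratio VI, the widely used NDVI can be computed from near-infrared and red reflectance bands; this is a minimal sketch, and the small epsilon guarding against division by zero is our addition.

```python
import numpy as np

def ndvi(nir, red):
    """Normalized difference vegetation index: (NIR - red) / (NIR + red),
    the archetypal normalized-difference ratio VI."""
    nir = nir.astype(float)
    red = red.astype(float)
    return (nir - red) / (nir + red + 1e-12)  # eps avoids divide-by-zero
```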
When high-dimensional source data (e.g., from HS sensors) are used for TS classification, transformed components or vector features are usually adopted. The remaining feature types are summarized below (Table 4, excerpt):

- Multispectral (MS) VIs: constructed from two or more MS bands; typical MS VIs can be found in Table 1 of [132] and Table 2 of [118].
- Hyperspectral (HS) VIs: use two or more HS bands to construct a ratio, a normalized difference ratio, or other forms of VIs by an arithmetic operation.
- Principal component analysis (PCA): a linear combination of the high-dimensional raw data that reduces dimensionality while preserving as much of the variance in the raw data as possible in the first several PC images; usually the first several PCs are adopted [133,134].
- Minimum noise fraction (MNF): a linear combination of the high-dimensional raw data that reduces dimensionality while preserving minimal noise (maximal signal-to-noise ratio) in the first several MNF images; usually the first several MNFs are adopted [70,90].
- Canonical discriminant analysis (CDA): searches for a linear combination of independent variables achieving maximum separation of classes (populations); usually the first 2-5 canonical variables are adopted [15,41,135].
- Wavelet transform (WT): decomposes spectral signals with scaled and shifted wavelets; the energy of the decomposition coefficients computed at each scale forms an energy feature vector that serves as a dimension-reduced feature [75,76,136].
- Textural features: a small area with a wide variation of discrete tonal features has texture as its dominant property; texture features are based on first- and second-order gray-level statistics, with slightly different definitions for pixel-based and IO-based analyses; typically, there are 5 pixel-based first-order and 8 second-order textural features (Table 2 of [137]), while IO-based textural features are given in Table 2 of [118].
- Spatial/geometric features: describe the shape or geometric form of an IO and are therefore extracted for IO-based analysis only; see the definitions of typical spatial/geometric features extracted from IOs in Table 1 of [138].
- LiDAR-derived vertical/geometric features: related to tree height and tree crown shape, usually extracted from normalized LiDAR data; see Table 1 of [25].
- LiDAR-derived intensity features: the maximum energy of the backscattered echo, representing the reflectivity of every point measured by the laser scanner; single-channel and multichannel intensity features derived from LiDAR data are given in Tables 2 and 3 of [25].

Such transformed features often have low dimensionality and little or no redundancy or collinearity among the selected features. In mapping TS with different remote sensing sensors' images, the commonly used transformation algorithms include PCA, MNF, CDA, and wavelet transforms. For example, in discriminating between ten major urban tree types with HS AISA imagery, Jensen et al. [133] used features (PCs, VIs, and band means) extracted or transformed from the AISA imagery and achieved an OA of 82% when only PCs were used. Leutner et al. [134] also calculated PC images from Hyperion HS imagery and used this type of feature, together with LiDAR-derived features, for mapping forest diversity and floristic composition. Zhang et al. [77] studied the potential of TS classification of urban forests using the HyMap HS image and LiDAR data.
They applied an MNF transformation to the HyMap data, reducing the data from 118 original bands to 20 significant MNF components to increase the effectiveness of the TS classification. Alonzo et al. [15,41] produced canonical variables transformed from AVIRIS spectral bands and LiDAR-derived structural metrics with the CDA transform method for identifying and mapping Santa Barbara's urban tree species and proved the effectiveness of the canonical variables in mapping TS. Zhang et al. [76] calculated wavelet energy vectors from HYDICE HS imagery to examine crown-level spectral variation within and between tropical tree species at La Selva, Costa Rica.
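The PCA transform used in several of these studies can be sketched with an SVD on a (samples x features) matrix; the interface is illustrative, and operational work would typically rely on image processing software rather than this snippet.

```python
import numpy as np

def pca_reduce(X, k):
    """Reduce feature dimensionality by PCA: project the (samples, features)
    matrix X onto its first k principal components, which preserve the
    largest share of the variance in the raw data."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the PC directions,
    # ordered by descending singular value (i.e., descending variance)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T
```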
Texture features are based on first- and second-order gray-level statistics, and the definitions of textural features are slightly different between pixel-based and IO-based classifications. A spatial feature extracted from optical sensors' images describes the shape or geometric form of an IO, so spatial/geometric features are extracted for IO-based analysis only. Since most optical sensors' (MS or HS) images used for mapping TS or species composition have a high spatial resolution, the first- and second-order gray-level textures can be extracted and used for both pixel-based and IO-based classification of TS. High-resolution optical sensors' images can record textural/contextual information related to the characteristics of the leaves (e.g., shape, size, and arrangement), branches (e.g., orientation), crown (e.g., shape, size, and coverage), and canopy (e.g., pattern) of individual TS, so most studies on identifying and mapping TS with high-resolution sensors' data select and use various textural/spatial features in either pixel-based or IO-based TS mapping projects (e.g., [13,38,95,139]).
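First-order gray-level texture statistics of the kind discussed above can be sketched as a sliding-window computation; this illustrative example computes only local mean and variance, and second-order (GLCM) textures are omitted.

```python
import numpy as np

def local_texture(img, win=3):
    """First-order textural features from gray-level statistics in a
    sliding window: local mean and variance at every pixel. Edges are
    handled by replicating the border values."""
    h, w = img.shape
    r = win // 2
    pad = np.pad(img.astype(float), r, mode="edge")
    mean = np.empty((h, w))
    var = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            patch = pad[i:i + win, j:j + win]
            mean[i, j] = patch.mean()
            var[i, j] = patch.var()
    return mean, var
```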
LiDAR-based features consist of vertical/geometric and intensity metrics. Given the different vertical/geometric structures and reflected spectral properties of individual TS, features related to tree height, tree crown shape, and the intensity returned from target surfaces (ground and tree crown/canopy), extracted from LiDAR data, should be helpful for mapping TS. A detailed list of LiDAR-based features (vertical/geometric and intensity) and their characteristics is presented and well described in [25], and LiDAR-based features have been extensively used in TS mapping studies. Identifying and mapping TS with LiDAR-derived features and other feature types extracted from optical sensors' data in urban environments has been well reviewed in [24,83]. In other environments, there are many studies on TS classification with LiDAR- and optical sensor-derived features in the literature, such as on an island [95], in a subtropical forest area [90], and in a forest farm [139].
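LiDAR vertical metrics of the kind catalogued in [25] can be sketched for a single crown segment's normalized point heights; the exact metric set varies between studies, so this selection (maximum, mean, standard deviation, and height percentiles) is illustrative.

```python
import numpy as np

def lidar_height_features(z):
    """Common LiDAR vertical metrics for one crown segment, given its
    normalized (above-ground) point heights z: maximum, mean, standard
    deviation, and a few height percentiles."""
    z = np.asarray(z, dtype=float)
    return {
        "h_max": z.max(),
        "h_mean": z.mean(),
        "h_std": z.std(),
        **{f"p{q}": float(np.percentile(z, q)) for q in (25, 50, 75, 95)},
    }
```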

Selection of Features.
Reasons or purposes of feature selection include lowering feature dimensionality, removing or reducing data redundancy and collinearity, and increasing the efficiency of TS classification and mapping. The result of feature selection is a subset selected from all features. Currently, the most popular feature selection methods may be divided into three types: (1) decision tree (i.e., classification and regression tree (CART)) or random forest (RF), (2) correlation-based feature selection (CFS), and (3) stepwise variable selection. Using the features ranked by importance values calculated by CART/RF algorithms, a subset of features is selected from all input features with a threshold. For example, Xie et al. [38] used the importance values of input features/variables calculated by the RF classifier to select a subset of important features to efficiently map TS. As a traditional filter feature selection approach, CFS first calculates the feature-class (TS) and feature-feature correlation matrices using training samples and then searches for a feature subset using the best-first search algorithm based on the redundancy between features [140]. To reduce the redundant bands of HS images to improve the classification of mangrove species, Cao et al. [95] utilized the CFS method coupled with CART to select features: CART was used to identify spectral bands based on their importance values for discriminating between species, while CFS was used to further select the most effective features. Based on the principle of automated forward and backward stepwise feature selection, a stepwise discriminant analysis (SDA) method can be applied to select features for TS classification. The SDA algorithm, in general, minimizes the within-class (TS) variance while maximizing the between-class variance for a given significance level.
Many researchers adopted the SDA method to select a subset of features from all input features to improve the efficiency of TS mapping, such as in [34,118,129,141]. Besides the selection methods discussed above, several traditional statistical tests of the distance/separability between two classes (TS) are also often used to select feature subsets, such as the Mann-Whitney U-test, the Jeffries-Matusita distance, and the Bhattacharyya distance [142].
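The CFS merit heuristic can be sketched as follows; this is a simplified illustration using absolute Pearson correlation as the feature-class and feature-feature correlation measure, whereas CFS as originally described uses symmetrical uncertainty for discretized features. The function name and interface are ours.

```python
import numpy as np

def cfs_merit(X, y, subset):
    """CFS merit of a feature subset: mean feature-class correlation,
    penalized by feature-feature redundancy.

    X      : (n_samples, n_features) training features
    y      : class labels encoded as numbers
    subset : list of candidate feature indices
    """
    k = len(subset)
    # mean feature-class correlation
    rcf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return rcf
    # mean pairwise feature-feature correlation (redundancy)
    rff = np.mean([abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                   for a, i in enumerate(subset) for j in subset[a + 1:]])
    return k * rcf / np.sqrt(k + k * (k - 1) * rff)
```

A best-first search over subsets would then maximize this merit; the sketch only scores a given subset.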

Tree Species Classification Methods and Algorithms
3.4.1. Pixel-/Object-Based Classifications. The pixel is the basic unit of remote sensing imagery, and thus traditional classification methods use the individual pixel as the classification unit (pixel-based classification), analyzing the spectral signature over the pixel area. However, with the increase in the spatial resolution of modern remote sensing sensors, the size of a pixel has gradually become smaller than that of an object [18]. Mapping TS using VHR imagery therefore poses a new challenge: the spectral response from a single tree crown (ITC) is affected by variation in illumination within the ITC and by background effects, which reduces the accuracy of traditional per-pixel classifications [143]. To overcome this problem, region-based or object-based classification was developed. The object-based image analysis (OBIA) technique first uses image segmentation to generate discrete regions or image objects (IOs) that are more homogeneous within themselves than with nearby regions (e.g., an ITC), and these IOs rather than pixels are then used as the classification unit [144,145].
An IO-based classification method can potentially improve classification accuracy compared to a pixel-based method, which may be explained by four reasons [118,138,146]: (1) Partitioning an image into IOs is similar to the way humans conceptually organize a landscape to comprehend it. (2) Besides the spectral features available in a pixel-based method, texture and contextual features and shape/geometric features associated with IOs can be utilized. (3) Objects of interest can be extracted at different levels (i.e., different scales), and these levels may be represented in an analysis system, such as the multilevel system method evaluated in Section 4. (4) Environmental variables, such as the terrain features of elevation, slope, and aspect, may also be used. It is worth noting that the OBIA method is mostly suitable for VHR data, such as IKONOS and WV2, but not for moderate- or coarse-resolution sensors' data.
Existing studies on mapping TS composition have demonstrated the advantages of OBIA methods (e.g., [59,74,84,118,147][148][149][150][151][152]). For example, the studies of both Kim et al. [150] and Immitzer et al. [149] obtained a 13% higher OA with OBIA by using an optimal segmentation of VHR image data (IKONOS and WV2) to identify (7 and 10) TS/stands compared to the pixel-based method. Shojanoori et al. [152] also utilized WV2 imagery and the OBIA classification method to map three urban TS. They concluded that the IO-based classification results were much better than those created with the pixel-based classification, with an improvement in OA of up to 16.54%.

Table 5 (excerpt) — traditional classifiers and machine learning (ML) methods:

- Linear spectral mixture model (LSM): a linear combination of the spectral signatures of surface materials with their corresponding areal proportions as weighting factors; the endmembers (EMs, e.g., TS) are the same for each pixel in an image. Simple and providing a physically meaningful measure of TS abundance within mixed pixels, but LSM cannot account for subtle spectral differences between TS, and the maximum number of EMs is limited by the number of bands; a major factor is defining and extracting representative EM training spectra (TS) [153,154].
- Multiple endmember spectral mixture analysis (MESMA): the number of EMs (TS) is not limited by the number of bands and is allowed to vary across pixels; a series of candidate 2-/3-EM models is evaluated for each pixel, and an optimal model is finally adopted based on selection criteria. The limitations of LSM are overcome.
- Spectral angle mapper (SAM): calculates the spectral similarity between a reference and a test spectrum as the "angle" between the two spectra, treating both as vectors in the spectral space of an MS or HS image. This similarity measure is insensitive to gain factors (illumination), but SAM does not consider the variation within a pixel; the imaging data must have been reduced to "apparent reflectance" [14,82,158].
- k-nearest neighbor (k-NN): a nonparametric method in which a pixel/IO is classified by a majority vote of its neighbors and assigned to the most common class among its k nearest neighbors. Simple, effective, and appropriate for samples that cross multiple classes, but highly affected by the representativeness of the training samples for each class; a major factor is the determination of k [96,146,159].
- Random forest (RF): a nonparametric classifier for both feature selection and target classification; satisfactory results depend upon determining the "best" tree structure and decision boundaries. Computationally fast, less sensitive to overfitting, and outputs variables' importance, but the "best" tree structure and decision boundaries are not easy to find, and RF might overfit in the presence of noisy data [17,34,160].
- Support vector machine (SVM): a nonparametric classifier that maps data from spectral space into a feature space, wherein continuous predictors are partitioned into binary categories by an optimal n-dimensional hyperplane. Handles high-dimensional data efficiently, deals with noisy samples robustly, and uses only the support vectors to construct classification models, but the data mapping procedure is relatively complicated; major factors are the selection of kernel function parameters and mapping the data from the original input feature space to a kernel feature space [38,107,139].

Spectral Mixture Analyses.
Compared with the size of an ITC, many optical remote sensing sensors' images have a relatively low spatial resolution, such as the Landsat series and SPOT series MS images. In this case, several ITCs may be mixed in a single pixel. Therefore, to effectively map TS and estimate species abundance from such moderate-resolution sensors' images, it is necessary to conduct a spectral mixture analysis to estimate the abundance of individual TS within the mixed pixels. To map TS abundance with relatively low-resolution images, a linear spectral mixture model (LSM) is adopted. In LSM, the spectral signature of a mixed pixel is assumed to be a linear combination of the spectral signatures of surface materials with their corresponding areal proportions as weighting factors [153]. Since the LSM method is easy to use, it has been widely and successfully applied for mapping the abundance of individual TS with optical sensors' images. For instance, Liu and Wu [154] calculated pixel-weighted representative tree crown spectra using the LSM model with HS imagery and proved that the pixel-weighting approach is effective for automatically classifying four TS at the crown level. However, the LSM model has three limitations: (1) The number of endmembers is the same for each pixel in an image.
(2) LSM cannot efficiently account for subtle spectral differences among endmembers. (3) The maximum number of endmembers is limited by the number of bands. Therefore, Roberts et al. [155] developed multiple endmember spectral mixture analysis (MESMA), which overcomes the limitations of the simple LSM. With MESMA, the number of endmembers is not limited by the number of bands and is allowed to vary for each pixel in the image. The characteristics and advantages/limitations of the LSM and MESMA algorithms are summarized in Table 5. The MESMA model has been used in a variety of environments for vegetation mapping, including TS mapping. For example, Roberts et al. [155,156] used MESMA with AVIRIS HS imagery to classify vegetation species and land cover types in southern California chaparral, USA.

Table 5 (excerpt, continued) — neural network methods:

- Artificial neural network (ANN): able to estimate the properties (patterns and trends) of data based on limited training samples, but the nature of the hidden layers is poorly known, and it takes time to find an ideal set of structure parameters; a major factor is testing and finding a set of ideal architecture parameters [38,92,107].
- Convolutional neural networks (CNNs): constructed from neurons and links with learnable weights and biases; each neuron receives a weighted sum of several inputs, passes it through an activation function, and responds with an output; a common CNN architecture consists of an input layer, stacks of convolution and pooling layers, a fully connected layer, and an output layer. CNNs efficiently process multidimensional images and often lead to better classification results than other classifiers, but the model structure is complex, and tools/software for different CNN architectures are often not available or accessible; a major factor is building the convolutional layers [107,119,161].
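The LSM abundance estimation for one mixed pixel can be sketched as a least-squares problem; this is a minimal illustration with an illustrative interface, and the clip-and-renormalize step is a simple stand-in for properly constrained (nonnegative, sum-to-one) unmixing.

```python
import numpy as np

def lsm_unmix(pixel, endmembers):
    """Linear spectral mixture (LSM) sketch: estimate endmember abundances
    for one mixed pixel by least squares, then clip to nonnegative values
    and renormalize so the fractions sum to one.

    pixel      : (bands,) observed mixed spectrum
    endmembers : (bands, n_em) spectra of the candidate TS endmembers
    """
    frac, *_ = np.linalg.lstsq(endmembers, pixel, rcond=None)
    frac = np.clip(frac, 0.0, None)
    s = frac.sum()
    return frac / s if s > 0 else frac
```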

3.4.3. Traditional Classifiers/Methods.
In early studies on remote sensing-based TS discrimination, the most commonly used traditional classification methods/techniques included parametric classifiers (maximum likelihood classifier (MLC), linear discriminant analysis (LDA), and logistic regression (LR)) and nonparametric classifiers (spectral angle mapper (SAM) and k-nearest neighbor (k-NN)) (e.g., [30,73,74,79,157]). These methods were relatively easy to use, and corresponding tools/software were available and accessible in commonly used image processing packages. During the last decade, although these traditional methods/techniques could still be directly used to classify TS with selected features (e.g., [41,118]), they were mostly used as standard or reference methods in TS classification to indicate the advance and innovation of newly developed methods and algorithms. For example, the performance of the traditional classifiers MLC, LR, and k-NN was compared to that of the advanced classifiers random forest (RF) and support vector machine (SVM) [38,55,95,96,149,151]. Table 5 summarizes the most frequently applied classification methods and algorithms, both traditional and advanced, along with their characteristics, advantages/limitations, and the major factors discussed in the reviewed studies.
3.4.4. Advanced Classifiers/Models. After 2010, while traditional MS classifiers (e.g., MLC, LDA, and k-NN) were still used to map tree species with features extracted from optical sensors' and LiDAR data, more advanced algorithms/methods were developed and received more attention for their improvements to TS classification and mapping. The representative advanced methods/techniques fall into two categories: machine learning (ML) methods and convolutional neural network-based methods (i.e., deep learning methods) (Table 5). The ML models include RF, SVM, and artificial neural network (ANN), all nonparametric classifiers that can take multiple input variables/features (e.g., spectral, textural, spatial/geometric, VI, and other ancillary variables) [17,34,38,160]. The convolutional neural network- (CNN-) based methods include different CNN architectures/models, such as AlexNet, VGG-16, ResNet-50 [3,92], and 3D-CNN [107,139]. More recently, CNN models have been widely used in image classification tasks, including TS classification (e.g., [3,119,161]). Usually, the advanced methods produce higher image classification accuracy than other classification methods. This trend is associated with modern computational capacities in hardware and with free tools/software available in open-source environments, such as R and Python.
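As a concrete illustration of how such nonparametric ML classifiers take multiple per-crown input features, the sketch below trains an RF and an SVM with scikit-learn on purely synthetic features; the feature values and the four "species" classes are invented for illustration and are not drawn from any reviewed study:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(42)

# Synthetic per-crown features (e.g., band means, a texture measure,
# a vegetation index) for 4 hypothetical species; values are invented
n_per_class, n_features = 100, 6
X = np.vstack([rng.normal(loc=c, scale=0.8, size=(n_per_class, n_features))
               for c in range(4)])
y = np.repeat(np.arange(4), n_per_class)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Overall accuracy (OA) on the held-out crowns for each classifier
results = {}
for clf in (RandomForestClassifier(n_estimators=200, random_state=0),
            SVC(kernel="rbf", C=10.0)):
    name = type(clf).__name__
    results[name] = accuracy_score(y_te, clf.fit(X_tr, y_tr).predict(X_te))
    print(name, f"OA = {results[name]:.2f}")
```

In practice, the feature matrix would hold the extracted spectral, textural, geometric, and VI variables per crown or object, and the labels would come from field reference data.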

"Multiple" Method Development in Tree Species Mapping
During the last decade, a trend of "multiple" method development in tree species (TS) mapping and identification with different remote sensing sensors' images was observed. The "multiple" in "multiple" methods here means multisensor, multitemporal (multiseason), and multilevel classification system methods. Compared to "single" methods, such "multiple" methods frequently produce higher accuracy in classifying and mapping TS. Table 6 summarizes the characteristics, advantages and limitations, major control factor(s), and sample studies of the three categories of "multiple" methods.

Table 6: Summary of "multiple" methods currently used for tree species (TS) mapping.

Multisensor method
- Characteristic and description: integrating multiple sensors' images (combinations of different optical sensors (satellite, aircraft, and UAV based), or optical sensor(s) combined with LiDAR) is a synergy of different sensors' images that provide different spatial, spectral, band-setting, textural, and geometric information, offering new potential for improving TS classification and mapping.
- Advantage and limitation: multiple sensors' images can improve TS mapping accuracy through an efficient combination of different spectral, spatial, and textural/geometric features, leading to higher classification accuracy; however, the method requires complex image processing (e.g., data fusion techniques) and some sensors' data are costly.
- Major factor: resampling, registration, and normalization processes are needed between different sensors.
- Examples: [10,41,57,162]

Multitemporal method
- Characteristic and description: using two or more seasons' (dates') images to classify TS, which is expected to increase TS mapping accuracy; an optimal seasonal dataset, or a combination of two or more seasonal images, can be selected for TS classification.
- Advantage and limitation: multitemporal image data align the image acquisition times with the phenological cycles of the different TS under investigation; however, multiseason image acquisition usually comes with higher costs and greater image processing demands.
- Major factor: resampling, registration, and normalization processes are needed between different seasonal images.
- Examples: [34,40,160,163]

Multilevel classification system method
- Characteristic and description: a hierarchical spatial organization of objects (classes) in a landscape or image scene, from a larger landscape/cover type unit into smaller objects or component units (e.g., TS).
- Advantage and limitation: matches the logical structure of surface cover classification strategies and enhances the relative spectral and textural differences among similar cover classes at a higher level; however, it needs more computing power and requires determining thresholds at different levels.
- Major factor: defining and determining thresholds at different levels.
- Examples: [34,49,68]

4.1. Multisensor Methods.
The multisensor method can be described as the use of multiple sensors' image data by efficiently integrating the different spatial, spectral, band-setting, textural, and geometric information offered by different optical sensors and LiDAR. Based on the literature review, the method offers new potential for improving TS classification and mapping compared to using a single sensor's data. However, the method often requires a complex image processing chain (e.g., data fusion algorithms) and a high cost for some sensors' data (e.g., VHR satellite images and LiDAR data). This trend is attributed to the many sensor data sources that are free or low cost to access and to the significant improvement of computing power for fusing, processing, and modeling data from various sources [24]. The multisensor method may be further divided into two types: (1) the multiple optical sensor method and (2) the optical sensor(s) and LiDAR data combination method. The multiple optical sensor method involves the integrated use of different sensors' images (satellite-, aircraft-, and UAV-based MS and HS sensors) at different spatial, spectral, and temporal resolutions, while the optical sensor(s) and LiDAR combined method pairs different optical sensor(s) with the LiDAR sensor. For example, Fagan et al. [10] examined the potential of a combination of HS HyMap and multitemporal Landsat TM imagery to accurately identify (1) general forest types and (2) tree plantations by species composition. Their experimental results indicate that HS and multitemporal data could be effectively integrated to discriminate tree plantations from secondary forests and mature forests with an OA of 88.5%. By integrating different spatial resolution sensors' data (Sentinel-2, GeoEye-1, and WV3), Lim et al. [57] developed a model to map the five dominant TS in North Korea using machine-learning classifiers.
The integrated model used for classifying forest types in Goseong-gun (South Korea) was relatively accurate (80%), and thus, the model might be utilized to produce a map of dominant TS in Goseong-gun (North Korea).
The majority of studies with the multisensor methods belong to the second type: optical sensor(s) coupled with LiDAR data. Due to their strengths in providing spectral/textural details and 3D geometric features, moderate- to high-resolution satellite/airborne (including UAV-based) multi-/hyperspectral imagery and airborne LiDAR data are suited to estimating and mapping forest structure, species, and species composition. The integration of optical sensors' images and LiDAR data allows TS to be classified more accurately in a complex environment. For instance, by fusing HS AVIRIS imagery with high point density LiDAR data, Alonzo et al. [41] could clearly identify 29 common urban tree species at the crown level. Beyond the AVIRIS imagery alone, the additional LiDAR data increased the OA by 4.2%, and by up to 10% for small-crowned species. In a study testing how VHR GeoEye-1 MS imagery and LiDAR point clouds can improve the classification of five common urban TS, Roffey and Wang [162] obtained an OA of 85.08% when features extracted from both data sources were combined, compared to 77.73% and 71.85% when using LiDAR features and GeoEye-1 features alone, respectively. Dian et al. [164] also demonstrated that combining HS AISA imagery with LiDAR data to identify TS led to higher accuracy than using features extracted from AISA imagery only.
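The feature-level combination these optical-plus-LiDAR studies use can be sketched simply: per-crown spectral features and per-crown LiDAR structural features are column-stacked into one feature matrix before classification. The array names and numbers below are illustrative, not taken from any cited study:

```python
import numpy as np

def stack_crown_features(spectral, lidar):
    """Column-stack per-crown spectral features (e.g., band means from
    an optical sensor) with LiDAR structural features (e.g., height
    metrics from a CHM). Rows must refer to the same crowns, which in
    practice requires co-registration of the two data sources.
    """
    assert spectral.shape[0] == lidar.shape[0], "one row per crown"
    return np.hstack([spectral, lidar])

# 3 hypothetical crowns: 4 spectral features + 2 LiDAR features each
spectral = np.array([[0.12, 0.10, 0.45, 0.50],
                     [0.15, 0.13, 0.40, 0.44],
                     [0.11, 0.09, 0.48, 0.55]])
lidar = np.array([[12.3, 0.8],   # max height (m), crown shape ratio
                  [ 8.1, 0.6],
                  [15.6, 0.9]])
features = stack_crown_features(spectral, lidar)
print(features.shape)  # (3, 6)
```

The resulting matrix feeds directly into the RF/SVM-style classifiers discussed in Section 3.4.4; the reported accuracy gains come from the LiDAR columns separating species that are spectrally similar but structurally distinct.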

4.2. Multitemporal Methods. The review of the phenological and seasonal characteristics of tree plants in Section 2.1.1 indicates that it is possible to measure the phenological characteristics of TS with multitemporal remote sensing techniques. Given that phenological characteristics may differ among TS, it makes sense to use the multitemporal method with multitemporal remote sensing data to increase the accuracy of mapping TS. This is because tree species exhibit phenological changes that vary across species [165], and the multitemporal method simply aligns the image acquisition times with the phenological periods of the TS under investigation. Many reviewed studies have demonstrated that the multitemporal method with multiseasonal remote sensing image data significantly improves the mapping of TS and species composition (e.g., [13,34,40,118,151,160]). Previous studies using the multitemporal method to explore the potential of multiseason VHR satellite sensors were mostly limited to two-season data. For example, Li et al. [13] investigated the potential of two-season WV2 (Sept. 14, 2012) and WV3 (Oct. 18, 2014) imagery for mapping five dominant urban TS in Beijing, China. They concluded that the combination of the two-season VHR images could create a better mapping result than images from either date. When Karlson et al. [160] investigated the capability of two-season WV2 imagery in classifying five dominant TS in West Africa, their results demonstrated that the integration of two-season images produced a better result than a single-date image, with an OA of 83.4%. They also found that dry season imagery produced a better result (OA = 78.4%) than wet season imagery (OA = 68.1%). Madonsela et al. [40] likewise showed that using combined two-season images could improve TS classification compared to using individual-date images.
There are a few studies using the multitemporal method with more than two seasons' (dates') image data in mapping TS. These studies mostly focused on either selecting an optimal seasonal image or exploring the synergistic power of combinations of two, three, or all seasonal images under investigation in TS classification. For instance, Yang et al. [163] investigated the capabilities of four seasonal WV3 images, individually and in combinations, for classifying individual TS; their results indicate that the midsummer image led to the highest OA for individual TS classification, and the synergistic use of three seasonal images substantially improved the classification accuracy. To understand the seasonal effect on TS classification accuracy, Pu et al. [34] also compared and analyzed the capabilities of multiseason Pléiades image data. They concluded that (1) a late spring season (April) image significantly increased TS classification accuracies compared to all other individual-season Pléiades images (p < 0.01), and (2) combined dry-wet season images performed even better. It is worth noting that, under certain circumstances, using a two-season image combination for mapping TS may not always be better than using single-date images, and a combination with more seasonal images is not always more helpful than one with fewer. For example, when Ferreira et al. [112] assessed whether combining images acquired in the wet and dry seasons increases TS classification accuracy, they found that the average producer's accuracy obtained with the combined images was not higher than that obtained using the two seasonal images individually. When Pu et al.
[34] tested the discriminant powers of combinations of two, three, four, and five seasonal images for mapping TS, they found that combinations with more than two seasonal images did not produce a better TS mapping result than a combination of two seasonal images (dry-wet season images). This may be because the useful phenology information extracted from more than two seasonal images for mapping TS was offset by the extra data and by contradictory phenology-induced changes in biophysical and biochemical parameters among multiple seasons [34].
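A minimal sketch of the per-pixel seasonal feature stack that such multitemporal methods classify is shown below, using NDVI as the seasonal feature; the reflectance values for the two hypothetical pixels (a deciduous and an evergreen crown) are illustrative:

```python
import numpy as np

def seasonal_ndvi_stack(red_by_season, nir_by_season):
    """Build a per-pixel multitemporal feature stack: one NDVI value
    per season. Inputs are (n_seasons, n_pixels) reflectance arrays;
    the output is (n_pixels, n_seasons), i.e., one seasonal NDVI
    trajectory per pixel."""
    red = np.asarray(red_by_season, dtype=float)
    nir = np.asarray(nir_by_season, dtype=float)
    ndvi = (nir - red) / (nir + red + 1e-9)  # small epsilon avoids /0
    return ndvi.T

# Two hypothetical pixels over three seasons (spring, summer, autumn):
# column 0 is a deciduous crown, column 1 an evergreen crown
red = [[0.10, 0.08], [0.05, 0.07], [0.20, 0.08]]
nir = [[0.40, 0.45], [0.55, 0.48], [0.25, 0.46]]
traj = seasonal_ndvi_stack(red, nir)
# The deciduous pixel's NDVI rises toward summer and drops in autumn,
# while the evergreen pixel stays nearly flat; that trajectory
# difference is the phenological signal a classifier exploits.
```

Feeding the full trajectory (all seasons) rather than a single date as classifier input is exactly the "combined seasonal images" strategy the cited studies evaluated.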

4.3. Multilevel Classification System Methods. A multilevel classification system [34,166] is a hierarchical spatial organization of objects (surface cover types) in a landscape or image scene, proceeding from a larger landscape/surface cover type unit into smaller objects or component units (e.g., ITCs) at a higher classification level. The system method is also called a stepwise masking system [68,167] or a hierarchical classification system [49,168]. The method has several advantages: a logical sequential mapping process, a clear multiscale context of the targeted objects and their relationships, and control over the process within a certain level and object container [49]. The use of the system method has demonstrated its potential to increase the effectiveness of mapping TS (e.g., [49,68]). For example, in mapping 22 urban TS with HS AVIRIS imagery, Xiao et al. [68] designed a set of five masking levels to classify the urban TS: masking 1 (separating the study area into vegetated and nonvegetated areas), masking 2 (separating shrub, tree, and grass within the vegetated area), masking 3 (separating broadleaf and conifer within the tree area), masking 4 (separating evergreen and deciduous within the broadleaf area), and masking 5 (identifying TS within the conifer, evergreen, and deciduous areas, separately). At the TS level, an average accuracy of 70% for classifying the 22 TS was obtained. Kamal et al. [49] successfully utilized the object-based multilevel (scale) mangrove mapping approach to map mangrove attributes, including species communities. The effectiveness of the system method in mapping TS may be explained as follows: (1) The system matches the hierarchical structure of most surface cover classification strategies.
(2) It enhances the relative spectral differentiation among surface cover classes (e.g., TS) with similar spectral and textural characteristics. To use the system method efficiently, determining appropriate thresholds for the corresponding levels is a challenge because a trial-and-error or quantitative approach is needed [118].
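The stepwise masking logic described above can be sketched as a short chain of nested masks. The thresholds and the three per-pixel features below are purely illustrative (in a real study they would be determined per study area, e.g., by the trial-and-error or quantitative approaches just mentioned):

```python
import numpy as np

def multilevel_classify(ndvi, height, conifer_index):
    """Toy stepwise-masking chain over per-pixel features (all arrays
    the same shape; thresholds are illustrative):
      level 1: vegetated vs. nonvegetated (NDVI threshold)
      level 2: tree vs. low vegetation within the vegetated mask
      level 3: conifer vs. broadleaf within the tree mask
    Returns an integer label map: 0 nonvegetated, 1 low vegetation,
    2 broadleaf tree, 3 conifer tree.
    """
    labels = np.zeros_like(ndvi, dtype=int)
    veg = ndvi > 0.3                         # level 1 mask
    labels[veg] = 1
    tree = veg & (height > 2.0)              # level 2: within vegetation only
    labels[tree] = 2
    conifer = tree & (conifer_index > 0.5)   # level 3: within trees only
    labels[conifer] = 3
    return labels

ndvi    = np.array([0.1, 0.5, 0.6, 0.7])
height  = np.array([0.0, 0.5, 5.0, 8.0])
conifer = np.array([0.0, 0.2, 0.1, 0.8])
print(multilevel_classify(ndvi, height, conifer))  # [0 1 2 3]
```

Each level classifies only the pixels passed down by the previous mask, which is what confines the final species-level decision to spectrally similar candidates.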

5.1. Current Limitations and Constraints.
There are many factors that control and affect the success and accuracy of TS classification. They include spatial/spectral resolution, temporal resolution or seasonal effect, forest environmental setting/background, shadow and shading effects, appropriate image segmentation, spectral variation within IOs (tree crowns), and overlap among different TS [159]. The literature review indicates that current methods/techniques for accurately classifying and mapping TS under different environmental settings and backgrounds have limitations and/or constraints when one or more of these factors cannot meet its/their ideal requirement(s) (e.g., an image at better than 1 m resolution acquired in an appropriate season). As discussed in Section 3.4.1, object-based approaches can usually improve TS classification with VHR image data compared to pixel-based approaches [74,150,152]. However, many issues related to object-based approaches still need to be addressed for TS mapping decisions: (a) How do we ensure an appropriate image segmentation? (b) How do we deal with a multicrown (overlapping) species image object (IO)? (c) How do we map subcanopy TS (i.e., young or small trees below the main canopy)? For (a), to create an appropriate image segmentation for mapping TS, accurately selecting the scale parameter and the subset of input image bands (optical sensor images and/or LiDAR data) is important; a stepwise approximation method may be adopted to determine the optimal scale value and subset of image bands based on reference ITCs (IOs) [118]. For (b), a super-resolution mapping approach may be considered to determine the spatial distribution of multiple crowns/species within an IO [18]. For (c), full-waveform LiDAR data may be used to aid in identifying and mapping subcanopy TS [25].
When studies on TS classification with a single-sensor image (either optical or LiDAR sensor data) discussed their limitations or constraints, they often mentioned spatial, spectral, and/or temporal resolutions that were not high enough. For example, when Alonzo et al. [15] discriminated among 15 urban TS using single HS AVIRIS imagery acquired over Santa Barbara, CA, USA, they excluded palm tree species because their crown size is smaller than the image pixel size (3.7 m). In assessing individual species classification accuracies, they found that the highest accuracies were for species with large and densely foliaged crowns, while the lowest accuracies were for species with the smallest crown areas, suggesting that the 3.7 m spatial resolution (despite the high spectral resolution) is not sufficient for species with small crowns. Pu and Landry [118] found that it is difficult or impossible to accurately classify three oak TS with the current WV2 spectral resolution. To overcome these limitations (relatively low spatial and spectral resolutions of a single-sensor image) for accurately identifying TS with small crown sizes and similar spectral properties, future work may consider using more sensors' data, such as adding VHR MS imagery and/or LiDAR data to Alonzo et al.'s [15] project and an HS sensor's data to the work in [118].
One of the most common research objectives among studies on TS classification was exploring and assessing the potential of a new sensor's data or a newly developed method/algorithm in a single, relatively small test site (area). Assessment results and findings derived from such a single small test site may be unreliable and biased, for two reasons: (1) A single small test area cannot fully reflect and account for the large variation of ecological conditions and environmental settings that the assessed new sensor data or the tested method/algorithm need(s) to account for. (2) The reference data collected from such a single small test site might be biased toward some favorable conditions (e.g., leading to a narrow spectral variation within individual TS); consequently, results assessed or methods calibrated and validated with the biased reference data are probably biased as well. To optimize a newly developed method/algorithm or accurately assess a new sensor's data, future studies should consider more than one test site or choose a large test area, in either case ensuring a full variation of ecological conditions and environmental settings. In addition, if limited by certain conditions, e.g., having to make do with the only small test site available, it is suggested that comparable datasets already existing in the literature be used to cross-check the reliability and accuracy of the assessed results and the tested method/algorithm against the data collected from the small test site.
When discussing the advantages of the multitemporal method for mapping TS with multiseasonal images, most studies simply chose one optimal seasonal image or a combination of a subset of the seasonal images under investigation, not fully utilizing all individual season/date images. For example, Yang et al. [163] chose an optimal seasonal image and a combination of three seasonal images selected from all four seasonal images to classify individual TS. Ferreira et al. [112] found that a combined use of wet and dry seasonal images did not improve species classification accuracy compared to using the two seasonal images separately. To fully and efficiently use the phenological change information provided by all individual season/date images under investigation, a seasonal trajectory difference index (STDI), recommended by [19], may be adopted. The index integrates all seasonal images' contributions and, as an additional input feature, can potentially further improve TS identification and mapping. More indices or measures that integrate the full phenological change information available from all seasonal images may be considered and developed in future work.
5.2. Discussion on Future Directions. Although remote sensing has advanced in recent years in improving both spatial and spectral resolution, given the relatively small crown sizes of some TS (e.g., palm trees) and the similar spectral properties among others (e.g., evergreen oak species), the spatial and spectral resolutions of current sensors' images may not be high enough to discriminate among those species. Therefore, further advancing remote sensing sensors' spatial/spectral resolutions for accurately mapping TS is necessary. In addition, to improve TS classification and mapping with existing optical and LiDAR sensors' data, efforts can be made to improve and enhance the three "multiple" methods (i.e., multisensor, multitemporal, and multilevel classification system) discussed in Section 4 and the data fusion techniques/methods for integrating multisensor data discussed in Section 3.
Given the explicit advantages of the three "multiple" methods summarized in Table 6, these methods are truly beneficial for improving TS classification with various sensors' data. Future work may focus on overcoming their limitations and handling the control factors summarized in the table by improving processing chains and relevant algorithms. For example, for the multisensor methods, logically and efficiently improving processing chains for automatically processing multisensor (optical sensors, and optical sensor plus LiDAR) data to increase the effectiveness of mapping TS is an interesting direction. It may emphasize developing automatic processing algorithms or tools for resampling and registration between the multisensor data and, more importantly, for radiometric pixel value normalization between different sensors' data due to the difference in image acquisition times. For the multitemporal methods, to efficiently utilize the phenological differences among TS from multiseasonal images for improving TS classification, accurately normalizing background contributions (e.g., soil and atmosphere) to tree canopy spectra across the multiseasonal images is a key step of the method's processing chain. In this regard, how to fully utilize the phenological information from all of the seasonal data under investigation is a future direction, especially in mapping urban TS (e.g., developing indices that integrate all seasonal data). To speed up processing chains for the multilevel classification system, defining and determining thresholds at the different processing levels is key to successful application of the method; a qualified reference data set for defining and quantitatively determining the thresholds, together with some statistical tools, is required for this issue.
To efficiently use the "multiple" methods for mapping TS with various sensors' data, the key step of the processing chains is data fusion of different sensors' or different seasonal data. Several mature techniques/algorithms can be used to fuse the same sensor's data (e.g., a pansharpening technique for fusing the eight MS bands and one Pan band of WV2 imagery), different sensors' data (e.g., six MS TM bands with one SPOT Pan band), and spatiotemporal data via spatiotemporal data fusion (STDF) methods (e.g., the STARFM model, which fuses Landsat data with MODIS data). However, current data fusion methods for different sensors' image data are often just a simple addition (combination or integration) of the different sensors' data, rather than data truly fused into a resultant image as in a pansharpening process. As for the STDF methods, they currently create moderate spatial resolution (e.g., Landsat-like) data series, which cannot be directly used for improving TS classification. Therefore, developing new techniques or algorithms, including STDF, to fuse different sensors' images (e.g., MS TM bands fused with WV2 MS bands, and WV2 MS bands with LiDAR CHM data) should be considered a future direction. More research on using various HS sensors' data to increase TS mapping accuracy, especially for discriminating among species with similar crown spectral properties, will be an important direction. Besides commonly used airborne HS sensors (e.g., AVIRIS and CASI), some satellite HS sensors/systems (e.g., the HyspIRI system, with operation scheduled in 2023) with high spectral and temporal resolution but low spatial resolution, and flexible UAV-based HRS systems with very high spectral/spatial resolutions, have the potential to map TS or estimate TS abundance more correctly and accurately.
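The pansharpening process mentioned above, in which bands are truly fused into one resultant image rather than merely stacked, can be illustrated with a Brovey-style transform. This is a minimal sketch on synthetic arrays (the data and the choice of Brovey weighting are illustrative, not the method of any cited study):

```python
import numpy as np

def brovey_pansharpen(ms, pan):
    """Brovey-style pansharpening sketch: each (already upsampled) MS
    band is rescaled by the ratio of the Pan band to the MS intensity,
    injecting the Pan band's spatial detail into every MS band, so the
    result is a genuinely fused image rather than a stack of bands.
    ms:  (n_bands, H, W) multispectral image upsampled to Pan geometry
    pan: (H, W) panchromatic band
    """
    intensity = ms.mean(axis=0)            # simple MS intensity proxy
    return ms * (pan / (intensity + 1e-9)) # per-pixel detail injection

# Synthetic 4-band MS patch and a hypothetical Pan band
ms = np.random.default_rng(0).uniform(0.1, 0.5, size=(4, 8, 8))
pan = ms.mean(axis=0) * 1.1
sharp = brovey_pansharpen(ms, pan)
print(sharp.shape)  # (4, 8, 8)
```

Extending this kind of per-pixel fusion to cross-sensor pairs (e.g., MS bands with a LiDAR CHM) is precisely the open problem the text identifies, since the two sources do not share a simple intensity relationship.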
To efficiently use the richer and finer spectral information of satellite HS data for mapping TS, spectral mixture analysis (SMA) or multiendmember SMA to automatically estimate individual TS abundance within pixels remains an important information extraction task. Therefore, developing effective spectral unmixing algorithms, especially HS image processing chains that automatically extract and map TS spectral information from satellite HS data, continues to be a research topic. Given the unique characteristics (low cost, agility, and autonomy) of UAV-based remote sensing systems, future work using UAV-based sensors' data (MS, HS, and LiDAR sensors) may focus on (1) calibrating and verifying methods/algorithms for TS classification and (2) directly utilizing the UAV-based sensors' data to identify and map TS.

Conclusions and Remarks
In this study, a total of 231 publications on tree species (TS) classification and mapping using various remote sensing sensors' images were reviewed. An overview of applications of various sensors' data to TS classification was conducted, with particular attention to studies using VHR satellite multispectral data, airborne hyperspectral data, UAV-based multi-/hyperspectral data, and airborne LiDAR sensors' data. From the review, several conclusions with remarks and recommendations are summarized as follows.
(1) During the last two decades, most application studies on TS classification used VHR satellite MS data, airborne HS data, and LiDAR data. This coincided with the advent of advanced remote sensing technologies (VHR satellite MS and airborne HS sensors/systems emerged in the late 1990s; although the LiDAR technique started in the 1960s, its remote sensing application to TS mapping became popular mostly after 2000), and the high spatial/spectral resolutions and vertical/geometric features these data provide are well suited to TS classification and mapping.

(2) Given the explicit advantages summarized in Table 6, the three categories of "multiple" methods, by efficiently integrating multiple spatial/spectral and temporal/seasonal data and vertical/geometric features within a reasonable and logical classification system structure matching different scales of landscape units and surface cover units/patches, provide the potential to improve TS classification using multisensor and multisource data.

(3) After 2010, the most popular methods for classifying TS with features extracted and selected from different sensors' data have been machine learning techniques, including deep learning models (e.g., 3D-CNN). Such advanced methods/models usually produce higher classification and mapping accuracies than traditional methods/techniques.

(4) Recently, the application and development of UAV-based remote sensing sensors/systems (MS, HS, and point cloud sensors) for identifying and mapping TS have attracted the attention of researchers and practitioners. Considering the low cost, agility, and autonomy of UAV-based systems, UAV-based remote sensing data have been used either as reference data for research or directly for mapping and classifying TS.

(5) Three limitations were identified and suggested to be overcome in future work: (a) using a single sensor's data, limited by spatial/spectral and sometimes temporal resolution; (b) using a single small test site (area), whose unrepresentative ecological conditions and environmental settings, image data, and reference data can yield unreliable and biased results and findings; and (c) not fully using the phenological change information from all seasonal images under study. Future work may need additional higher spatial/spectral resolution and multiseasonal image data and should use more test sites or a large enough study area; otherwise, the assessed results and derived findings need to be verified against comparable studies in the literature, and indices or metrics should be developed to integrate all seasonal images under investigation.

(6) Three future directions are recommended: (a) refine the three types of "multiple" methods for TS classification and mapping by overcoming the limitations and handling the control factors summarized in Table 6; (b) develop novel data fusion algorithms or processing chains for different sensors' or different seasonal data; and (c) develop new spectral unmixing algorithms, especially HS image processing chains, to automatically extract and map TS spectral information from satellite HS data. For these directions, future work may focus on: for (a), developing efficient and automatic image processing chains or algorithms (resampling, registration, and radiometric normalization) and speeding up processing chains for the multilevel classification system by defining and determining thresholds for the different process levels; for (b), developing novel techniques or algorithms to fuse different sensors' images, similar to a pansharpening algorithm; and for (c), developing new spectral unmixing algorithms, including nonlinear algorithms, to efficiently unmix satellite HS data.