Confidence Measure of the Shallow-Water Bathymetry Map Obtained through the Fusion of Lidar and Multiband Image Data

With the advancement of Lidar technology, the bottom depth (H) of optically shallow waters (OSW) can be measured accurately with an airborne or space-borne Lidar system (H_Lidar hereafter), but this data product comes in a line format rather than the desired charts or maps, particularly when the Lidar system is on a satellite. Meanwhile, radiometric measurements from multiband imagers can also be used to infer H (H_imager hereafter) of OSW with variable accuracy, though a map of bottom depth can be obtained. It is logical and advantageous to use the two data sources from collocated measurements to generate a more accurate bathymetry map of OSW, for which image-specific empirical algorithms are usually developed and applied. Here, after an overview of both the empirical and semianalytical algorithms for the estimation of H from multiband imagers, we emphasize that the uncertainty of H_imager varies spatially, although it is straightforward to draw regressions between H_Lidar and radiometric data for the generation of H_imager. Further, we present a prototype system to map the confidence of H_imager pixel-wise, which has been lacking until today in the practice of passive remote sensing of bathymetry. We advocate the generation of a confidence measure in parallel with H_imager, which is important and urgent for broad user communities.


Introduction
The ocean depth is a geophysical property that has puzzled humans for thousands of years. The answer not only satisfies curiosity but is also important for many aspects of human activities and scientific studies, including navigation, ecosystem management, sustainable economic development, and ocean dynamic modeling [1]. To determine ocean bathymetry, the ancient Greeks (circa 80 BCE) used the "line sounding" method and obtained depth measurements up to ~2000 m in the Mediterranean Sea, while James Clark Ross obtained a depth of 4893 m in the 1840s. All these measurements were based on ship surveys; hence, it is unsurprising that the Challenger expedition (1872-1876), an oceanic voyage explicitly targeting ocean bathymetry, obtained just 492 soundings of the Atlantic Ocean. Only after the invention of Sonar during World War II were great advances made in ocean bathymetry, or bottom topography. Still, because the ocean bottom is covered by a thick layer of water, which is generally opaque to electromagnetic radiation, we know much less about the seafloor than we do about the surface of the Moon or Mars [2].
Since the launch and operation of satellites, our capability to observe and measure ocean bathymetry has significantly improved, where sea surface altimetry has been successfully used to indirectly infer the bathymetry [3,4]. However, this approach cannot resolve small-scale variations and can only detect large seamounts that alter the Earth's gravitational field and subsequently the sea surface altimetry. One direct and precise measurement of bathymetry from airborne or space-borne platforms is Lidar (light detection and ranging) [5], which uses the time lapse between the emission and receipt of photons interacting with the bottom (or a target) to calculate the distance photons traveled. This time-based technique can be used to accurately calculate the bottom depth of clear oceanic waters up to about 40 m at present [6,7]. We commonly term this technique active remote sensing of bottom depth and here use H_Lidar to represent the product obtained (see Table 1 for major symbols and acronyms used in this article). Although Lidar is only feasible for relatively shallow and clear waters, due to the significance of such regions, many airborne Lidar systems have been specifically developed for bathymetry [5,8]. One extremely exciting and valuable development is the ICESat-2 satellite system [9], which sends out a laser at 532 nm, has a vertical resolution of about 0.17 m for bathymetry, and can potentially provide H_Lidar for various nearshore regions of the world [7]. This Lidar system, due to its space-borne nature, obtains measurements as mesh-style points (in the various lines dictated by the Lidar system and satellite orbit), not the desired bathymetry map.
A completely different approach to the optical measurement of bathymetry is based on radiative transfer, where a shallow bottom affects the radiance emerging from below the sea surface. A quick and simple example is the shallow vs. deep ends of a swimming pool, which appear as different colors to humans. Algorithms were then developed in the 1970s to use radiometric data from multiband imagers to estimate the bottom depth of shallow clear waters [10-12]. This image-based remote sensing of the bottom depth (H_imager hereafter) is commonly termed passive remote sensing. Although such estimates of depth are not precise, a significant advantage is that a map of H_imager can be produced, especially with the launch and operation of more advanced sensors [1]. In recent decades, with an improved understanding of radiative transfer in optically shallow waters (OSW), more sophisticated algorithms based on radiative transfer were developed [13-19], resulting in the creation of more bathymetry maps from imagers [19-22]. For those algorithms based on the physics of radiative transfer, a priori depths are not required for the development of the algorithm, and H_imager can be produced as long as the input reflectance spectrum is highly reliable and has an adequate spectral resolution. However, since both the water and bottom properties affect the reflectance spectrum, there are still various uncertainties in the H_imager product derived from the reflectance spectrum.
In view of the availability of concurrent or collocated highly accurate satellite-based H_Lidar and high-spatial-resolution multiband imagers, it is logical to develop schemes to generate bathymetry maps of OSW through the fusion of the two data sources [23,24]. Figure 1 shows an example of measurements captured by ICESat-2 (red dashed line) and the Landsat-8 Operational Land Imager (OLI) over the Great Bahama Bank, where it is desired to expand the ICESat-2 bathymetry to the entire shallow regions covered by the Landsat-8 acquisition. As demonstrated in many studies, when collocated measurements of both H and radiometric properties are available, the generation of H_imager through explicit empirical regressions [25-28] or neural networks [29-31] becomes straightforward. However, these H_imager products [25,32] lack a representation of the confidence or quality of the product at each pixel, although schemes to estimate the impact of radiometric noise on H_imager have been developed [21,33]. Usually, an averaged root mean square error or mean relative error of an algorithm is provided, but such measures of error represent the performance over a data pool and do not imply the same error or confidence at every pixel or location [34]. After the first demonstration of passive remote sensing of bottom depth [10,12], what limited the broad-scale application of the H_imager product was not a lack of algorithms but rather the lack of a pixel-wise confidence measure for such products. Brando et al. [35] suggested using the closure between the measured and modeled remote sensing reflectance to infer the quality of H_imager. However, as this closure is an index for numerical solutions of a complex remote sensing function (see Section 2.3), a good closure does not necessarily indicate a high confidence of the derived H_imager.
At present, the ability to generate a pixel-wise measure of the quality or confidence of H_imager through theoretical modeling is lacking, despite the necessity of such a measure, as the uncertainty varies spatially even within clear waters. In this work, after a review and demonstration of traditional passive schemes for bathymetry, we provide an original and novel prototype system to objectively classify the confidence of H_imager. We advocate the generation of such confidence products in parallel to remotely sensed bathymetry, where such a confidence measure would be essential to promote the use of H_imager by the broader community.

Overview of Passive Remote Sensing Algorithms for Shallow-Water Bathymetry
Detailed descriptions of the derivation of H from a Lidar system can be found in Guenther [5], where the key is to obtain precise measurements of the time lapses of photons reflected by the sea bottom. For a water body with a shallow bottom, the remote sensing reflectance (R_rs, in sr^-1), the ratio of the radiance (L_w) emerging from below the surface to the downwelling irradiance just above the surface, can be expressed as [36-38]

R_rs = R_rs^dp [1 - exp(-(K_d + K_u^C) H)] + (ρ/π) exp(-(K_d + K_u^B) H)    (1)

Here, R_rs^dp is the remote sensing reflectance of the same water body but with no impact from the bottom (i.e., optically deep); ρ is the bottom reflectance modified by air-sea transmittance; K_d is the diffuse attenuation coefficient of downwelling irradiance, with K_u^C and K_u^B the diffuse attenuation coefficients of upwelling photons generated by scattering in the water column and by bottom reflection, respectively. R_rs^dp, K_u^C, and K_u^B can be parameterized with the water inherent optical properties [13,36,39]. Thus, while R_rs depends on the water optical properties and bottom reflectance, it is also a function of the bottom depth (H). Hence, various passive remote sensing algorithms have been developed to retrieve H from this spectral signal [11,13,21,40]. These algorithms can be grouped into three approaches, two of which belong to the empirically based approach (EBA), with the third classed as a semianalytical approach (SAA). The following briefly describes the essence of these three schemes.
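As a numerical illustration of Equation (1), the following minimal sketch computes shallow-water R_rs as a function of H; the coefficient values are arbitrary placeholders, not from any cited parameterization:

```python
import numpy as np

def rrs_shallow(H, rrs_dp, rho, Kd, KuC, KuB):
    """Shallow-water remote sensing reflectance following Equation (1).

    H      : bottom depth (m)
    rrs_dp : deep-water remote sensing reflectance (sr^-1)
    rho    : bottom reflectance (modified by air-sea transmittance)
    Kd     : diffuse attenuation of downwelling irradiance (m^-1)
    KuC    : attenuation of upwelling photons from water-column scattering (m^-1)
    KuB    : attenuation of upwelling photons from bottom reflection (m^-1)
    """
    water_column = rrs_dp * (1.0 - np.exp(-(Kd + KuC) * H))
    bottom = (rho / np.pi) * np.exp(-(Kd + KuB) * H)
    return water_column + bottom

# Over a bright bottom, R_rs decreases with H and converges to rrs_dp
depths = np.array([1.0, 5.0, 20.0, 100.0])
rrs = rrs_shallow(depths, rrs_dp=0.004, rho=0.2, Kd=0.1, KuC=0.12, KuB=0.11)
```

This reproduces the behavior used by all the algorithms below: the bottom signal decays exponentially with depth, so the depth information in R_rs fades as H grows.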

Explicit Empirical Approach (EEA)
2.1.1. Multiband Value Algorithm (MBVA). With collocated measurements of bottom depth and multiband radiometric data, Polcyn et al. [10] proposed the first empirical algorithm for H based on the difference between the shallow- and deep-water L_w, with the algorithm further refined by Lyzenga [12]. Generally, H_imager can be written as

H_imager = a_0 + a_1 Y    (2)

Here, Y is the logarithm of (R_rs - R_rs^dp) or (L_w - L_w^dp) and is calculated for a specific spectral band, while a_0 and a_1 are empirical coefficients tuned using collocated R_rs (or L_w) and H. This algorithm can be improved with the use of additional bands:

H_imager = a_0 + Σ_{i=1}^{N} a_i Y_i    (3)

N is the number of bands that are available and feasible from an imager; thus, there are N + 1 algorithm coefficients (a_0 to a_N) to be tuned. Since Equations (2) and (3) are empirical, there is a potential that R_rs is smaller than R_rs^dp, e.g., over dark seagrass regions, which then causes an invalid mathematical calculation for Y; the formula could thus be modified as [41]

H_imager = a_0 + Σ_{i=1}^{N} a_i R_rs(λ_i)    (4)

Since bottom depth is related directly to the value of R_rs, we term this empirical model for H_imager a multiband value approach (MBVA). Comparing Equation (4) with Equations (1)-(3), they are essentially the same, except that the Y (a difference between the shallow and deep regions in an image) in Equations (2) and (3) is replaced by R_rs in Equation (4).
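Tuning the MBVA coefficients amounts to a linear least-squares fit of depth against multiband R_rs. The sketch below (synthetic data, not the authors' processing code) illustrates the fit for the Equation (4) form:

```python
import numpy as np

def fit_mbva(rrs, h_lidar):
    """Tune MBVA coefficients a_0..a_N (Equation (4) form) by least squares.

    rrs     : (M, N) array, R_rs at N bands for M matchup pixels
    h_lidar : (M,) array, collocated Lidar depths (m)
    """
    M = rrs.shape[0]
    design = np.column_stack([np.ones(M), rrs])   # [1, Rrs(λ1)..Rrs(λN)]
    coeffs, *_ = np.linalg.lstsq(design, h_lidar, rcond=None)
    return coeffs                                  # a_0, a_1, ..., a_N

def predict_mbva(coeffs, rrs):
    return coeffs[0] + rrs @ coeffs[1:]

# synthetic matchups: depth constructed as an exact linear function of 4-band Rrs
rng = np.random.default_rng(0)
rrs = rng.uniform(0.001, 0.02, size=(200, 4))
true_a = np.array([1.5, -120.0, 300.0, -80.0, 40.0])
h = true_a[0] + rrs @ true_a[1:]
a = fit_mbva(rrs, h)
h_hat = predict_mbva(a, rrs)
```

With real matchups the fit is of course not exact; the residual scatter is precisely the pixel-to-pixel uncertainty discussed later.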
2.1.2. Two-Band Ratio Algorithm (TBRA). Also recognizing that the difference between R_rs and R_rs^dp could be negative, a ratio-based algorithm was proposed [25]:

H_imager = m_1 Z + m_0, with Z = ln(n R_rs(λ_1)) / ln(n R_rs(λ_2))    (5)

The three model coefficients (n, m_0, m_1) are also tuned using pairs of collocated R_rs and H. The value of n ranges from 500 to 1500 and is usually fixed at 1000 [25,42], while Traganos and Reinartz [34] indicated that a value of n = 1.0 works fine for a seagrass environment. As demonstrated in many studies [24,25,42], maps of bathymetry can be generated following this scheme. It is also possible to use the logarithm of the R_rs ratio of two bands for the empirical estimation of H [43]. However, Traganos et al. [41] found that its performance is worse than that of the formulation given by Equation (5); hence, its discussion is omitted here. Because this approach employs a ratio of R_rs at two bands, we term this scheme a two-band ratio algorithm (TBRA), although Equation (5) can be expanded to include more available bands. The algorithms following Equations (2)-(5) are data-based (empirical), where the algorithm relationships and coefficients are explicit, and the coefficients are driven by pairs of known H and R_rs. Note that if R_rs from more bands is required, higher demands are placed on atmospheric correction, especially at the longer wavelengths over optically shallow regions, where at present the OLI R_rs value in the red band is sometimes invalid.
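A minimal sketch of tuning m_0 and m_1 of Equation (5) with n fixed, again on synthetic matchups rather than the authors' data:

```python
import numpy as np

def tbra_z(rrs1, rrs2, n=1000.0):
    """Two-band ratio predictor Z of Equation (5)."""
    return np.log(n * rrs1) / np.log(n * rrs2)

def fit_tbra(rrs1, rrs2, h_lidar, n=1000.0):
    """Tune m_0 and m_1 of Equation (5) by linear regression on Z."""
    z = tbra_z(rrs1, rrs2, n)
    m1, m0 = np.polyfit(z, h_lidar, 1)   # slope m_1, intercept m_0
    return m0, m1

# synthetic matchups for two bands (e.g., OLI Band 2 and Band 3)
rng = np.random.default_rng(1)
rrs_b2 = rng.uniform(0.005, 0.03, 300)
rrs_b3 = rng.uniform(0.004, 0.02, 300)
h = 12.0 * tbra_z(rrs_b2, rrs_b3)        # depths consistent with m1 = 12, m0 = 0
m0, m1 = fit_tbra(rrs_b2, rrs_b3, h)
```

In practice n is usually held at 1000 and only m_0 and m_1 are regressed, which is what the sketch does.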

Implicit Empirical Approach (IEA)
2.2.1. Estimation from R_rs. Different from the explicit empirical algorithms shown above, the machine learning approach (MLA, which in this manuscript collectively stands for neural networks, machine learning, and deep learning) is another data-based approach for the estimation of H from remote sensing measurements [29,30,44,45]. Unlike EEA, the algorithm dependences or relationships and coefficients of an MLA are hidden in the computer programming architecture (various layers and neurons), so it is not obvious how H_imager varies with R_rs or spectral radiance. As there are no explicit equations or parameters for such an approach, conceptually the algorithm can be expressed as

H_imager = MLA_Rrs(R_rs(λ_1), R_rs(λ_2), ..., R_rs(λ_N))    (7)

Here, λ_i is an available and feasible band for the estimation of H (e.g., usually Bands 1-4 for Landsat-8 OLI), although there are no specific restrictions on the R_rs that can be used.

2.2.2. Estimation from Top-of-Atmosphere Reflectance. Given that machine learning is empirical, another way to utilize an MLA for H estimation is to bypass atmospheric correction [46], thereby estimating H and/or water properties directly from the top-of-atmosphere reflectance (ρ_toa) [47]. Like Equation (7), this scheme can conceptually be defined as

H_imager = MLA_ρtoa(ρ_toa(λ_1), ρ_toa(λ_2), ..., ρ_toa(λ_N))    (8)

To implicitly account for the contribution of the atmosphere to ρ_toa, the range of λ_i extends from the visible to the NIR-SWIR bands (e.g., Bands 1-7 for Landsat-8 OLI). Similar to the algorithms in Section 2.1, when sufficient pairs of H and ρ_toa are available, an MLA_ρtoa can be developed for the remote sensing of H_imager from ρ_toa. While an EEA like Equation (5) could be developed with 5-10 data points, an MLA requires much more data (usually hundreds of points or more) in the training phase. In addition, an MLA is much more complex in its computer architecture than the simple mathematical formulations presented in Equations (2)-(5).
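To make the MLA concept concrete, the toy sketch below trains a 1-hidden-layer, 5-neuron network by plain gradient descent to map a 4-band R_rs spectrum to depth. The data are synthetic and the architecture choices are illustrative, not the manuscript's trained MLA:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic "matchups": depth as a nonlinear function of 4-band Rrs
X = rng.uniform(0.001, 0.02, size=(400, 4))
Xs = (X - X.mean(axis=0)) / X.std(axis=0)          # standardize the inputs
h = 25.0 * np.exp(-120.0 * X[:, 1]) + 200.0 * X[:, 2]

# One hidden layer with 5 neurons, echoing the MLA_Rrs configuration in Section 4
W1 = rng.normal(0.0, 0.5, (4, 5)); b1 = np.zeros(5)
W2 = rng.normal(0.0, 0.5, (5, 1)); b2 = np.zeros(1)

losses = []
lr = 0.05
for _ in range(2000):                              # plain batch gradient descent
    hid = np.tanh(Xs @ W1 + b1)                    # hidden-layer activations
    pred = (hid @ W2 + b2).ravel()                 # predicted depth
    err = pred - h
    losses.append(float(np.mean(err ** 2)))        # mean squared error
    g_out = 2.0 * err[:, None] / len(h)            # dLoss/dOutput
    gW2 = hid.T @ g_out; gb2 = g_out.sum(axis=0)
    g_hid = (g_out @ W2.T) * (1.0 - hid ** 2)      # backprop through tanh
    gW1 = Xs.T @ g_hid; gb1 = g_hid.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2
```

The point of the sketch is structural: the "coefficients" of an MLA are the weight matrices, and their relationship to R_rs is not directly interpretable, unlike a_0-a_N or m_0-m_1 above.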

Semianalytical Approach (SAA).
A completely different set of algorithms for H is based on the radiative transfer equation. After parameterizing the spectra of the inherent optical properties (IOPs) and the bottom reflectance in Equation (1), an R_rs spectrum of shallow water can be simplified to a function of five variables and expressed as [13]

R_rs(λ) = F(P, G, X, B, H)    (9)

Here, P and G represent the absorption coefficients of phytoplankton (a_ph) and detritus-gelbstoff (a_dg), X is the particle backscattering coefficient (b_bp), and B is the bottom reflectance, all set at a reference wavelength, such as 440 nm. The five variables can be solved numerically through spectral optimization (or minimization) by minimizing a cost function computed between the measured and modeled R_rs spectra, defined as

err = sqrt( Σ_λ (R_rs^mea(λ) - R_rs^mod(λ))² ) / Σ_λ R_rs^mea(λ)    (10)

Provided there is a sufficient number of spectral bands and the R_rs spectrum is of high quality, H_imager can be generated from image-based spectrometers without a priori data pairs of H and R_rs [35,48,49]. In addition, the bottom substrate class and water optical properties can also be generated from this process [14,15,19]. The SAA is extremely valuable for measurements that have R_rs only, but the retrievals depend on the quality of the R_rs spectrum and the number of spectral bands [21,33,35,50]. For multiband imagers that have a limited number of spectral bands in the visible domain, such as Landsat-8 OLI and Sentinel-2 MSI, modifications of the SAA variables and processing are necessary [51]. Additionally, its computational load is significantly greater than that of an EBA because an SAA solves 4-5 variables simultaneously for a given R_rs spectrum; fortunately, this demand can be met with greatly improved computer technology.
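A toy illustration of the spectral-optimization step follows. This is not the HOPE implementation: the forward model and coefficient values are simplified stand-ins, and only H is scanned while the other variables are held fixed:

```python
import numpy as np

WAVES = np.array([443.0, 482.0, 561.0, 655.0])   # OLI-like visible bands

def forward_rrs(H, a, bb, rho=0.2):
    """Toy shallow-water forward model in the spirit of Equation (1)."""
    rrs_dp = 0.05 * bb / (a + bb)                 # crude deep-water reflectance
    Kd = a + bb                                    # crude attenuation stand-in
    return rrs_dp * (1 - np.exp(-2 * Kd * H)) + (rho / np.pi) * np.exp(-2 * Kd * H)

def cost(rrs_mea, rrs_mod):
    """Cost function of Equation (10)."""
    return np.sqrt(np.sum((rrs_mea - rrs_mod) ** 2)) / np.sum(rrs_mea)

# spectra of a and bb at the four bands (arbitrary but plausible magnitudes)
a = np.array([0.06, 0.05, 0.08, 0.40])
bb = np.array([0.010, 0.009, 0.007, 0.005])

rrs_mea = forward_rrs(3.5, a, bb)                 # "measured" spectrum, true H = 3.5 m

# 1-D spectral optimization: scan candidate depths and keep the best fit
candidates = np.arange(0.5, 20.0, 0.05)
errs = [cost(rrs_mea, forward_rrs(H, a, bb)) for H in candidates]
h_best = candidates[int(np.argmin(errs))]
```

A real SAA minimizes the same kind of cost over all of P, G, X, B, and H with a nonlinear optimizer, which is the source of its higher computational load.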

Data and Processing
Landsat-8 OLI and ICESat-2 measurements are used here to demonstrate H_imager maps generated from collocated Lidar data and multiband imagery, as well as the generation of a pixel-wise H_imager confidence score.
3.1. Landsat-8 Data. Landsat-8 is an extension of the earlier Landsat series [52] and was launched on February 11, 2013. Its OLI has seven bands in the ~440-2200 nm domain to take measurements of the Earth-atmosphere system. In particular, the spatial resolution of Landsat-8 OLI is 30 m, which provides detailed features of coastal regions where bottom depth can be highly heterogeneous. The Landsat-8 OLI Level-1 data processed by the Level-1 Product Generation System (LPGS) can be downloaded from the USGS website (https://glovis.usgs.gov/). The ρ_toa image data were processed using the SeaDAS package (v7.5) [53], and the atmospheric correction algorithm proposed by Bailey et al. [54] was adopted for the generation of R_rs. A low threshold of 0.0003 sr^-1 is used for R_rs(665) if the obtained R_rs(665) is negative.
Here, a few images over Florida Bay (24.…) were used in this work.

3.2. ICESat-2 Data. ICESat-2 was launched on September 15, 2018. It is a follow-on mission to the Ice, Cloud and land Elevation Satellite (ICESat) and provides global altimetry and atmospheric measurements with an emphasis on surface elevation changes in the polar regions [9]. The sole instrument onboard ICESat-2 is the Advanced Topographic Laser Altimeter System (ATLAS), a green-wavelength (532 nm), photon-counting laser altimeter with a 10 kHz pulse repetition rate [9,55]. ATLAS uses photomultiplier tubes (PMTs) as detectors in the photon-counting mode, so that a single photon reflected back to the receiver triggers a detection within the ICESat-2 data acquisition system. This single-photon-sensitive detection technique used by ATLAS to measure photon time of flight provides the very high vertical resolution required to detect small temporal changes in polar ice elevations [56,57], as well as the bottom depth of optically shallow waters [7].

3.3. Data Matchup and Statistical Measures. Matchup datasets between the Landsat-8 OLI and ICESat-2 measurements were organized with the following processing steps: the Landsat-8 OLI pixels of dense clouds were first discriminated and removed based on a threshold of the Rayleigh reflectance at the SWIR band (1238 nm) [58]. Meanwhile, pixels of low-quality R_rs were removed based on the standard Level-2 quality flags included in SeaDAS, which include ATMFAIL (atmospheric correction failure), LAND (land pixel), CLDICE (probable cloud or ice contamination), HILT (very high or saturated observed radiance), and HIGLINT (strong sun glint contamination).
The ICESat-2 bathymetry results presented in this work use geolocated photon data, contained in the ATL03 data product, which are segmented into granules that span about 1/14th of an orbit [59]. Both OLI R rs and ATL03 photon products include latitude and longitude information within the WGS-84 coordinate reference system.
Considering that the variation of H (after tidal correction) is negligible within a short period, the time constraint for "concurrent" Landsat-8 OLI and ICESat-2 data is set as ±2 weeks, and the ICESat-2 H product is adjusted to match the tidal stage of the Landsat-8 acquisition, where the classical tidal harmonic analysis model T_TIDE was applied to calculate the tide at the locations of interest [60]. To match ICESat-2 H with OLI R_rs, we first located the ICESat-2 track within the OLI image. For an ICESat-2 data point, the closest Landsat-8 pixel was selected. Since the along-track spatial resolution of ICESat-2 is 0.7 m while the footprint of OLI is 30 m, there are many ICESat-2 measurements within an OLI pixel. Therefore, for each such OLI pixel, all ICESat-2 points within a radius of 15 m are used to calculate a mean H value, which is taken as the matchup H_Lidar for that OLI R_rs.
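The 15 m radius averaging step can be sketched as follows (projected coordinates in meters are assumed; the geolocation and tidal steps are omitted):

```python
import numpy as np

def matchup_depth(pixel_xy, lidar_xy, lidar_h, radius=15.0):
    """Mean Lidar depth of all ICESat-2 points within `radius` m of an OLI pixel center.

    pixel_xy : (2,) projected pixel-center coordinates (m)
    lidar_xy : (K, 2) projected ICESat-2 photon coordinates (m)
    lidar_h  : (K,) ICESat-2 depths (m)
    Returns the matchup H_Lidar, or None if no point falls within the radius.
    """
    dist = np.hypot(lidar_xy[:, 0] - pixel_xy[0], lidar_xy[:, 1] - pixel_xy[1])
    inside = dist <= radius
    return float(lidar_h[inside].mean()) if inside.any() else None

# toy example: three points inside the radius, one outside (40 m away)
pixel = np.array([0.0, 0.0])
pts = np.array([[1.0, 2.0], [-3.0, 4.0], [10.0, 5.0], [40.0, 0.0]])
depths = np.array([5.0, 6.0, 7.0, 30.0])
h_match = matchup_depth(pixel, pts, depths)   # mean of 5, 6, and 7 m
```

Averaging the dense along-track photons inside each 30 m footprint suppresses photon-level noise before the regression step.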
To measure the deviation or error of the H_imager product, in addition to the coefficient of determination between any two sets of data, the root mean square error (RMSE) and the mean absolute relative error (MARE) between H_imager and H_Lidar are calculated:

RMSE = sqrt( (1/M) Σ_{i=1}^{M} (H_imager,i - H_Lidar,i)² )

MARE = (1/M) Σ_{i=1}^{M} |H_imager,i - H_Lidar,i| / H_Lidar,i

where M is the total number of pairs used in the analyses. Note that the term "error," rather than "difference," is used in these analyses. This is because the uncertainty of H_Lidar is very low (a few centimeters); thus, H_Lidar can be considered the ground "truth," and any difference between H_imager and H_Lidar will most likely originate from the system used to produce H_imager.
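The two statistics above translate directly into code (a small worked example with made-up depths):

```python
import numpy as np

def rmse(h_imager, h_lidar):
    """Root mean square error between image-derived and Lidar depths (m)."""
    return float(np.sqrt(np.mean((h_imager - h_lidar) ** 2)))

def mare(h_imager, h_lidar):
    """Mean absolute relative error, with H_Lidar treated as the truth."""
    return float(np.mean(np.abs(h_imager - h_lidar) / h_lidar))

h_lidar = np.array([2.0, 4.0, 5.0, 10.0])     # "truth" from ICESat-2
h_imager = np.array([2.2, 3.8, 5.5, 9.0])     # image-derived depths
# rmse -> sqrt(0.3325) ≈ 0.58 m; mare -> 0.0875 (≈ 8.8%)
```

Both are pool-level averages; as emphasized later, neither represents the error of any individual pixel.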

Predictability of Empirical Schemes
For a robust empirical scheme, the first aspect is to check whether there are strong correlations between the input and the desired output, which is termed predictability here and measured by the coefficient of determination of a linear regression (R²). A value of R² = 1.0 indicates 100% predictability or certainty. For the case of bathymetry, the output is the bottom depth, while the input is the spectral information (spectra of ρ_toa or R_rs here) or values after mathematical transformations (e.g., parameters Y or Z in Equations (2)-(5)). In the following, we use the compiled matchup datasets to show the different predictability of the abovementioned empirical schemes (TBRA, MBVA, MLA_Rrs, and MLA_ρtoa). Previous studies [24,26,41] have shown strong predictability (R² ≥ 0.86) of EEA (TBRA or MBVA) for the estimation of H from R_rs. Such high predictability is not always the case [28] (also see Table 2). Figure 2 shows matchup measurements (>3500 pairs) over the Great Bahama Bank, an environment with generally clear water and shallow depths [43,61,62]. The OLI R_rs was obtained on March 7, 2019, where H_Lidar (obtained on March 16, 2019) ranges ~1.5-9.0 m after tide correction. The R² value between Z_2-3 (for the R_rs ratio of OLI Band 2 and Band 3) and H_Lidar is ~0.36 (Figure 2(b)). These values indicate that such a ratio explains at most <40% of the variance for this dataset, although the radiometric measurements came from the same image and the distance spanned by the points (see the red dashed line in Figure 2(a)) is ~110 km. Most of the remaining variance (>60%) likely comes from the water column and bottom properties (i.e., assuming that uncertainties from the sun-sensor geometry and atmospheric properties can be omitted), and these variations cannot be resolved from Z_2-3. These R² values are significantly lower than those reported in previous studies [24,26,63], indicating a high data or environmental dependence of TBRA and its algorithm coefficients (m_0 and m_1).
The use of 20 or so data pairs to obtain a stable set of m_0 and m_1 [63,64] is likely a special case rather than a common situation. This also echoes the findings of earlier studies [25,27,65] that one set of empirical coefficients cannot satisfy all pixels, even within the same image, unless the threshold for acceptable uncertainty is relaxed.
The R² value increases to 0.88 (Figure 2(c)) if MBVA (Equation (4)) is used to predict H_Lidar for the same dataset, indicating significantly higher predictability of MBVA for this dataset or environment. The results are slightly better (R² of 0.91) if an MLA with 1 hidden layer and 5 neurons (for an R_rs spectrum containing four visible bands) is used (Figure 2(d) and Table 2). These results highlight the importance of using more bands [66], explicitly (MBVA) or implicitly (MLA), to account for the likely changes of water properties and bottom substrates across an image.
In an MLA, each neuron is similar to a free variable in a multivariate nonlinear regression; thus, more free variables tend to improve regressions. For the MLA with 1 hidden layer and 5 neurons (for R_rs), the number of free variables is equivalent to that of MBVA; therefore, the results suggest an improved capability of the MLA to pick up hidden relationships between R_rs and H_Lidar. This predictability is further improved with a deep learning architecture using 3 hidden layers and more neurons (see Table 2), indicating a great potential of machine learning for the empirical estimation of bottom depth. Further, it is found that the statistical measures are nearly the same (see Table 2) between using R_rs(443-655 nm) and using ρ_toa(443-2200 nm) (1 hidden layer with 8 neurons, i.e., the number of spectral bands plus 1) as the input to an MLA. These results suggest that, through a nonlinear scheme like an MLA, it is feasible to bypass the atmospheric correction step and retrieve H directly from the top-of-atmosphere measurements [46]. Similar contrasts are found when matchups from multiple images are pooled (see Table 3). MLA_ρtoa performs slightly better than MLA_Rrs across these multiple images with various atmospheric properties, further supporting the concept of obtaining H_imager from ρ_toa when an MLA is applicable. The improved predictability of MBVA and MLA is echoed by the MARE and RMSE values (see Tables 2 and 3), which are calculated after the model coefficients are determined through tuning or training. For instance, the MARE value is ~10% with an RMSE of 0.54 m for MBVA, and the MARE is ~8% with an RMSE of 0.52 m for the MLA. However, TBRA yields a MARE of 27% with an RMSE of 1.25 m (see Table 3), roughly three times the values obtained using MBVA and MLA. These evaluations indicate improved predictability when more bands, rather than information from just two bands, are used for the calculation of H_imager.
Such a result is expected because, as shown by Equation (9), the R_rs of shallow water is governed by at least 4-5 variables; thus, a ratio of R_rs at two bands cannot resolve all unknowns unless some of them are nearly constant or covary with each other in a region of interest. However, even for this region of the Bahamas, as shown by Barnes et al. [61] and Garcia et al. [62], the H, IOPs, and bottom substrates vary spatially. This is further evidenced by the spatial variation of a(440) (see Figure 3) derived from HOPE for the matchup data in Figure 2. Given such variation of a(440) (and no covariance with H), which plays a key role in the spectral variation of an R_rs spectrum, more bands and more free variables in an algorithm would improve the predictability. Note that a(440) in Figure 3 is derived from HOPE with H fixed as H_Lidar from ICESat-2 (after tidal correction) and G fixed as 0.002 m^-1 (see Section 5.2 for details), so the only variables in Equation (9) are P, B, and X; therefore, the resulting P values (and then a(440)) are more reliable after the reduction of variables.

Applicability and Confidence Measure
The ultimate goal of any algorithm is to apply it to new measurements, i.e., data not used in the tuning or training, in order to obtain the desired remotely sensed product. While an empirical algorithm for H can easily be developed from collocated imagery and H_Lidar data, the extent to which such an algorithm can be applied to new data is unknown. It has been demonstrated that the model coefficients (e.g., a_0,1 and m_0,1 in Equations (2)-(5)) developed from one image cannot be applied to another image if low uncertainties are the goal [26]. The scatter in the regression shown in Figure 2 indicates that these empirical coefficients may not be applicable even to locations within the same image, unless larger uncertainties are acceptable.
Conventionally, the applicability of an algorithm is assessed by evaluating its performance on an independent dataset, with the reported RMSE and/or MARE values as justifications [25-27]. It is necessary to keep in mind that such averages, although informative, are dependent on the data pool and do not represent the error or uncertainty of each pixel [34]. Because different users have different tolerances for the uncertainty of H_imager (e.g., a high accuracy requirement for navigation), the average error is insufficient to inform all users of H_imager. It is necessary and important to provide a confidence measure for H_imager products at each grid cell or pixel. The following addresses the confidence associated with both EBA and SAA, with a first-ever attempt to provide a pixel-wise confidence measure for H_imager.

Issues of the H_imager Map from Landsat-8 OLI.
Following the practices commonly presented in the literature [23,25], a map of 30 m resolution H_imager (see Figure 4) over the Great Bahama Bank was generated from a Landsat-8 OLI image (May 26, 2019) with a TBRA tuned using matchup data generated from this image and ICESat-2 bathymetry (May 25, 2019; the red dashed line in this map; a total of 1707 pairs of data). As shown in the literature [23,25], and as desired, the discrete or line-type bathymetry product from ICESat-2 is now expanded into a bathymetry map. Overall, for the western side of Andros Island, Great Bahama Bank, the bottom depth ranges from ~2.0 to 8.0 m. This is consistent with our general understanding and with depth retrievals from other observations and methods [61,62,67]. On average, the difference is ~28.0% when compared with that derived from MERIS [62], which is consistent with those reported in the literature [26,28]. Obviously, however, there are erroneous outputs: the retrieved bathymetry is ~15.0 m for the Tongue of the Ocean (TOTO), which is known to be ~2000 m deep. In other words, a "false positive" of a shallow bottom is derived from TBRA (a similar "shallow bottom" over TOTO is also observed from MBVA and MLA; results not shown here). Such false positives can also be found in Caballero and Stumpf [26] for waters around the Dry Tortugas, Key West (see Fig. 6d of Caballero and Stumpf [26]). These false positives result from the following two inherent limitations of empirical algorithms for H:

(1) Empirical algorithms for H (e.g., Equations (2)-(8)) are developed using data from optically shallow waters, as only such data carry an optical signal from the bottom.

(2) By design, an empirical algorithm is data-driven; i.e., it can only be applied to measurements with characteristics similar to those of the training data.

When an EBA, such as the TBRA, is applied to multiband imagery, however, these two basic requirements or assumptions are hardly tested or evaluated a priori.
In other words, an H_imager map was generated by blindly assuming that the algorithm is applicable to any R_rs in the image used during the algorithm tuning. Consequently, an erroneous bottom depth over TOTO was generated (see Figure 4). For this image, we know TOTO is optically deep, so such false products can easily be ignored or masked out. However, we do not know a priori the optical properties of all locations or pixels within an image; thus, it is not certain whether an environment is optically shallow or not. Therefore, it is not straightforward to mask out optically deep waters with empirical algorithms such as Equations (2)-(8). As such, the resultant H_imager of deeper waters is usually masked out manually, and arbitrarily (e.g., [68]).

Criteria to Check the Applicability of an Empirical Algorithm for H_imager. To confidently apply an EBA H_imager algorithm to a new R_rs spectrum, at the least it is important and necessary to check whether the R_rs from the target location meets the following two criteria: (1) Criterion 1 (Cr1): the pixel is optically shallow; and (2) Criterion 2 (Cr2): there is an identical or similar R_rs spectrum in the training pool.
These two criteria are presently omitted or ignored in the practices related to EBA for H_imager, although an SAA can separate optically deep vs. optically shallow waters during data processing [20,35].
Given that the EBA formulation for deriving depth cannot provide information on whether a pixel is optically deep or shallow, a neural network (NN_OSW) based on a Multilayer Perceptron (MLP) was developed to aid in the determination of Cr1. The MLP is a class of feedforward artificial neural network (ANN) composed of one input layer and one or multiple hidden layers associated with one output layer. Since Landsat data are used here, the values of R_rs(443), R_rs(482), R_rs(561), and R_rs(655) are the input, while the output is optically deep (assigned a value of 0) or shallow (assigned a value of 1). The number of hidden layers and the number of neurons in each layer were determined following the concept of minimum loss, a common approach for developing a deep learning system. Data used for the training came from known optically deep (Landsat measurements in Massachusetts Bay, Chesapeake Bay, and TOTO) and optically shallow (Great Bahama Bank, Florida Keys) environments. After many training attempts, two hidden layers with 32 and 16 neurons were found to provide the best performance for this separation. The Rectified Linear Unit (ReLU) function is employed as the activation function of the hidden layers, which largely avoids vanishing gradients. Since this is a binary classification, the activation function of the output layer is a sigmoid function. Training was considered complete when the iteration stopped and the loss function converged. Figure 5 shows an example of the OSW classification after applying NN_OSW to the Landsat-8 OLI image displayed in Figure 1. Although the deep vs. shallow separation may not be perfect at this stage, the waters of TOTO are optically deep and clearly separated. Since all neural network-based algorithms are data-driven, we envision that this initial NN_OSW will be updated after more optically deep and shallow data are incorporated.
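The NN_OSW architecture described above (4 R_rs inputs, hidden layers of 32 and 16 ReLU neurons, one sigmoid output) can be sketched as a forward pass; the weights below are random placeholders, not the trained values:

```python
import numpy as np

rng = np.random.default_rng(3)

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# layer sizes: 4 Rrs bands -> 32 -> 16 -> 1 (probability of optically shallow)
sizes = [4, 32, 16, 1]
weights = [rng.normal(0, 0.5, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def nn_osw(rrs):
    """Forward pass of the Cr1 classifier: returns P(optically shallow)."""
    x = rrs
    for W, b in zip(weights[:-1], biases[:-1]):
        x = relu(x @ W + b)                       # ReLU hidden layers
    p = sigmoid(x @ weights[-1] + biases[-1])     # sigmoid output for binary class
    return float(p)

# a pixel would be flagged optically shallow when the output probability > 0.5
rrs_pixel = np.array([0.010, 0.012, 0.015, 0.004])   # Rrs at 443/482/561/655 nm
p_shallow = nn_osw(rrs_pixel)
```

The sigmoid output is a probability-like score, so the deep/shallow decision reduces to a simple 0.5 threshold.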
Further, a similarity index (SIM Rrs ) is designed for Cr2, such that the higher the SIM Rrs value, the higher the measure of similarity, and therefore the more likely that the target R rs spectrum was "learned" in the phase of algorithm development. In this effort, SIM Rrs is defined and calculated as follows.

Figure 5: Classification of optically shallow waters (red) with the neural network scheme. Blue, grey, and white represent optically deep waters, land, and cloud or nonvalid R rs , respectively. There are more white pixels than in Figure 4, a result of negative or missing R rs data at 561 or 655 nm.
A target R rs spectrum (R rs tar ) is evaluated against each R rs spectrum in the training pool (R rs ref (j)):

MARD1 Rrs (j) = (1/N) Σ i |R rs tar (λ i ) − R rs ref(j) (λ i )| / R rs tar (λ i ),   (12)

MARD2 Rrs (j) = Σ i |R rs tar (λ i ) − R rs ref(j) (λ i )| / Σ i R rs tar (λ i ).   (13)

Here, MARD Rrs represents a mean absolute relative difference between two R rs spectra, with j for the jth R rs ref and N for the number of bands. The calculations of Equations (12) and (13) show two ways of quantifying MARD, where a small difference in R rs at a single band plays a bigger role for MARD1 Rrs , while it is the large value in R rs that plays a larger role for MARD2 Rrs . Since there are many R rs ref (j) in the data pool (1707 pairs in this case), many MARD1 Rrs and MARD2 Rrs values for a given R rs tar will be obtained; the minimum of the combination of MARD1 Rrs and MARD2 Rrs is selected for the quantification of SIM Rrs , calculated as

SIM Rrs = 1 − min j [0.5 × MARD1 Rrs (j) + 0.5 × MARD2 Rrs (j)].

The use of 50% for both MARD1 Rrs and MARD2 Rrs is a compromise between the two ways of evaluating spectral differences.
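Under one plausible reading of the two MARD measures (per-band mean relative difference for MARD1 Rrs , magnitude-weighted total difference for MARD2 Rrs ; the exact forms of Equations (12) and (13) are assumptions here), SIM Rrs can be computed as follows:

```python
import numpy as np

def sim_rrs(rrs_tar, rrs_ref_pool):
    """Similarity of a target Rrs spectrum against a training pool.

    Assumed forms (hypothetical reconstructions, not verbatim Equations (12)-(13)):
      MARD1(j) = mean_i |tar_i - ref_j,i| / tar_i       (small Rrs bands weigh more)
      MARD2(j) = sum_i  |tar_i - ref_j,i| / sum_i tar_i (large Rrs values weigh more)
      SIM_Rrs  = 1 - min_j [0.5 * MARD1(j) + 0.5 * MARD2(j)]
    """
    tar = np.asarray(rrs_tar, dtype=float)
    pool = np.atleast_2d(np.asarray(rrs_ref_pool, dtype=float))
    diff = np.abs(pool - tar)                   # |tar - ref| per band, per reference
    mard1 = np.mean(diff / tar, axis=1)         # per-band relative difference
    mard2 = np.sum(diff, axis=1) / np.sum(tar)  # difference relative to total signal
    return 1.0 - np.min(0.5 * mard1 + 0.5 * mard2)

# An identical spectrum in the pool yields SIM_Rrs = 1.0, as stated in the text.
pool = np.array([[0.006, 0.009, 0.012, 0.005],
                 [0.004, 0.005, 0.003, 0.001]])
sim_identical = sim_rrs([0.006, 0.009, 0.012, 0.005], pool)
sim_offset = sim_rrs([0.005, 0.008, 0.011, 0.004], pool)
```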

Confidence Score System for H imager

We thus propose to use this similarity measure to gauge the likely quality of H imager . For instance, if SIM Rrs = 1.0 for an R rs tar , there is an identical R rs ref in the training pool. Further, the absolute relative error of H (ARE H ) is known for each H imager in the data pool (see Figure 6 for an example); thus, the ARE H of this R rs tar is expected to be identical to that of the matching R rs ref , which can be found in the data pool. Therefore, based on the value of ARE H , the confidence or quality of each H imager can be classified as detailed below. Here, ARE H is calculated as

ARE H = |H imager − H Lidar | / H Lidar × 100%.

While a low SIM Rrs value indicates low confidence in H imager (i.e., R rs tar is likely out of the data range used in training), a high SIM Rrs value is not automatically a guarantee of high confidence or high accuracy of H imager . As shown in Figure 6, although the mean ARE H is ~5% (the R 2 value is 0.69 between H imager and H Lidar ) for the entire dataset, this does not mean it is 5% for each point. For some data points, the ARE H could be as high as 20%. Thus, if an R rs tar matches an R rs ref having an ARE H of 20%, the H imager of this R rs tar is expected to carry such a relative error from this algorithm.
Following the above indications, we designed a preliminary confidence score system (CSS) based on both SIM Rrs and ARE H in the data pool to classify the quality of the H imager product. At this initial stage, the CSS coarsely classifies the confidence of H imager into three classes: low (score = 1), medium (score = 2), or high (score = 3), determined with a decision tree (see Figure 7 for details). With this tree, H imager from an R rs tar spectrum having both a high SIM Rrs value and a low ARE H value can be considered to have high confidence. Figure 8 shows a map of confidence for the H imager product presented in Figure 4. Generally, for pixels not too far from the track of ICESat-2 and with depths of ~6.0-8.0 m, the confidence of H imager is high, a result of similar characteristics in the R rs data and the environment that were used in the training. For pixels near the coast of Andros Island and those with depths shallower than ~4.0 m or with very different bottom and/or water properties (see Figure 3 for the wide variation of a(440)), the confidence is low; the depths used in the training were ~5.0-8.0 m (see Figure 6), so this empirical algorithm did not "learn" the spectral characteristics of R rs of such environments. This low confidence is further confirmed using H Lidar obtained on March 16, 2019 (the black line in Figure 8). After adjusting for the tidal cycle to match the image time of May 26, 2019, and assuming no significant changes of bottom topography in the intervening ~70 days, it was found that ARE H is generally around 40% or more (see Figure 9), above the 25% criterion of low confidence. The overall accuracy of classifying low-confidence pixels is 80.2%, while the accuracy of classifying medium- and high-confidence pixels is ~1%. This extremely low accuracy for medium- to high-confidence pixels is due to the low number of such data points (see Figure 3), thus statistically not significant.
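The decision tree of Figure 7 is not spelled out in the text; the sketch below captures its spirit, with the 25% ARE H criterion for low confidence taken from the text and the remaining thresholds (sim_low, are_high) as hypothetical placeholders:

```python
def css_score(sim_rrs_value, are_h_percent,
              sim_low=0.8, are_low=25.0, are_high=10.0):
    """Preliminary confidence score for H_imager: 1 = low, 2 = medium, 3 = high.

    Only the 25% ARE_H threshold for low confidence is stated in the text;
    sim_low and are_high are hypothetical values for illustration.
    """
    if sim_rrs_value < sim_low or are_h_percent >= are_low:
        return 1  # out-of-range spectrum or large expected error: low confidence
    if are_h_percent <= are_high:
        return 3  # well-represented spectrum with small expected error: high
    return 2      # in between: medium confidence

scores = [css_score(0.95, 5.0), css_score(0.95, 15.0),
          css_score(0.95, 40.0), css_score(0.5, 5.0)]
```

A refined version would tune these thresholds against the matchup data pool rather than fixing them a priori.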
The less-than-perfect classification result stems, in part, from a few (~200) measurements where H Lidar values are less than 4.0 m, but the ARE H values for these points are under 10%; i.e., they belong to the high-confidence category. This excellent performance of TBRA for these pixels deserves further study, as the range of H Lidar used in TBRA development was ~5.0-8.0 m (see Figure 6). Nevertheless, these results (Figures 6, 8, and 9) highlight that, unlike Sonar- or Lidar-produced H, where the uncertainty in the H measurement is generally uniform, the uncertainty or confidence of the H imager product is far from uniform [34]; thus, it is important and necessary to have a pixel-wise measure of the quality of H imager . Further, the >80% success rate suggests that the CSS does provide a good indication of the confidence of H imager , although there is room for improvement.
Caballero and Stumpf [26] suggested the use of multiple acquisitions to measure the performance of an algorithm, with the assumption that bottom depth should remain the same (after tidal correction) over a short period of time. Such multiple observations are useful and important [51], but they may not overcome systematic biases embedded in an empirical algorithm. One example is the "shallow bottom" of TOTO (Figure 4); such a "shallow bottom" will repeat itself when similar empirical algorithms are applied to new multiband images.

Applicability of SAA
5.3.1. Example of H imager from Landsat-8 OLI. An SAA is not data-driven; its applicability depends on the R rs spectrum itself, as well as on the bio-optical models and the simplified expression for R rs [13,21,35,50]. As articulated in many studies, an SAA requires a highly accurate R rs spectrum as input, because errors in R rs propagate into the retrieved IOPs and/or H imager [21,33,35]. While empirical algorithms can overcome some systematic errors in R rs in the tuning or training phase, an SAA, at least in its present form, cannot. In addition, the number of wavelengths plays an important role in the retrieval of H [50], because within an SAA the IOPs and bottom properties are treated as independent unknowns that must be constrained by the available bands, while empirical algorithms (especially MLA) may find, and remedy to some extent, hidden relationships among them and thereby transfer systematic biases or relationships into the algorithm coefficients (explicitly or implicitly).
To demonstrate the retrieval of H imager from Landsat-8 OLI data with an SAA, the default HOPE algorithm was applied to the data pairs shown in Figure 2. Because the number of visible bands of OLI is smaller than the number of unknowns for optically shallow water bodies, Equation (9) is underdetermined. Considering that the spectral shapes of a ph and a dg in the 440-561 nm range are similar, and that a ph and a dg do not make significant contributions to the total absorption at 561 nm, G in Equation (9) was fixed as 0.002 m-1 in order to process Landsat-8 R rs ; this modified version is termed HOPE LS8 . This fixed G value of 0.002 m-1 simply reflects that, for waters in this region, it is close to the lowest value of a dg (440) [61]. Also note that HOPE LS8 is certainly subject to refinement, but that is not the focus here. Figure 10(a) compares the profiles of H imager from HOPE LS8 with H Lidar from ICESat-2, which are two independent determinations; an R 2 value of 0.66 was obtained. Figure 10(a) also shows the profile of ARE H for each pair, which ranges from 0.2% to 100%, with a median value of 17.7%. Although a generally consistent bathymetry pattern of H imager from HOPE LS8 is obtained, these statistical measures do suggest that substantially more effort is required if high-confidence H imager is to be retrieved by HOPE LS8 from such multiband R rs .
There could be many sources contributing to the moderate performance of the retrieved H imager , including the sensor's calibration, the atmospheric correction, the bio-optical models used in HOPE LS8 , and the number of available bands. It is not within the scope of this effort to address the impact of those elements or the refinement of HOPE LS8 , where algorithm improvement is constantly ongoing. Here, we focus on the necessity and development of a CSS to measure the pixel-wise confidence of H imager retrieved from an SAA (such as HOPE LS8 ). Brando et al. [35] developed a system to classify the quality of H imager into two categories (good or bad) based on the err Rrs value (Equation (10)), under the assumption that H imager has high confidence when a low err Rrs value (i.e., good closure between the measured and modeled R rs spectra) is obtained. However, err Rrs is determined by various components and many sources; the same err Rrs could be associated with different H results. For instance, when the bio-optical models are modified, a different H imager would be retrieved, but the value of err Rrs can remain the same. Thus, as demonstrated in Figure 10(b) and previous studies
[13,35,49], there is no relationship between err Rrs and ARE H (R 2 is ~0.1); thus, err Rrs is insufficient to indicate the quality of H imager when it is retrieved with an SAA. A small err Rrs , by its definition, indicates only a high agreement between the measured and modeled R rs spectra.

Confidence Score System for SAA-Derived H imager .
Following the CSS scheme for EBA, a prototype CSS for H imager from HOPE (CSS HOPE ) was also developed, as presented in Figure 11. Since an SAA determines a set of solutions using err Rrs , a first-order decision can be based on the value of err Rrs [13,35]. If the err Rrs value is higher than a threshold, the closure between the input and output R rs spectra is not sufficient; thus, the retrieved H imager could be questionable [13,35]. Here, we tentatively set this threshold for err Rrs as 0.02, as most err Rrs values are found to be smaller than this value (see Figure 10(b)) for the data pool shown in Figure 2. When there are no collocated and reliable H data available, the maximum relative contribution from the bottom to the total R rs is used as an indicator to gauge the confidence of the estimated H imager [20,35]. Too low a value (usually the threshold is 20% [20]) suggests a low contribution from the bottom and low confidence in the retrieved bottom properties [35]. Since there are matchups between R rs and H Lidar here, the CSS developed for EBA can be employed for pixels with err Rrs values below the threshold, as illustrated in Figure 11. Thus, for H imager derived from HOPE LS8 , a companion confidence score can also be produced.
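The two-stage logic of Figure 11 can be sketched as follows, with the err Rrs threshold of 0.02 taken from the text and the EBA-stage thresholds being hypothetical placeholders:

```python
def css_eba(sim_rrs_value, are_h_percent,
            sim_low=0.8, are_low=25.0, are_high=10.0):
    """EBA-style scorer (1 = low, 2 = medium, 3 = high); thresholds other than
    the 25% ARE_H low-confidence criterion are hypothetical."""
    if sim_rrs_value < sim_low or are_h_percent >= are_low:
        return 1
    return 3 if are_h_percent <= are_high else 2

def css_hope(err_rrs_value, sim_rrs_value, are_h_percent, err_threshold=0.02):
    """CSS_HOPE sketch: gate first on spectral closure (err_Rrs <= 0.02, per
    the text), then score the surviving pixels with the EBA-style CSS."""
    if err_rrs_value > err_threshold:
        return 1  # poor closure between measured and modeled Rrs: low confidence
    return css_eba(sim_rrs_value, are_h_percent)

hope_scores = [css_hope(0.05, 0.95, 5.0),   # fails the closure gate
               css_hope(0.01, 0.95, 5.0),   # passes gate, small expected error
               css_hope(0.01, 0.95, 15.0)]  # passes gate, moderate error
```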
As an example, Figure 12 shows a map of H imager obtained from HOPE LS8 (Figure 12(a)) and its confidence map (CSS HOPE , Figure 12(b)). Similar to the bathymetry map obtained from TBRA, the depth west of Andros Island obtained from HOPE LS8 also has a range of ~4.0-8.0 m, a pattern generally consistent with earlier observations. For waters of TOTO, the H imager map shows a depth of 20 m, which is basically the upper boundary preselected within the HOPE LS8 system; the contribution from the bottom there is actually negligible when processed with HOPE LS8 , so such pixels can easily be marked as optically deep water, as in Lee et al. [20] and Brando et al. [35]. More importantly, the pixel-wise quality of H imager (CSS HOPE ) shown in Figure 12(b) provides a clearer indication of the confidence of the H imager product pixel by pixel. As in Figure 8, higher confidence of H imager is found for pixels around the ICESat-2 track, and low confidence is found for locations near the coast. Evaluation using H Lidar (March 16, 2019; the black dashed line in Figure 12(b)) indicates a success rate of ~99% in identifying low-confidence pixels.

Figure 12: (a) As in Figure 4, the H imager map obtained from Landsat-8 OLI R rs after applying HOPE LS8 . (b) Confidence score map (3 = high, 2 = medium, 1 = low) for the H imager product shown on the left. Grey is for land, and white is for cloud or nonvalid R rs .
In addition, there are differences in the distributions of confidence between the two H imager products (see Figures 4 and 12), a clear indication of the different performance of the two approaches for bathymetry. On the other hand, because H imager from an SAA (e.g., HOPE LS8 here) is a determination of H independent of H Lidar , an SAA offers an opportunity to check the consistency between the two measurements, which is not possible with an EBA.

Conclusions and Future Perspective
Through many decades of effort, there is no shortage of remote sensing products from multiband or hyperspectral imagers, but there is a shortage of remote sensing products accompanied by a confidence measure; this is especially true for the remote sensing of bathymetry. Compared to active measurements of bottom depth by Sonar or Lidar, the retrieved H imager still faces difficulties in its adoption by the broader community, and a key limiting factor until now has been the lack of pixel-wise confidence for the H imager product.
To fill this void, a prototype confidence score system (CSS) for H imager is proposed for the first time, which at present classifies all pixels in an H imager map of OSW into three categories (low, medium, and high) with a preliminary set of criteria. Since this CSS involves both the algorithm coefficients and the data used for the development of empirical algorithms, it is logical that not only the algorithm function and model coefficients be reported but also that the data pool used for algorithm development be deposited in a common data portal. In the future, while it is always necessary and important to continue refining these algorithms, it is also important, and urgent, to develop, revise, or refine such systems to measure the confidence of the resulting H imager pixel by pixel. Specifically, this includes refinement of the quality classes, the thresholds and settings of the criteria, and the desired statistical measures. Only H imager products of high confidence from multiple images should be merged to form a reliable map for the broad user communities. We call on the ocean color community to refine such schemes or to develop brand-new systems, so that a mature and widely endorsed system can be implemented to clearly measure the quality of H imager , a critical parallel product of remotely sensed bathymetry. To reach this goal, it is also urgent and important to compile, by the community and for the community, an inclusive data pool of collocated or concurrent measurements of H Lidar and high-quality ρ toa and R rs spectra covering a wide range of depths and environments.

Data Availability
The satellite data used to support the findings of this study are publicly available. Landsat-8 data can be downloaded from the USGS website (https://glovis.usgs.gov/), while ICESat-2 data can be downloaded from the National Snow and Ice Data Center (NSIDC), where the geolocated photon data (ATL03) can be found online at https://nsidc.org/data/atl03.

Conflicts of Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.