Direct Retrieval of NO 2 Vertical Columns from UV-Vis (390-495 nm) Spectral Radiances Using a Neural Network

of modeled a priori pro ﬁ les present an outstanding bottleneck in operational satellite NO 2 retrievals. In this work, we instead use neural network (NN) models trained from over 360,000 radiative transfer (RT) simulations to translate satellite radiances across 390-495nm to total NO 2 vertical column (NO 2 C). Despite the wide variability of the many input parameters in the RT simulations, only a small number of key variables were found essential to the accurate prediction of NO 2 C, including observing angles, surface re ﬂ ectivity and altitude, and several key principal component scores of the radiances. In addition to the NO 2 C, the NN training and cross-validation experiments show that the wider retrieval window allows some information about the vertical distribution to be retrieved (e.g., extending the rightmost wavelength from 465 to 495 nm decreases the root-mean-square-error by 0.75%) under high-NO 2 C conditions. Applying to four months of TROPOMI data, the trained NN model shows strong ability to reproduce the NO 2 C observed by the ground-based Pandonia Global Network. The coe ﬃ cient of determination ( R 2 , 0.75) and normalized mean bias (NMB, -33%) are competitive with the level 2 operational TROPOMI product ( R 2 = 0 : 77 , NMB = − 29 % ) over clear ( geometric cloud fraction < 0 : 2 ) and polluted ( NO 2 C ≥ 7 : 5 × 10 15 molecules/cm 2 ) regions. The NN retrieval approach is ~ 12 times faster than predictions using high spatial resolution ( ~ 3km) a priori pro ﬁ les from chemical transport modeling, which is especially attractive to the handling of large volume satellite data.

With this promise, retrieval algorithms have been developed and continuously improved to translate the rapidly growing volume (terabytes per day) of earth observation data to physically relevant information (e.g., NO 2 abundance). The majority of existing tropospheric NO 2 retrievals include three steps: (1) NO 2 slant column density (SCD) determination from radiances through spectral fitting [29,30]; (2) stratospheric-tropospheric component separation [31][32][33], and (3) air mass factor (AMF) and vertical column density (VCD) calculation [12,34]. Large uncertainties exist in such retrievals as exemplified by the significant discrepancies among different products from the same sensor, e.g., the ozone monitoring instrument (OMI). Lamsal et al. [35] found that the standard OMI NO 2 product is 22% lower in winter but 42% higher in summer than the DOMINO product [36,37] over North America, although both were derived from the same SCDs. Regional OMI research products such as the BEHR [38] over North America and POMINO [39] over East Asia also exhibit systematic differences against global operational products due mainly to their locally refined AMF calculations. The AMF [34,38] relates the physically meaningful VCD with the optically represented SCD (i.e., the total NO 2 amount along all light beam paths received by the satellite instrument). The AMF is the largest source of VCD retrieval errors, containing a structural uncertainty estimated to be in the range of 30-50%, dependent on the representation and treatments of surface reflectivity, clouds, aerosols, and NO 2 vertical profile shape during the retrieval [13,40,41]. Higher-resolution inputs of these parameters in regional research products were found to alter the AMF and the retrieved VCD by up to 40-80%, with better resolved NO 2 spatial gradients and local enhancements [38,39,[42][43][44]. Valin et al. [45] showed that a model resolution of <12 km is needed to resolve accurate NO 2 gradients (as well as the a priori profile) around area sources such as cities (~1 km for point sources such as power plants). But a typical trade-off to provide such high-resolution inputs is the computational resources required, especially in terms of the a priori NO 2 vertical profiles which require the use of a chemical transport model.
Global operational products currently adopt profiles simulated at mesoscales (~1°or~100 km), which were close to the field of view (FOV) of early sensors like GOME (40 × 320 km 2 ), but are now over 30 times larger than the nadir FOVs of contemporary instruments such as OMI (13 × 24 km 2 ) and the TROPOspheric Monitoring Instrument (TROPOMI, 5:5 × 3:5 km 2 ). With typical computation environments (e.g., 1 CPU with~30 cores), the wall time needed to perform such global simulations for 1 month is 2-3 days, which scales as the square of the horizontal resolution [46], making long-term global simulation at local scale (~3 km) almost infeasible. Moreover, the simulated NO 2 profiles also suffer from errors in the model inputs (e.g., emission inventory) and assumptions (e.g., chemical mechanism). In addition, using climatological monthly model profiles was commonly adopted in operational retrievals [33,44]. Laughner et al. [47] showed that the NO 2 VCDs derived from profiles simulated for the exact day could differ by up to 40% due mainly to day-to-day changes in meteorology. For another instance, lightning NO x is critical for ozone production in the upper troposphere, while current model parameterization is not adequate to correctly simulate lightning strength [48]. Local VCD errors of up to 100% [49,50] were reported due to simulated profiles with modeled lightning, limiting the consequent application of these retrievals to constrain lightning. Finally, the retrieved NO 2 VCDs also further impact inverse modeling and data assimilation. A recent study suggested that inconsistent modeled profiles used in the retrieval and in the assimilation alone could increase the a posteriori NO x emission errors by up to 30% over polluted regions [51].
Satellite-observed radiances from nadir observations inherently contain information about the vertical distribution of species, provided that a broader spectral range is used, and the gaseous absorption is strong enough. The theoretical foundation of such vertical sensitivity based on radiative transfer (RT) was introduced in previous studies [52][53][54][55]. Briefly, the atmospheric scattering of molecules and aerosols decreases as a (up to -4th) power function of the wavelength in the UV-Vis range, making the sensitivity (e.g., weighting function or AMF) of spectral radiances to gas absorption also an inverse function of wavelength. Relative to higher altitudes in the atmosphere, gases at lower altitudes absorb the photons that are transmitted more strongly compared to scattering, yielding stronger spectral contrast of the AMF. This spectral contrast is more pronounced between more distant wavelengths, and is further enhanced by the temperature (i.e., altitude) dependency of NO 2 absorption strength across the UV-Vis [55]. Physics-based retrievals that exploit these underlying mechanisms have been applied to the retrievals of ozone [54] and SO 2 [52,53] vertical distributions. For NO 2 , such physics-based approaches have also been developed [56] for its column retrieval; however, they still relied on preassumed profile shapes. A recent sensitivity study based on the optimal estimation (OE) framework [57] suggested that physical retrieval of tropospheric NO 2 vertical distribution from satellite measurements is only possible under highly polluted conditions (e.g., boundary layer NO 2 column > 2 × 10 16 molecules/cm 2 ) where the information content (i.e., degree of freedom for signal) could reach 2. This low information content plus the high computational expense over a broader spectral range (i.e., 320-500 nm) discouraged attempts to perform purely OE-based determination of NO 2 VCD from satellite radiances. It is still unclear whether the dependence on a priori profiles of NO 2 retrievals could be loosened and whether a direct inference of NO 2 VCD from satellite observed radiance is achievable, amid the uncertainty of a priori profiles from models.
As an emerging and increasingly attractive tool to predict relevant parameters directly and efficiently from earth observation data [58], machine learning could be an alternative to further investigate this question. Although the interpretability of the prediction model is less than physics-based models, its computational efficiency is especially welcomed by the handling of dense satellite observations, and the model accuracy could be continuously improved by regular updates (iteration) including newly available training data. Capable of resembling underlying nonlinear relationship between relevant variables, the neural network (NN) approach has been applied widely to expedite forward RT modeling [59,60] and inverse retrieval processes [61][62][63][64] by linking satellite signals directly with relevant geophysical parameters using the large volume of a training dataset.
In this paper, we describe the development of an NN-based retrieval approach to directly determine the total vertical column of NO 2 from UV-Vis radiances, using TROPOMI observations as a testbed. We show for the first time that a direct inference of NO 2 column without presimulated a priori NO 2 vertical profiles has similar data quality compared to operational global retrievals (i.e., the standard TROPOMI products), under clear-sky and polluted conditions (i.e., total NO 2 ≥ 7:5 × 10 15 molecules/cm 2 and cloud fraction < 0:2). We built the NN model capabilities based on a synthetic radiance dataset that spans realistic clear-sky scenarios of observing conditions. Section 2 provides a detailed discussion of retrieval sensitivities. We then applied the NN prediction model to four months of TROPOMI data (Section 3) and performed a comparative and evaluation analysis with the standard products and ground-based measurements. Section 4 summarizes the discussion and points to various pathways for future research.

Neural Network-Based NO 2 Column
Retrieval Model The NN model (box 2 of Figure 1) to predict NO 2 column density from satellite radiances was generated from a large volume (>360,000 samples) of simulated spectra that span a wide range of realistic observing scenarios (Section 2.1 and box 1 of Figure 1). The model performance was evaluated via a cross-validation approach (Section 2.2 and box 3 of Figure 1).

RT Simulation.
We used the UNified Linearized Vector Radiative Transfer Model (UNL-VRTM) [65,66] to generate a synthetic dataset of radiances observed by a satellite instrument (i.e., TROPOMI) and the corresponding input variables (box 1 of Figure 1). UNL-VRTM facilitates a userfriendly interface for modifying surface and atmospheric optical parameters, which were fed to the Vector Linearized Discrete Ordinate Radiative Transfer (VLIDORT) [67] RT code to calculate the top of atmosphere (TOA) radiances. UNL-VRTM has been widely used for aerosol retrieval and relevant sensitivity studies based on both band-averaged and hyperspectral radiances [68][69][70][71]. To exploit the full information possible for NO 2 retrieval in the UV-Vis measurements, we simulated radiances between 320 and 500 nm, with an interval of 0.01 nm. The simulations were run in scalar-only mode to accelerate the calculation by an order of magnitude across >18,000 wavelengths. Polarization correction factors for each wavelength were interpolated from an additional vector-mode run at 21 wavelengths following [54]. The simulated radiances (I) were then convoluted with the TROPOMI spectral respond function (SRF), normalized to solar irradiance (E), and then, recorded as log base values, to mimic the hyperspectral observations (L λ = ln ðI λ /E λ Þ) at~0.2 nm spectral resolution in conventional retrievals [29,62,72]. The input variables of UNL-VRTM in each simulation varied randomly (sampled from uniform distributions) within their realistic ranges according to Table 1. First, solar and satellite position angles were varied in the simulations: solar zenith angle (SZA), satellite/view zenith angle (VZA), and relative azimuth angle (RAA). Surface reflectance spectra (R λ ) were randomly obtained from a land climatological dataset as a combination of MERIS and OMI ( Figure S1). The whole atmosphere from sea level to 80 km (0.01 hpa) was divided into 47 layers (30 layers below the assumed tropopause at 12 km) following the hybrid sigma-pressure vertical grid setting in the GEOS-Chem model [51], with the temperature and pressure in each layer consistent with the US standard (1976) atmosphere. The actual layer number in each simulation depended on the surface altitude (H s ) inputs (varied between 0 and 8 km). For aerosols, the size/microphysical properties (water-soluble aerosols from the OPAC dataset [73]) and vertical profile (exponentially decreasing with a scale height at 1 km) were fixed while the optical depth varies in each simulation. Apart from NO 2 , vertical columns (G ot ) of four additional absorbing gases (SO 2 , O 3 , H 2 O, and CH 2 O) also varied with their fixed profile shapes [74] from the US standard (1976) atmosphere encoded in the UNL-VRTM. The O 2 -O 2 absorption was also included and implicitly considered based on the pressure in each layer. Raman rotational scattering (Ring effect) [75] is not considered during the RT simulation for this wide spectrum, and we applied the Ring correction (Section 3.1) to ensure consistently Ringfree L λ in both the simulation and observation.
The NO 2 vertical profile is modeled as composed of two parts: a stratospheric component with fixed vertical distribution following the US standard profile shape between 12 and 80 km, and a tropospheric component with a quasi-Gaussian 3 Journal of Remote Sensing shape [52,53] between the surface and tropopause (fixed at 12 km). The half width of the tropospheric plume profile is a random number between 0.4 and 1 km for each case. The distribution of NO 2 abundance over different regions is strongly uneven, and heavy pollution (i.e., tropospheric N O 2 > 1016 molecules/cm 2 ) is more likely to occur over urban areas and near the ground [76]. To ensure representativeness, the whole training dataset was also separately  Figure 1: Illustration of the neural network model generation to predict total NO 2 column density from satellite radiances. Box 1 includes the details about relevant variables driving the RT simulations (Section 2.1), with parameters in dark blue (red) being potential predictors (predictands) in the NN model; box 2 shows the structure of the NN model; and box 3 presents the conceptual illustration of crossvalidation. The total NO 2 column (NO 2 C) is the sum of stratospheric (SC) and tropospheric (TC) column. For L λ and R λ , the actual inputs in the NN are their associated PC scores (Sections 2.2 and 3.2). Other absorption gases (G ot ) and aerosol optical depth (AOD) varied according to Table 1 in the simulation while are not included as predictors in the final NN model according to sensitivity tests in Section 2.4. Table 1: Input variables and their corresponding data ranges or sources in the RT simulation (DU is Dobson unit, 1 DU = 2:69 × 10 16 molecules/cm 2 ). Each simulation adopts a random number for each variable. See Section 2.1 and Figure 1 for the definition of each abbreviated variable.
Variable group NN variable Range (source) Notes Figure S2 about four sub-groups Journal of Remote Sensing generated for four subgroups ( Figure S2), as defined based on specific ranges of tropospheric NO 2 column (TC) and centroid height (TH), as well as H s . Specifically, the simulations were dominated by urban polluted (low H s , low TH, and high TC) and remote clean (random H s and TH, low TC) scenarios, plus some elevated NO 2 enhancements due to lightning and biomass burning (random H s , high TH, and high TC), and a small number of polluted cases over high altitude surface (high H s , low TH, and high TC). TH, TC, and H s varied randomly within the specific range (Table 1) in each subgroup. Our training set contains 51703 samples with H s > 2 km and bright surface (R λ > 0:6 with <3% spectral variability), which could represent high-altitude snow or cloud surfaces. We directly use the NN from the whole training set (Section 2.2) for retrieving above-cloud NO 2 C under fully cloudy assumptions (see details in Section 3.2). As this study was not designed to perform retrievals for cloudy scenes where NO 2 retrievals are highly uncertain, and we focus on discussing the performance of retrievals under clear-sky conditions (Sections 3.3 and 3.4), the lack of cloudy-scene specific retrievals does not alter the conclusions of this paper. Future efforts to improve the treatment of cloudy scenes in the RT simulations and NN retrievals are needed to assess the ability of the NN to describe these scenes (see also discussion in Section 4).

NN Training and Evaluation.
The overall structure of the feed-forward NN used in this study [77] is defined in the box 2 of Figure 1, which composes of an input layer of predictors (dark blue), an output layer of predictands (red), and several hidden layers to mimic their relationship. Each hidden layer includes several computational nodes (neurons, N), and each neuron is modeled as a nonlinear activation function (f ) of the weighted (the weights denoted as w) sum of all neurons in the previous layer plus an offset (b): where i is the layer index and k ðjÞ is the node index in layer i ði + 1Þ. The training and application of the NN was done using the Python scikit-learn (sklearn. neural_network.ML-PRegressor) package [78], which attempts to optimize these weights (w) and offsets (b) to minimize the least square difference between predicted and true predictands in the training data. Our numeric tests favored an NN configuration of two hidden layers, with 16 nodes in the first layer and 8 nodes in the second, that had reached the optimal model prediction power. Further increasing node or layer numbers did not improve the performance. We define the predictand in our study as the total NO 2 vertical column (NO 2 C = TC + SC) in the training dataset, reflecting the unknown NO 2 stratospheric/tropospheric separation in real-world retrievals. Trials on retrieving SC or TC were also tested. The performance was significantly weaker, reflecting the low information content about vertical location of NO 2 in most of the satellite observed radiances.
Further processing of the retrieved NO 2 C using various available algorithms [31,32] could separate the stratospheric and tropospheric component, which is beyond the scope of this study. Inherently, all the remaining variables in the RT simulation (i.e., dark blue terms in box 1 of Figure 1) plus the satellite radiances (L λ ) are potentially predictors of the NN model, from which we will determine their actual contribution to model prediction power and select the employed predictors in the final NN model (Section 2.4).
Within the predictors, the spectral surface reflectance (R λ ) from the climatology in Figure S1 and the simulated satellite radiances (L λ ) data contains a vector of 20 and up to 900 wavelengths, respectively, and most of these spectral observations are correlated [79], implying information redundancy. We followed previous studies [62,72] to use the principal component analysis (PCA) technique to reduce the dimension of these variables and simplify the NN model training while maintaining the information content. The PCA was conducted using the sklearn.decomposition.PCA routine in scikit-learn [78]. We found that the top three leading PC scores of R λ could reproduce the full (i.e., >99.99%) variability of the employed monthly climatology over land, which were used in the actual NN model experiments. For L λ , the 15 leading PC scores could explain the full variability in all the training data, and we also only selected several key PC scores that are the most relevant for NO 2 column prediction in the final NN model (Section 2.4).
We used the 5-fold cross-validation technique [64,80] to evaluate the theoretical model performance for predicting NO 2 C in the training dataset (box 3 of Figure 1). Specifically, the whole training data samples were divided into 5 equalsized groups. In building the NN model (fold 1), the first 20% of the data was used as the evaluation data, and the remaining 80% was used for training. This was repeated five times until the data were fully covered, with every NO 2 C record containing a truth and prediction pair, from which the overall evaluation statistics (e.g., R 2 ) were calculated. The cross-validation helps to minimize bias in training data selection and ensures the representativeness of the evaluation metrics of the whole training records.

Determination of Optimal Retrieval Spectral
Range. The spectral windows for previous NO 2 retrieval were mostly near the peak NO 2 absorption (i.e., the absorption cross section σ) around~430 nm, with the width largely increasing from GOME (425-450 nm) [14] to TROPOMI (405-465 nm) [29]. However, the benefit of wider spectral windows might also be compensated by the uncertainty of other absorptions (especially further from the σ peak) and reduced validity of constant slant column in the spectral fitting [30,55]. Here, we apply the NN training and cross-validation as a pure data-driven tool to investigate which spectral range provides the maximum useful information.
Using all the potential predictors (dark blue terms in the box 1 of Figure 1 plus the leading 15 PC scores of L λ ), we conducted the NN training and cross-validation adopting radiances covering different spectral ranges. We repeated the experiment 5 times (i.e., vertically aligned circles in 5 Journal of Remote Sensing Figure 2) for each retrieval window to reduce the impact of NN's internal uncertainty of numeric solution. The results below are based on L λ from the SRF of the 200th TROPOMI detector row, which represent consistent findings from other rows as expected. Figure 2 shows the evolution of the coefficient of determination (R 2 ) following the gradually sliding spectral window of L λ , and Figure S3 shows the corresponding root-meansquare-error (RMSE). The synthetic data was divided into low-NO 2 (NO 2 C ≤ 0:4 DU, (a, b)) and high-NO 2 (NO 2 C > 0:4 DU, (c, d)) subgroups for training. Fixing the left wavelength at 390 nm and varying the max wavelength of the included radiances ((a, c)), we can see a trend of continuously increasing R 2 following the wider spectral range, demonstrating the stronger NN model capability to predict NO 2 C from extraspectral information for both lowand high-NO 2 cases. For low-NO 2 , the model improvement vs. wavelength range slows after~465 nm, with additional RMSE reduction of~0.50% from 465 to 495 nm, compared to the reduction of~1.85% from 430 to 465 nm. This corresponds to a sharp reduction of NO 2 absorption (σ < 5 × 10 −19 cm 2 ), suggesting strong dependence of the model performance on absorption strength rather than window width. Under high-NO 2 scenarios, the model continues to improve (RMSE reduction of~0.75% from 465 to 495 nm vs.~1.95% from 430 to 465 nm), suggesting additional information from extended spectral width is useful despite weaker NO 2 absorption and interferences of other absorptions, e.g., the strong O 2 -O 2 absorption at 477 nm. As previously mentioned, satellite radiances contain sensitivity to the NO 2 vertical location under polluted conditions due to spectrally dependent scattering (i.e., AMF) [57], which is consistent with this more robust increase of R 2 vs. broader spectral range under high-NO 2 scenarios.
However, retrievals under low-NO 2 have a systematic~2% smaller R 2 than under high-NO 2 conditions, due to the lack of vertical sensitivity that affects the model performance at less-absorbing spectral regions (i.e., weaker ability to distinguish between relatively lower NO 2 C aloft and higher NO 2 C near the ground). Based on this continuous improvement, we take the ending wavelength in the final NN model as 495 nm, the end of the range in the TROPOMI Band 4 detector.
Similarly, fixing the ending wavelength at 495 nm and sliding the left wavelength (Figures 2(b) and 2(d)) consistently suggests more stable increase of R 2 following wider spectral range used for the high-NO 2 cases, until the min wavelength reaches around 390 nm. Below 390 nm, the interferences from other gases, aerosols, and surface albedo result in plateauing (under high NO 2 ) or even decrease (under low NO 2 ) of R 2 . Although the variabilities of these additional gases were included in the NN predictors, they appear not to compensate for the reduction of NO 2 signal over these wavelengths. Including wavelengths below 360 nm under low NO 2 is unstable, as indicated by the increasing scattering of R 2 among 5 repeated experiments. We therefore adopted the 390-495 nm range as the optimal retrieval window for the final NN model configuration. This retrieval window is determined assuming uniform performance of the measured spectral radiance. Inconsistencies between the Band3 and Band4 detectors, as well as degraded precision towards the end of Band4 (e.g., >492 nm), are expected in realistic measurements. We emphasize the general usefulness of the extended information from wider spectral range in realistic retrievals (see Section 3.3 and Figure S8), while a more comprehensive optimization of retrieval window warrants further investigation and should vary from sensor to sensor.  Journal of Remote Sensing Figure 2 suggests qualitatively contrasting sensitivities of satellite radiances to the spectral range (i.e., NO 2 vertical information) and a necessity of separate consideration, between high-and low-NO 2 cases in the NN model generation. More detailed tests (e.g., Figure S4) consistently indicated that including training records with both highand low-NO 2 environments resulted in unexpectedly decreasing R 2 vs. wider spectral range over 390-430 nm, reflecting the two different sensitivity regimes that cannot be jointly resolved by one NN model. We therefore separately constructed the NN models for low-and high-NO 2 cases.

Determination of Sensitive Predictors.
Numerical methods like NN tend to be subject to overfitting, i.e., the variability of unrelated predictors falsely contributing to the model prediction. We selected predictors that contribute significantly to NO 2 C prediction, using an additive approach. We first ranked the importance of each predictor by the cross-validation R 2 when only that predictor is used in the NN model. We then perform NN training and cross-validation iteratively, by adding one predictor (from the most to least important) at each iteration. The three GEO angles and R λ PC scores were considered as two variable groups in the experiment. The improvement to R 2 by adding each predictor is used to evaluate its importance. Again, we exemplify the results below based on L λ over 390-495 nm and from the SRF of the 200th detector row. Figure 3 shows the changes in cross-validation R 2 in the additive NN experiment. The key variables (labeled in red) that contribute to the model prediction power were selected to build the final NN models. For low-NO 2 cases, the 7th and 8th PC scores of L λ from the radiances together could explain nearly 80% of the variability of NO 2 C via NN, which is further improved by 4-9%, respectively, by adding the 6th PC score, GEO angles, and R λ . For the high-NO 2 model, the 5th PC score of L λ alone can explain more than 80% of the total NO 2 C variability, and GEO and R λ are also significant variables. We found that the 5th PC loading (R = 0:88) extracted from L λ of high-NO 2 cases, and the 7th (R = −0:68) and 8th (R = 0:55) PC loadings from the low-NO 2 training set are all highly correlated with the NO 2 σ. The "earlier appearance" of important PC (i.e., 5th compared to 7th and 8th) in the high-NO 2 model can be explained by the stronger contribution to the variability of L λ in the training set. Finally, the 13th PC score appears to be the most relevant variable to predict TH in a onepredictor NN test, explaining~15% TH variability in cross-validation. Hedelt Figure 3: Changes in cross-validation R 2 as employing increasing number of predictors in the NN training for low-NO 2 (a) and high-NO 2 (b) cases. The additional variability explained after adding each variable (group) was indicated in red above each waterfall bar, and PC# denotes the #th PC scores or L λ . All the variables in the shaded area introduce <1.5% additional variability according to the experiment. 7 Journal of Remote Sensing satellite hyperspectral radiances. Reasonably, we can see that this predictor also helps improve the prediction of NO 2 C in high-NO 2 cases.
The importance of surface altitude (H s ) was found different in the low-NO 2 and high-NO 2 NN model due to their different vertical sensitivities. Because the sensitivity to tropospheric NO 2 height is weak, the retrieval under low-NO 2 also becomes almost insensitive to H s . On the contrary, sensitivity to NO 2 altitude is more pronounced in satellite radiances under high-NO 2 environments, where accurate H s information also becomes significant. Variabilities of AOD, other absorption gases (G ot ), and other PC scores of L λ were found to contribute <1.5% of additional predicted NO 2 C variability and are therefore not included in the final NN model.

Application to TROPOMI
The NN model built from the previous section can be applied to TROPOMI radiances and complimentary data to test the validity of retrieved NO 2 C. TROPOMI is a hyperspectral backscattering sensor onboard the sun-synchronous Sentinel-5 Precursor (S5P) satellite, achieving daily global coverage with an afternoon (13 : 30) overpass. TROPOMI has a wide swath of 2600 km and horizontal resolution of 5:5 km × 3:5 km (since August 6, 2019). Across the two UV-Vis Bands (Band 3: 320-405 nm and Band 4: 405-495 nm), the measured radiances have a spectral resolution of 0.2-0.4 nm and a signal-to-noise ratio of~1000. These characteristics enable TROPOMI to provide unprecedented urban-scale monitoring of atmospheric composition. Retrievals of the level 2 (L2) operational TROPOMI NO 2 product follow the typical three-step process [13,29,81], in which the TM5-MP model is used to feed the a priori NO 2 profiles for an AMF calculation. Validation of the L2 products suggest a systematic underestimation of L2 NO 2 C by 30% against global Pandora measurements at high-NO 2 (median NO 2 C > 9 × 10 15 molecules/cm 2 ) [82].
Details about datasets and variables used in the retrieval are summarized in Table S1. The data processing flowchart is summarized in Figure 4 and Sections 3.1 and 3.2. We then present a comprehensive evaluation of the retrievals over four months (September/December 2019 and March/ June 2020) and three source regions (East Asia, Europe, and North America, Figure S5), using independent satellite and ground-based dataset in Sections 3.3 and 3.4.

Wavelength Calibration and Ring Correction.
We first calculated L λ from TROPOMI L1B solar irradiance (E) and earthshine radiance (I) data (Bands 3 and 4). Since the whole Band 4 spectra and part of Band 3 (390-495 nm) are used, the wavelength shift and stretch terms for wavelength calibration in TROPOMI standard level-2 (L2) NO 2 data (405-465 nm) are not suitable, and we repeated the calibration process before L2 NO 2 retrievals [29] in this study. Briefly, the shift and stretch terms were first determined for each day and detector row using the solar irradiance observation and reference spectrum (E 0 ). Then, a further shift (w λ ) and Ring coefficient (C ring ) for each TROPOMI pixel were fitted to where λ n is the wavelength nodes after the first-step irradiance-based calibration, P is a 3rd-order polynomial function of wavelength that implicitly accounts for the effects of slit function and other errors during the radiometric calibration, I is the radiance intensity measured by TRO-POMI, and I ring /E 0 is the simulated sun-normalized Ring spectrum. The employed solar and Ring reference spectra (Table S1)    Journal of Remote Sensing As we did not include the Raman rotational scattering in the RT simulation, the correction of Ring effect based on the fitted C ring is necessary.
3.2. Visible-Only NO 2 C Calculation. The TROPOMI L λ could then be applied to the NN model to determine NO 2 C. For each TROPOMI pixel, we retrieved four NO 2 C values, differing by the adopted NN model for high-NO 2 or low-NO 2 conditions and the assumption of pure clear-sky or fully cloudy scene (Figure 4). Under the clear-sky assumption, the R λ (calculated using GEO and MODIS BRDF product [83] at 470 nm and spectral shape from the climatology introduced in Figure S1) and H s (from the GMTED 2010 [84] global elevation dataset) described the surface; while for fully cloudy retrievals, R λ (cloud albedo with flattened spectral shape) and H s (converted from cloud pressure) were from the FRESCO+ [81,85] cloud retrievals using the NO 2 band of TROPOMI (available in the L2 NO 2 product). The PC scores of L λ ðR λ Þ were then derived using the precalculated eigenvectors from the training sets (L λ climatology), before being applied to the NN models.
The NO 2 C values determined from applying the high-NO 2 and low-NO 2 NN models, respectively, were first combined under both the clear and cloudy assumptions: This merging strategy was inferred from the 2D density plot of NO 2 C from both NN models, which presents a two-mode distribution ( Figure S6). One of the two clusters (i.e., Zone I of Figure S6) has retrievals <0.3 DU from the low-NO 2 NN while the NO 2 C predicted by the high-NO 2 NN is not covarying but scattered. This NO 2 C from the high-NO 2 NN in this regime is likely irrational considering most pixels are not over emission sources and should have low NO 2 C, and NO 2 C from the low-NO 2 NN is taken. A smaller number of NO 2 C near the other mode (Zone II) are jointly supported by the low-and high-NO 2 NN retrievals (i.e., >0.4 DU), and the NO 2 C from the high-NO 2 NN is selected due to its representativeness. We set a buffer NO 2 C (in DU) in Equation (5) to determine a narrow merge zone (i.e., Zone III in Figure S6). Over this regime, the two retrievals were considered indistinguishable and weightedaveraged, with the weights linearly varying between Zones I and II. Zone IV contains retrievals that both NN models predict NO 2 C out of its represented ranges, which rarely occurred in the results. Tests varying this buffer NO 2 suggested that a buffer of 0.08 DU resulted in the best agreement vs. ground-based (i.e., Pandonia and MAX-DOAS) measurements in Section 3.3, and this is adopted in the rest of the processing.
Our RT simulation did not explicitly consider cloud scattering and absorption between the bottom surface and TOA due to the strong opaqueness of clouds over the UV-Vis [86]; therefore, the NO 2 C from the NN model under both the clear-sky and fully cloudy assumption is "visibleonly" columns (i.e., total column above either the surface or the cloud layer). We merge these two retrievals based on the cloud radiance fraction (F r ) to generate the final NO 2 C amount: Equation (6) is similar to the processing in the L2 product, where the clear-sky and cloudy AMFs were assumed linearly weighted based on F r [81]. Although operational retrievals (e.g., the L2 NO 2 ) report total column above the surface under cloudy scenes, the information under the cloud layer is still contributed by the modeled a priori profile. As our purpose is to relax that reliance, we do not attempt to estimate the "ghost column" under cloud, rendering the NO 2 C from our approach a pure radiance-based retrieval.
The TROPOMI spectral responses vary subtly between detector rows (each with own SRFs). We generated NN models based on trainings using SRFs of every 50th detector row, and linearly interpolated the retrieved NO 2 C to the untrained rows following Hedelt et al. [62]. This interpolation approach was confirmed applicable as no artificial row-based striping was detected in the output daily NO 2 C maps.
3.3. Evaluation with the Pandonia Global Network. The Pandonia Global Network (PGN) is the main source of ground truth for our evaluation. PGN provides direct sunview measurements of total vertical column of various trace 9 Journal of Remote Sensing gases from ground-based standardized Pandora Sun photometers. The Pandora NO 2 C has a random error of~0.05 DU (1:35 × 10 15 molecules/cm 2 ) and has been widely applied to the validation of satellite NO 2 retrievals [82,87].
The PGN NO 2 C were collocated with TROPOMI NO 2 C, using conventional spatial and temporal proxy criteria (i.e., the closest pair within 30 min in time and 30 km in distance). All the available TROPOMI retrievals without quality filtering were applied to the collocation due to currently unavailable quality assignments in the NN retrievals (which requires longer-term data collection and more comprehensive evaluations). The evaluation conclusions were also consistent if only selecting high quality PGN data, except for the significantly reduced number of available retrievals under cloudy conditions (i.e., comparing Figure 5 and Figure S7). We present in Figure 5 those without the selection, consistent with the evaluations over China (Section 3.4) that are without available quality flags. Figure 5 shows the evaluation of the NN retrievals and the TROPOMI L2 NO 2 C vs. collocated PGN measurements. Both retrievals partially reproduce the range of PGN NO 2 C, with overestimations at low NO 2 C and underestimations at high NO 2 C. This was similarly observed in a more comprehensive evaluation of the L2 retrievals [82], and could be partially explained by the TROPOMI FOV that smears the Pandora measured local air mass. Interestingly, the two retrievals show similar distribution of performances over different regimes, as jointly determined by NO 2 C and geometric cloud fraction (F g ). Only over high NO 2 C (≥1:1 × 10 16 molecules/cm 2 ) and low cloudiness (F g < 0:2) conditions do both retrievals have relatively strong correlations vs. PGN (R 2 > 0: 48), whereas under the other conditions, both retrievals exhibit low correlations (R 2 < 0:16). The distributions of R 2 and normalized mean bias (NMB) from the two independent retrievals highlight the uncertainty of the model simulated a priori profile and indicate the profiles only marginally compensate for the weak NO 2 signal.
Without the aid of modeled NO 2 under clouds, the NN retrievals exhibit stronger degradation than the L2 NO 2 C under cloudy conditions (F g ≥ 0:2). The NN predicted NO 2 C also has weaker performances vs. L2 under low NO 2 C (i.e., <7:5 × 10 15 molecules/cm 2 ) because of the previously mentioned reduced vertical sensitivity (Section 2.3). We identified an "optimal NN regime" (ONR, NO 2 C ≥ 7:5 × 10 15 molecules/cm 2 and F g < 0:2, i.e., shaded in Figure 5), where the NN retrievals reveal robust ability Marker sizes reflect the number of pairs in each bin, and markers with a black outline indicate better performance (e.g., larger R 2 or smaller NMB) than the other retrieval. The shaded area is the "optimal NN regime." 10 Journal of Remote Sensing (i.e., close or slightly better R 2 ) similar to the L2 products, to reproduce the PGN NO 2 C. Within this regime, the NN retrievals have an overall R 2 of 75% and NMB of -33%, both comparable to the L2's R 2 of 77% and NMB of -29%, whereas out of the ONR, the overall R 2 of NN is <1% compared to the L2's R 2 of 17%. Therefore, for the ONR scenario, the extended spectral range provided the trained NN model additional observation-based information to resolve NO 2 C variability, that is competitive to a model simulation of a priori NO 2 profiles in L2. This extended spectral information confirmed the contribution to the NN retrieval of the radiances of the TRO-POMI Band 4 (401-495 nm, Figure S8). The reduced vertical sensitivity resulted in systematic underestimation of NO 2 C vs. PGN across almost all the bins, mainly due to the smaller number (~50%) of high NO 2 C (>1 × 10 16 molecules/cm 2 ) pixels. As we discussed in Section 2.3, the vertical sensitivity of the high-NO 2 NN model increases with the spectral width of included retrieval window (Section 2.3). Even the~11 nm spectral information in Band 3 is useful for the NN-based retrieval. We expect the retrievals using the spectral information adopted in L2 (405-465 nm) to perform worse than in Figure S8 if relying on a pure NN-based approach.

Evaluation with Complementary NO 2 Measurements over China.
The PGN network is mostly located over developed regions with smaller NO 2 C ( Figure S5). Since the NN retrievals reveal a strong dependence on NO 2 abundance, it is also meaningful to evaluate the applicability of the NN retrievals over developing areas with stronger NO 2 pollution (e.g., China).
Over wide areas of China without PGN sites, we used surface NO 2 concentration and MAX-DOAS tropospheric NO 2 column measurements for the evaluation of TROPOMI NO 2 C ( Figure S5). The hourly surface NO 2 concentrations are from the routine monitoring by the Ministry of Ecology and Environment (MEE) of China [88]. The MAX-DOAS measurements [41,89] are from three suburban stations (Beijing, Xuzhou and Nanjing, respectively, from North to South in Figure S5). Figure S9 shows the comparisons of the retrievals against MEE surface NO 2 over China. With a wider dynamic range of NO 2 C (i.e., 0 − 6 × 10 16 molecules/cm 2 ) more frequent occurrence of scenarios favorable for NN retrievals (i.e., within the ONR regime), the NN shows competitive performance vs. the L2 for resolving surface NO 2 variability. We observe similar R 2 for the two retrievals (0.43 and 0.42) in near clear-sky conditions (F g < 0:2). The NN underestimates surface NO 2 in many cases where the L2 does not, forming a discernable population ( Figure S9, right), which likely corresponds to cases where NO 2 is strongly concentrated near ground. The underestimation by the NN is consistent with the weaker overall signal of near-ground NO 2 ; thus, higher retrieved NO 2 C from the NN still provides a robust indicator of its data quality. For cases with high TROPOMI NO 2 C, the R 2 of NN retrievals surpasses that of the L2 product (e.g., above the threshold NO 2 C ≥ 6 × 10 15 molecules/cm 2 , Figure S10). Differences between the two retrievals increase with the NO 2 C threshold. This reveals the meaningful vertical sensitivity in the TROPOMI radiances penetrating into the boundary layer that gradually becomes advantageous against the modeled profiles used in L2 for more polluted cases. The NN retrievals are thus especially applicable for estimation of surface NO 2 over China, and other conditions with very high NO 2 amounts. Figure S11 shows the comparison against MAX-DOAS tropospheric NO 2 columns. Under F g < 0:2, both retrievals are highly correlated (R 2 > 0:75) with the MAX-DOAS measurements and are systematically biased low (although the MAX-DOAS data only represents the tropospheric NO 2 C). Nonetheless, the NN exhibits less underestimation than the L2 product. We found that switching the L2 dataset from total NO 2 C to tropospheric NO 2 column slightly deteriorated the L2 performance. Figure 6 presents maps of retrieved NO 2 C over the East Asia domain, from the two independent retrievals. There are consistent spatial distributions of NO 2 C from both products under near clear-sky conditions (i.e., F g < 0:2). NO 2 C enhancements are noticed over anthropogenic polluted and heavily industrialized regions, including the North/Northeast China Plain, Inner-Mongolia and Shanxi coal industry center, Fenwei Plain, Sichuan Basin, Yangtze River Delta, Pearl River Delta, Wuhan, Seoul, and Tokyo Metropolitan regions; the rest of the domain is dominated by background NO 2 C. The enhanced regions in the NN retrievals ("visibleonly" column) are more localized than the L2 ("full" column) product (both with F g < 0:2). The background NO 2 C as determined by the NN retrieval are 1-2 × 10 15 molecules/cm 2 higher than the L2 product, consistent with its weaker performance and overestimation of NO 2 C at NO 2 C < 5:5 × 10 15 molecules/cm 2 ( Figure 5).
The rightmost panels of Figure 6 show the comparison of all collocated daily NO 2 C from both products. Two clusters are notable in the scatterplots, one with smaller F g and tight correlations around the 1-1 line, another with enhanced cloud presence and weakened covariation between the two retrievals (where the visible-only NO 2 C from NN is systematically lower). For cases with F g < 0:2, the two products show higher correlations over more severely polluted months (i.e., March and December, R 2 > 0:65) than relatively cleaner months (i.e., September, R 2 < 0:35). Such enhanced correlation at higher NO 2 concentration again confirms the inherently stronger reliability of both retrievals under polluted and near clear-sky conditions ( Figure 5).
In summary, the evaluation of both TROPOMI NO 2 C retrievals over China confirms that NN retrievals are accurate over polluted regions, with promising applicability to developing areas around the world.

Discussion and Conclusion
It has been proposed that extending the spectral range of NO 2 retrievals introduces additional vertical sensitivities [30,55,57], which might facilitate the retrievals without 11 Journal of Remote Sensing modeling the a priori profile-the most time-consuming step. In this study, we used NN as a data-driven tool to investigate this issue from a new perspective and quantitatively tested this idea. To the best of our knowledge, this is the first attempt at a satellite NO 2 retrieval of this form. Our main findings are as follows: (1) The NN training experiments confirmed the existence of vertical sensitivity in the satellite radiances utilizing a broader spectral range. The retrieval sensitivity exhibits qualitative contrasts between highand low-NO 2 cases, indicating a necessity of separate considerations (2) The retrieval sensitivities have a reasonable spectral distribution consistent with the relative strength of NO 2 absorption vs. other interfering trace gases, resulting in an optimal retrieval window of 390-495 nm for TROPOMI (3) Despite many sources of variabilities in the inputs of the RT simulation, only the key factors (observing geometry, surface reflectivity, surface altitude, and several key PC scores of satellite radiances) significantly contribute to the NO 2 C prediction in the optimized NN (4) An application and evaluation of the NN model to TROPOMI reveals that NO 2 C from the NN retrievals have competitive accuracy relative to the L2 product. The NN retrievals resolve NO 2 C variation under less cloudy and more polluted scenarios (NO 2 C ≥ 7:5 × 10 15 molecules/cm 2 and F g < 0:2). In other environments, both the NN and L2 retrievals show distinctive degradation, and the NN retrievals without a priori profiles become less reliable than L2. These findings are consistent with the theoretical variation of retrieval sensitivity, as is also revealed in the NN training experiments Over populated areas, stratospheric NO 2 of 2-5 × 10 15 molecules/cm 2 are persistent [32,33]. Therefore, this study indicates that the NN retrieval can track tropospheric NO 2 pollution of as low as~3 × 10 15 molecules/cm 2 at similar precision as the L2 data under clear-sky conditions, without the need to simulate a priori NO 2 profiles. This conclusion is especially promising for future retrievals using geostationary satellite observations over polluted areas, where the data volumes per day are over 10 times that of polar-orbiting satellites. The wall time for a retrieval using the NN and a 30-core node for 1-month hourly NO 2 C at~3 km resolution (typical resolution of future  Figure 6: Maps of monthly NO 2 C from the NN retrievals and from the TROPOMI L2 operational product over East Asia, with each row representing one month of the retrieval experiment. Only cases with F g < 0:2 are included in the NN retrievals (first column), while the L2 products were derived from both near clear-sky (second column) and all-sky cases (third column). The rightmost column shows the scatter plots of all the collocated daily NO 2 C from both products, color-coded by the corresponding F g , with the dashed 1-1 line and the R 2 value for cases with F g < 0:2.
Geostationary instruments) and over the domain shown in Figure 6 is~20 hours,~12 times faster than the time required for model simulation of the a priori profile (~10 days). The NN time is dominated by the wavelength calibration (Section 3.1).
For an efficient and effective retrieval for geostationary instruments such as TEMPO [27] or GEMS [26], additional research is needed to improve the machine learning-based NO 2 retrievals. Key ideas to be explored include: (1) Extend the training set to represent additional realworld scenarios. The fixed stratospheric NO 2 profile, aerosol optics, and tropopause height in the RT simulations could be better customized to potentially improve performance in regions and seasons where the assumptions were less sound. More sophisticated sampling of RT input variables (e.g., from model simulation over the retrieval domain) other than random sampling should be explored to increase the representativeness. In addition, the separate retrievals assuming clear and fully cloudy conditions could be improved by directly simulating the observed radiances under partial cloudy conditions. Alternative forms of NN structure, such as replacing the radiance PC scores with SCDs after spectral fitting should also be investigated to evaluate which contains more useful information. Finally, the emergence of additional ground-based NO 2 C sites and data will facilitate an observation-based training set, as an alternative to RT simulations (2) Explore complementary datasets and methods where pure NN-based retrievals are less favorable (i.e., out of the ONR). This will enhance the potential utility of the retrieval products. Adding a climatological background NO 2 or a simulated ghost column, for example, could strengthen the robustness of the retrievals. To separate the stratospheric/tropospheric NO 2 columns, training additional NNs for remote/ oceanic scenes or exploring the application of similar L2 stratospheric/tropospheric separation approach should be investigated (3) The vertical sensitivity over a wide spectral range indicates the possibility of retrieving NO 2 vertical location under heavily polluted conditions. This possibility should be further exploited, and will be valuable for studies of lightning, biomass burning, and the vertical variation of tropospheric chemistry and for evaluating the a priori profiles in operational retrievals

Data Availability
The neural network retrieval code with necessary instructions is available at GitHub (https://github.com/ChiLi90/ ANNNO2). One-day of input data for testing the code is available at https://mega.nz/file/c10wGKJT#GkY6_ HCDLIM88T5vlO7P26bL9oc53Fj7N0z3-oGKp58. Table S1. Information of used dataset and variables. Figure  S1. Surface spectral reflectance (Rλ) climatology used in this study Figure S2. Number of cases for subgroups in the training dataset Figure S3. Similar to Figure 2 but for the root-mean-square-error Figure S4. Cross-validation R2 for different retrieval windows Figure S5. The boundaries of 3 retrieval domains and used surface sites Figure S6. Illustration of the merging method of NO2C Figure S7. Similar to Figure 5 but limiting to cases with high quality PGN measurements. Figure S8. Similar to Figure 5 but from a separate NN retrieval only using TROPOMI Band 4 Figure S9. Density plot of collocated TROPOMI NO2C and surface NO2 over China Figure S10. Evolution of R2 of TROPOMI NO2C vs. surface NO2 over China Figure S11. Scatter plot of TROPOMI NO2C and MAX-DOAS tropospheric NO2 column. (Supplementary Materials)