Decentralized Distributed Deep Learning with Low-Bandwidth Consumption for Smart Constellations

For space-based remote sensing systems, onboard intelligent processing based on deep learning has become an inevitable trend. To adapt to the dynamic changes of the observation scenes, there is an urgent need to perform distributed deep learning onboard to fully utilize the plentiful real-time sensing data of multiple satellites from a smart constellation. However, the network bandwidth of the smart constellation is very limited. Therefore, it is of great significance to carry out distributed training research in a low-bandwidth environment. This paper proposes a Randomized Decentralized Parallel Stochastic Gradient Descent (RD-PSGD) method for distributed training in a low-bandwidth network. To reduce the communication cost, each node in RD-PSGD randomly transfers only part of the information of the local intelligent model to its neighborhood. We further speed up the algorithm by optimizing the programming of random index generation and parameter extraction. For the first time, we theoretically analyze the convergence property of the proposed RD-PSGD and validate its advantage by simulation experiments on various distributed training tasks for image classification on different benchmark datasets and deep learning network architectures. The results show that RD-PSGD can effectively save the time and bandwidth cost of distributed training and reduce the complexity of parameter selection compared with the TopK-based method. The method proposed in this paper provides a new perspective for the study of onboard intelligent processing, especially for online learning on a smart satellite constellation.


Introduction
With the breakthrough development of artificial intelligence and the rapid improvement of onboard computing and storage capabilities, it is an inevitable trend for remote sensing satellite systems to directly generate the information required by users through intelligent processing onboard [1,2]. As earth observation scenes usually present highly dynamic characteristics, the traditional working mode of training on the ground and predicting onboard cannot satisfy users' requirements for real-time and accurate perception. There is an urgent need to learn and update the intelligent model onboard to adapt to the dynamic changes of the scenes.
Affected by factors such as satellite orbits, payloads, physical characteristics of target objects, and imaging methods, more and more intelligent tasks, such as emergency observation of disaster areas and the search for the missing Malaysia Airlines flight, require the cooperation of multiple satellites. Therefore, relying only on the observation data of a single satellite makes it difficult to achieve precise learning of the global intelligent interpretation model for these cooperative tasks.
Benefiting from the development of satellite technology and the reduction of satellite development costs, the number of satellites in orbit has increased sharply and intersatellite networks have gradually been established, which lays the foundation for multisatellite collaboration, or a smart constellation. Based on this collaborative working mode, it becomes possible to integrate the real-time sensing data and computing capabilities of multiple satellites through distributed deep learning technology. Compared with learning the intelligent model on only one satellite, distributed deep learning can achieve the overall optimization and global convergence of the intelligent model without global information or human intervention and thus improve the collaborative perception and cognitive capabilities of the space-based remote sensing system. Depending on how the tasks are parallelized across satellites, distributed training can be divided into two categories: model parallelism and data parallelism [3]. Model parallelism means training different parts of a network with multiple workers, which is mainly used for training very large models [4,5]. In contrast, data parallelism refers to the strategy of partitioning the dataset into smaller splits [6] or collecting data on different devices independently, which is the scenario we study here.
However, due to the particularity of the operating environment of the satellites, which is different from the cluster system on the ground, the network bandwidth of the smart constellation is often very limited. Therefore, it is of great significance and practical urgency to develop distributed deep learning research under a low-bandwidth environment. To deal with this problem, the traditional distributed training methods can be improved from two aspects.
The first aspect is to use decentralized network structures [7][8][9]. In the traditional centralized network structure, all nodes need to transmit their trained parameters or gradients of the intelligent model to the central server, waiting for the parameter or gradient fusion, and then receive the fused parameters or gradients from the central server. Instead, the decentralized network structure removes the central parameter server and allows all nodes to exchange parameters or gradients with adjacent nodes. In this way, the pressure of communication can be shared with each node to avoid congestion and improve the real-time capability of distributed training.
The second aspect is to reduce data transmission and save bandwidth. This can be achieved by communication delay, quantization, and sparsification. These techniques can be used independently or in combination to build a comprehensive distributed training framework, such as sparse binary compression [10]. Communication delay means communicating after training several batches locally instead of after every batch, which reduces the frequency of communication. This technique is used in Local SGD (Stochastic Gradient Descent) [11,12], federated averaging [13], and federated learning [14]. Quantization means using low-precision values to replace the original precise parameters. For example, QSGD (Quantized Stochastic Gradient Descent) [15] adjusts the number of bits sent per iteration to smoothly trade off communication bandwidth against convergence time. The TernGrad approach [16] requires only three numerical levels {−1, 0, 1}, which aggressively reduces communication time. DoReFa-Net [17] stochastically quantizes gradients to low-bitwidth numbers. This paper mainly focuses on using the sparsification technique to overcome the communication bottleneck in a low-bandwidth environment. In sparsification methods, only part of the network parameters or gradients is sent. For example, Alistarh et al. [18] proposed sorting the gradients in decreasing order of magnitude and truncating the gradient to its top K components; they proved the convergence of this TopK-based method analytically. Deep gradient compression [19] also uses the gradient magnitude as a simple heuristic for importance and employs momentum correction, local gradient clipping, momentum factor masking, and warm-up training to preserve accuracy. Tsuzuku et al. [20] used the variance of gradients as a signal for compression. AdaComp [21] adaptively tunes the compression rate based on local gradient activity.
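As a concrete reference point, the TopK sparsification step shared by these methods [18,19] can be sketched as follows. This is a minimal NumPy illustration of the baseline idea, not the implementation of any cited paper; the function names are ours:

```python
import numpy as np

def topk_sparsify(grad, k):
    """Keep only the k largest-magnitude entries of a gradient vector.

    Returns the indices and values to transmit; all other entries are
    implicitly zero on the receiving side."""
    # argpartition avoids a full sort: O(N) instead of O(N log N)
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

def densify(idx, vals, n):
    """Rebuild the dense gradient from the transmitted sparse form."""
    out = np.zeros(n)
    out[idx] = vals
    return out
```

Even with `argpartition`, the magnitude screening touches all N entries every iteration, which is the cost RD-PSGD avoids.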
Amiri and Gündüz [22] considered the physical-layer aspects of wireless communication and proposed an analog computation scheme, A-DSGD (Analog Distributed Stochastic Gradient Descent).
We notice that all these methods choose the magnitude or variance as the indicator of importance, sort the gradients by importance, and then truncate them to the top K components. For a deep neural network with millions to billions of parameters, this process can be time-consuming due to its high complexity. In this paper, a novel method named RD-PSGD (Randomized Decentralized Parallel Stochastic Gradient Descent) is proposed to reduce communication bandwidth by parameter sparsification. Unlike existing methods utilizing TopK sparsification, in each iteration we select the parameters to be transferred in a random way, which greatly reduces the complexity of parameter screening. We prove, by both theoretical and experimental analysis, that this strategy still guarantees convergence, and we optimize the programming to fully leverage the advantage of random parameter sparsification.
The remainder of this paper is organized as follows: Section 2 proposes the RD-PSGD method for smart satellite constellations, and Section 3 presents the programming optimization for the proposed method. Section 4 validates our method by experiments. Conclusions and future works are presented in Section 5.

Methodology
In this section, we first introduce the distributed training framework for smart satellite constellations. Then, based on the framework, we briefly review a classic distributed deep learning training method, namely, the D-PSGD (Decentralized Parallel Stochastic Gradient Descent) method [7]. Lastly, motivated by the analysis of the communication complexity of D-PSGD, we propose our RD-PSGD method, which is more suitable for a low-bandwidth satellite constellation environment.

Distributed Training Framework for Smart Satellite Constellations.
The distributed training framework for smart satellite constellations is shown in Figure 1. In the framework, each satellite collects remote sensing images in real time and stores them locally. Besides, each satellite is equipped with an intelligent model to perform a certain perception or cognitive task, such as object detection or scene classification on the collected remote sensing images. If every satellite keeps its intelligent model unchanged, it cannot deal with remote sensing images of dynamic scenes or objects; thus, it is necessary to learn and update the intelligent model onboard. However, training the intelligent model only on its own local dataset can hardly achieve overall optimization and global convergence. Instead, for distributed training on a satellite constellation, multiple satellites can be connected by intersatellite links to form a communication network. A satellite is called a worker node in this network. A node not only trains the model using its own dataset but also exchanges and averages model parameters with adjacent nodes. In this way, the perception and cognitive capabilities of the satellite constellation can be fully utilized.
In this paper, we assume that the network is of a fixed ring structure, and there is no centralized parameter server in the system. As we mentioned earlier, this design can effectively avoid congestion of communication. However, the RD-PSGD method proposed later can be easily applied to other network structures.
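For the fixed ring assumed here, one common choice of mixing matrix, used below purely for illustration, gives each node weight 1/3 for itself and for each of its two neighbors:

```python
import numpy as np

def ring_weight_matrix(n):
    """Doubly stochastic weight matrix W for a ring of n nodes: each
    node averages itself and its two neighbours with weight 1/3."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 1 / 3
        W[i, (i - 1) % n] = 1 / 3  # left neighbour (wraps around)
        W[i, (i + 1) % n] = 1 / 3  # right neighbour (wraps around)
    return W
```

This W is symmetric with unit column sums, and its second-largest eigenvalue magnitude is strictly below 1, which is the spectral condition the convergence analysis below relies on.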

A Review of D-PSGD.
The distributed training method proposed in this paper is built upon D-PSGD [7], which is a very popular decentralized distributed deep learning technique. Lian et al. [7] proved the convergence of D-PSGD and showed that D-PSGD outperforms centralized algorithms. D-PSGD [7] considers the following stochastic optimization problem:

min_{x ∈ ℝ^N} f(x) ≔ E_{ξ∼D} F(x; ξ), (1)

where D is the training dataset and ξ is a data sample. x ∈ ℝ^N denotes the serialized parameter vector of an intelligent model with a specified deep learning network architecture, which is usually a convolutional neural network (such as ResNets [23]), and F(x; ξ) is a predefined loss function. The above optimization problem can be efficiently and effectively solved by the SGD algorithm [24].
To design parallel SGD algorithms on a decentralized network, the data are distributed onto all nodes such that the original objective defined in (1) can be rewritten as

min_{x ∈ ℝ^N} f(x) = (1/n) ∑_{i=1}^n E_{ξ∼D_i} F(x; ξ). (2)

There are two ways to distribute D: shared data, where D_i = D; and local data with the same distribution, i.e., each D_i has the same distribution as D, which is the setting used in this paper. We use the following notation:

(ix) X_k ≔ [x_{k,1}, x_{k,2}, ⋯, x_{k,n}] ∈ ℝ^{N×n} denotes the concatenation of the local parameter vectors at iteration k.

(x) W ∈ ℝ^{n×n} denotes the weight matrix, i.e., the network topology, satisfying (i) W_{ij} ∈ [0, 1] and (ii) ∑_j W_{ji} = 1. We use Deg(W) to denote the degree of network W. For a ring-structured network, W_{ij} = 1/3 if j ∈ {i − 1, i, i + 1} (mod n) and W_{ij} = 0 otherwise.

(xi) P_α ∈ ℝ^{N×N} denotes the matrix Diag(a), where a ∈ ℝ^N is a vector of independent Bernoulli random variables, each of which equals 1 with probability α.

(xii) B(n, α) denotes the binomial distribution.

With these definitions and notations, D-PSGD is given in Algorithm 1.
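In matrix form, one D-PSGD iteration across all n nodes can be sketched as follows; this is an illustrative NumPy snippet under the notation above, with `grads` standing in for the column-stacked stochastic gradients, not the onboard implementation:

```python
import numpy as np

def dpsgd_step(X, W, grads, lr):
    """One global D-PSGD iteration: column i of X holds node i's
    parameter vector x_{k,i}. Each node first averages its neighbours'
    parameters through the mixing matrix W, then applies its local
    stochastic gradient step."""
    return X @ W - lr * grads
```

Setting W to the identity recovers n independent SGD runs; a mixing matrix with off-diagonal mass is what couples the nodes toward consensus.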

RD-PSGD.
It is easy to check that the communication complexity of D-PSGD is O(N · Deg(network)). However, in a network with low communication capacity, D-PSGD may suffer from latency. Here, we introduce a random transferring technique to reduce communication, named RD-PSGD (Randomized Decentralized Parallel Stochastic Gradient Descent). Specifically, in the process of model synchronization with adjacent nodes, only a part of the parameters of the intelligent model is randomly selected for transmission, which reduces both the bandwidth cost and the complexity of parameter filtering compared with the TopK-based methods [18,19]. The details are stated in Algorithm 2. We now prove the convergence of RD-PSGD. Firstly, as in D-PSGD, we make a commonly used assumption on the weight matrix.

Assumption 1. W is a symmetric doubly stochastic matrix with ρ_W ≔ (max{|λ_2(W)|, |λ_n(W)|})² < 1.
From a global view, at the kth iteration, Algorithm 1 can be viewed as

X_{k+1} = X_k W − γ ∂F(X_k; ξ_k),

where ∂F(X_k; ξ_k) ≔ [∇F_1(x_{k,1}; ξ_{k,1}), ⋯, ∇F_n(x_{k,n}; ξ_{k,n})] ∈ ℝ^{N×n}. To prove the convergence of D-PSGD, a critical property of the weight matrix W is needed, i.e., Lemma 5 in the original publication of D-PSGD [7], which we reformulate as follows:

Lemma 2. Under Assumption 1, for ∀X ∈ ℝ^{N×n}, i ∈ {1, 2, ⋯, n}, k ∈ ℕ, we have ‖X1_n/n − XW^k e_i‖² ≤ ρ_W^k ‖X‖_F².

Similarly, at the kth iteration, Algorithm 2 can be viewed as

X_{k+1} = G_{α,W}(X_k) − γ ∂F(X_k; ξ_k).

Denote the neighborhood weighted average operator as

G_{α,W}(X) ≔ P_α X W + (I − P_α) X, (6)

with α the sparsity ratio (α ∈ (0, 1]). Then, each node only needs to transfer the information specified by P_α, and the communication complexity becomes almost O(αN · Deg(network)). When α < 1, the communication complexity is reduced. Denote by G^k_{α,W} the k-fold composition of G_{α,W}. Compared with D-PSGD, we need to prove a property of G_{α,W} similar to that of W in Lemma 2 to complete the convergence proof of RD-PSGD:

Lemma 3. Under Assumption 1, for ∀X ∈ ℝ^{N×n}, i ∈ {1, 2, ⋯, n}, k ∈ ℕ, we have ‖X1_n/n − G^k_{α,W}(X)e_i‖² ≤ ρ_W^s ‖X‖_F² with probability at least [P(Y ≥ s)]^n, where Y ∼ B(k, α).

Algorithm 1: D-PSGD on the ith node.
Input: initial parameter guess x_{0,i} = x_0, learning rate γ, weight matrix W, and maximum iteration K. Output: x_{K,i}.
1: for k = 0, 1, 2, ⋯, K do
2: Randomly sample ξ_{k,i} from the local dataset D_i;
3: Compute the gradient at the current parameters: ∇F_i(x_{k,i}; ξ_{k,i});
4: Compute the neighborhood weighted average by fetching parameters from neighbors: x_{k+1/2,i} = ∑_{j=1}^n x_{k,j} W_{ji};
5: Update x_{k+1,i} = x_{k+1/2,i} − γ∇F_i(x_{k,i}; ξ_{k,i});
6: end for

Algorithm 2: RD-PSGD on the ith node.
Input: initial parameter guess x_{0,i} = x_0, learning rate γ, weight matrix W, and maximum iteration K. Output: x_{K,i}.
1: for k = 0, 1, 2, ⋯, K do
2: At the 1st node, construct a vector a_k, each entry of which equals 1 with probability α, and then transfer this vector to the other nodes;
3: Randomly sample ξ_{k,i} from the local dataset D_i;
4: Compute the gradient at the current parameters: ∇F_i(x_{k,i}; ξ_{k,i});
5: Compute the neighborhood weighted average by fetching the parameters indicated by a_k from neighbors: x_{k+1/2,i}(s) = ∑_{j=1}^n x_{k,j}(s) W_{ji}, s ∈ T(a_k);
6: Keep the remaining parameters unchanged: x_{k+1/2,i}(s) = x_{k,i}(s), s ∉ T(a_k);
7: Update x_{k+1,i} = x_{k+1/2,i} − γ∇F_i(x_{k,i}; ξ_{k,i});
8: end for
Space: Science & Technology
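The randomized averaging of Algorithm 2 can be sketched from the same global view. This is an illustrative NumPy snippet under our notation (the mask a_k is shared by all nodes, as in step 2), not a distributed implementation:

```python
import numpy as np

def rdpsgd_mix(X, W, alpha, rng):
    """Randomized neighbourhood averaging G_{alpha,W}: only the
    coordinates selected by the shared Bernoulli(alpha) mask a_k are
    averaged with the neighbours; unselected coordinates stay local.
    Equivalent to P_a X W + (I - P_a) X with P_a = diag(a_k)."""
    a_k = rng.random(X.shape[0]) < alpha  # shared random index vector
    mixed = X.copy()
    mixed[a_k] = X[a_k] @ W               # average selected rows only
    return mixed, a_k
```

At α = 1 every coordinate is averaged and the step coincides with D-PSGD; at α = 0 no communication happens and each node keeps its local parameters.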
Proof. Denote by a_k the random vector used to construct P_α in the kth operator G_{α,W} of G^k_{α,W}. Let t_i ≔ [a_1(i), a_2(i), ⋯, a_k(i)] ∈ ℝ^k, each entry of which indicates whether the ith entry of the local parameter vector is averaged at the corresponding iteration of Algorithm 2. Thus, t_i is a vector of independent Bernoulli random variables. Denote X̂_k = G^k_{α,W}(X), denote X̂_{k,i} the ith row of the matrix X̂_k, and denote X^i the ith row of the matrix X. By (6) and the definitions of P_α, a_k, and t_i, the ith row of X is mixed by W exactly once for each 1-entry of t_i. Let W_∞ ≔ lim_{k→∞} W^k. Since W is doubly stochastic and ρ_W < 1, we have 1_n/n = W_∞ e_i, ∀i. Then,

‖X1_n/n − X̂_k e_i‖² ≤ ρ_W^{min_i Y_i} ‖X‖_F²,

where Y_i ∼ B(k, α) denotes the number of 1s in t_i. Since ρ_W < 1, if Y_i ≥ s for all i, then ρ_W^{min_i Y_i} ≤ ρ_W^s. Thus, combined with the independence of {Y_i}, the bound of Lemma 3 holds with probability at least P(min_i Y_i ≥ s) = [P(Y ≥ s)]^n, where Y ∼ B(k, α). This ends the proof.

Denote by ρ_β the decay rate that Lemma 3 guarantees with probability at least β. Empirically, ρ_β decreases rapidly with the number of iterations k. Assuming that the average speed of the intersatellite link is 2 MB/s, the effect of utilizing RD-PSGD is shown in Table 1. It can be seen that RD-PSGD reduces the bandwidth and time cost linearly with α and thus makes distributed training in a low-bandwidth environment practical.

Programming Optimization
The refinement of the proposed RD-PSGD algorithm over D-PSGD also introduces additional programming overhead. For the D-PSGD method, a full cycle of parameter transmission for model synchronization, i.e., step 4 in Algorithm 1, can be divided into three parts: serialization of the parameters of the intelligent model (t_serial), communication of the parameters (t_comm), and deserialization of the parameters to recover the deep learning network structure of the intelligent model (t_deserial); namely, the time cost of each cycle of parameter transmission for D-PSGD is

t_D-PSGD = t_serial + t_comm + t_deserial. (17)

For RD-PSGD, aiming at low-bandwidth communication of the parameters (t′_comm), extra steps are needed: generation and transmission of the random index that indicates which parameters need to be transmitted (t_rand), i.e., step 2 in Algorithm 2; extraction of the parameters to be transmitted according to the random index (t_extract); and expansion of the extracted sparse parameters into dense network parameters (t_expand), in steps 5 and 6 of Algorithm 2. Namely, the time cost of each cycle of parameter transmission for RD-PSGD is

t_RD-PSGD = t_rand + t_serial + t_extract + t′_comm + t_expand + t_deserial. (18)

The difference between t_D-PSGD and t_RD-PSGD is then

t_D-PSGD − t_RD-PSGD = (t_comm − t′_comm) − (t_rand + t_extract + t_expand). (19)

Therefore, RD-PSGD has a lower time cost only if

t_rand + t_extract + t_expand < t_comm − t′_comm. (20)

Equation (19) shows that (i) the generation and transmission of the random index and the extraction and expansion of the parameters of the intelligent model should be optimized as far as possible to give full play to the acceleration effect of RD-PSGD and (ii) the lower the network bandwidth, the larger t_comm − t′_comm and thus the more obvious the acceleration. In order to improve the acceleration effect of RD-PSGD, we optimize the programming from two aspects.
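The acceleration condition above, that the sparsification overhead must be smaller than the communication time saved, amounts to a one-line check. The timings in the usage note are hypothetical placeholders, not measurements from this paper:

```python
def rdpsgd_is_faster(t_rand, t_extract, t_expand, t_comm, t_comm_sparse):
    """RD-PSGD wins only when the sparsification overhead (index
    generation + extraction + expansion) is smaller than the
    communication time it saves."""
    return (t_rand + t_extract + t_expand) < (t_comm - t_comm_sparse)
```

For example, on a slow link where dense synchronization takes 10 s per cycle but sparse synchronization only 1 s, overheads of a few tenths of a second are easily amortized; on a fast link where the saving shrinks to fractions of a second, the same overheads can dominate.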
Random index generation. We first examine step 2 of Algorithm 2, in which the random index vector a_k is generated from a Bernoulli distribution with probability α. Suppose the total number of parameters of the intelligent model is N; this direct approach has to perform N random number generations and threshold comparisons regardless of α. Since a_k is a binary vector, denote the indices of its elements with value 1 by A_1k = {s | a_k(s) = 1}, whose expected size is αN. The differences between adjacent elements of A_1k form the vector A_1k,diff, whose elements obey a geometric distribution. Therefore, if we transform the random index vector a_k into A_1k,diff, we need only perform about αN random number generations. Another advantage of transforming a_k into A_1k,diff is the reduced bandwidth cost when the sparsity ratio α is small. For example, when α = 0.05, an 8-bit integer is typically enough to represent the value of each element in A_1k,diff. In total, A_1k,diff occupies about 0.05 × 8N = 0.4N bits of data, which is less than the N bits of a 0-1 Boolean representation.
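The gap-based generation described above can be sketched as follows; this is a minimal NumPy version, and the function name is ours:

```python
import numpy as np

def random_indices_via_gaps(n, alpha, rng):
    """Sample the positions of the 1-entries of a Bernoulli(alpha)
    mask of length n by drawing the gaps between successive 1s from a
    geometric distribution: ~alpha*n draws instead of n."""
    indices = []
    pos = -1
    while True:
        # geometric(alpha) is the 1-based number of trials until the
        # first success, i.e. exactly the gap to the next selected index
        pos += rng.geometric(alpha)
        if pos >= n:
            break
        indices.append(pos)
    return np.asarray(indices, dtype=np.int64)
```

The returned array is exactly the set A_1k in sorted order; storing the gaps (`np.diff`) instead of the absolute indices gives the compact A_1k,diff encoding discussed above.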
Parameter extraction. Regarding the serialization, deserialization, extraction, and expansion operations in steps 5 and 6 of Algorithm 2, a naive implementation relies on a series of time-consuming for-loop operations. To speed up RD-PSGD, the built-in functions of NumPy and PyTorch are used as much as possible to take advantage of dedicated CPU and GPU acceleration for vector and tensor operations.
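A sketch of the vectorized extraction and expansion on a serialized parameter vector (NumPy shown here; PyTorch tensor indexing behaves analogously, and the function names follow our timing decomposition rather than any library API):

```python
import numpy as np

def extract(flat_params, idx):
    """t_extract: one fancy-indexing call replaces a Python for-loop
    over millions of parameters."""
    return flat_params[idx]

def expand(local_flat, idx, received):
    """t_expand: scatter the received sparse values back into the
    local dense parameter vector; unselected entries keep their
    current local values."""
    out = local_flat.copy()
    out[idx] = received
    return out
```

Both calls dispatch to compiled gather/scatter kernels, which is where the order-of-magnitude speed-ups reported in Section 4 come from.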

Experiments
We evaluate our RD-PSGD method on several distributed training tasks for image classification on different benchmark datasets and deep learning network architectures by ground simulation. Specifically, we study the performance of ResNet-20 [23] and VGG-16 [25] on CIFAR-10 [26] and ResNet-50 [23] on ImageNet-1k [27]. In our experiments, we test RD-PSGD on a ring-structured network consisting of 8 worker nodes, each of which is simulated by a workstation with an RTX 3070 GPU. The dataset is randomly split into 8 subsets to simulate the different data collected by each satellite. The algorithm is implemented in PyTorch with Gloo as the communication backend. Models are trained using SGD with momentum and weight decay on each node. The hyperparameter setup is as follows: We analyze the convergence and bandwidth-saving effect of RD-PSGD, evaluate the acceleration effect of programming optimization, and compare the time cost with the TopK-based methods [18,19]. Figures 3-5 show the convergence of the loss function and prediction accuracy of different models with different sparsity ratios using RD-PSGD. As shown in Figure 3, training ResNet-20 on CIFAR-10 achieves convergence under different sparsity ratios with no accuracy loss. When the sparsity ratio is 0.1, i.e., the transmitted model parameters are reduced by a factor of 10, the training accuracy still reaches more than 90%. A similar phenomenon also presents when training VGG-16 on CIFAR-10 and ResNet-50 on ImageNet-1k, as shown in Figures 4 and 5. These results demonstrate that the proposed RD-PSGD method can converge on different distributed training tasks under different sparsity ratios.

Bandwidth Cost.
The bandwidth cost of one full cycle of transmitting ResNet-50 from one node to another is shown in Figure 6. When the sparsity ratio is close to 1, the bandwidth cost of RD-PSGD is higher than that of D-PSGD, because an extra vector containing the indices of the parameters is transmitted. Below the critical value of around 0.8, as the sparsity ratio continues to decrease, the bandwidth cost decreases approximately linearly.
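A back-of-the-envelope model reproduces this crossover. Assuming, purely for illustration, 32-bit parameter values and an 8-bit gap-encoded index per transmitted parameter (neither figure is stated for this experiment), the relative bandwidth is:

```python
def relative_bandwidth(alpha, value_bits=32, index_bits=8):
    """Bits sent by RD-PSGD relative to dense D-PSGD: each of the
    alpha*N selected parameters costs its value plus an index entry.
    Bit widths here are illustrative assumptions, not measurements."""
    return alpha * (value_bits + index_bits) / value_bits
```

With these assumed widths the break-even point sits exactly at α = 0.8 (0.8 × 40/32 = 1), consistent with the critical value observed in Figure 6, and the cost then falls linearly in α.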

Programming Optimization.
We evaluate the time cost of training ResNet-50 on ImageNet for one epoch using different methods. Table 2 shows that, without programming optimization, the time cost of RD-PSGD is even higher than that of D-PSGD. After the programming optimization is applied, the average time cost of random index generation drops from 0.432 s to 0.056 s, a speed-up of about 8×. Meanwhile, the average time cost of parameter extraction and expansion drops from 12.855 s to 0.431 s, a speed-up of about 30×. The overall speed-up effect of programming optimization is also shown in Table 2. As we mentioned earlier, the lower the bandwidth, the more obvious the acceleration. To verify this, we use the Trickle tool to limit the bandwidth to no more than 200 kb/s. We define the time cost of parameter synchronization in one epoch as the whole time cost minus the time needed for gradient calculation and backpropagation. The results in Table 3 show that the speed-up ratio is indeed higher when the bandwidth is lower and correlates more strongly with the sparsity ratio. Table 4 shows the time cost of parameter extraction of the TopK-based methods [18,19] and RD-PSGD at sparsity ratios 0.1 and 0.5. The results indicate that RD-PSGD accelerates parameter extraction by 4×~7× compared with the TopK-based methods, by selecting the parameters to be transmitted randomly rather than by sorting.

Conclusion and Future Work
This paper proposed RD-PSGD, a decentralized distributed training algorithm with low-bandwidth consumption for a smart constellation, which randomly selects a part of the model parameters to transmit. We proved the convergence of this algorithm theoretically and optimized the programming to further speed up its practical application. The experimental results show that the convergence and acceleration requirements in a low-bandwidth environment can be met and that the algorithm outperforms the TopK-based method in parameter extraction, which makes it a promising method for future distributed deep learning on a space-based remote sensing system. The work in this paper can be improved in the future. Firstly, the algorithm is tested on distributed training tasks with labeled datasets, while the data used for onboard training are usually unlabeled. The algorithm can be extended to semisupervised or unsupervised training. Secondly, our current experiment is conducted in a ground cluster environment with a fixed network topology and homogeneous nodes, using software to simulate the low-bandwidth intersatellite network. We will study the performance of the algorithm in a dynamic heterogeneous network and carry out onboard verification and the corresponding engineering optimization research in the future.