Data-Driven Machine Learning Techniques for Self-healing in Cellular Wireless Networks: Challenges and Solutions

To enable automatic deployment and management of cellular networks, the concept of the self-organizing network (SON) was introduced. SON capabilities can enhance network performance, improve service quality, and reduce operational and capital expenditure (OPEX/CAPEX). As an important component of SON, self-healing is defined as a network paradigm in which the faults of target networks are mitigated or recovered by automatically triggering a series of actions such as detection, diagnosis, and compensation. Data-driven machine learning has been recognized as a powerful tool to bring intelligence into networks and to realize self-healing. However, there are major challenges for practical applications of machine learning techniques for self-healing. In this article, we first classify these challenges into five categories: 1) data imbalance, 2) data insufficiency, 3) cost insensitivity, 4) non-real-time response, and 5) multi-source data fusion. Then we provide potential technical solutions to address these challenges. Furthermore, a case study of cost-sensitive fault detection with imbalanced data is provided to illustrate the feasibility and effectiveness of the suggested solutions.


Introduction
As cellular networks develop towards 5G and beyond, they are evolving into more complex structures characterized by heterogeneity and dense deployment. In these networks, traditional (e.g., nonautomated) methods for network deployment, configuration, optimization, and maintenance will be inefficient and will incur huge operational and maintenance expenditures. This has led to the concept of the self-organizing network (SON) advocated by the Third Generation Partnership Project (3GPP) and the Next Generation Mobile Networks (NGMN) alliance. SON includes three main functions: self-configuration, self-optimization, and self-healing. SON capabilities will enable more flexible planning and deployment of mobile networks, more efficient optimization and maintenance, less manual intervention, and lower capital expenditure (CAPEX) and operational expenditure (OPEX) [1].
As an important SON functionality, self-healing will automatically detect faults of target networks (e.g., cellular networks) and trigger corresponding actions to fix them. The self-healing functionality mainly includes four phases: fault detection, diagnosis, compensation, and recovery. The goal of fault detection is to find problems such as unacceptable service quality (e.g., due to coverage hole, excessive interference, and excessive antenna uptilt or downtilt). Fault diagnosis identifies the root cause based on key performance indicators (KPIs) and alarms. After the faults are identified, recovery actions are launched. To ensure quality of service in the process of fault recovery, a compensation mechanism is triggered to mitigate degraded network performance in the affected zone through tuning some involved network parameters automatically.
In traditional networks, operators commonly become aware of service failures only after receiving a large number of user complaints, and failure recovery depends heavily on the experience of technicians. In comparison, the objective of self-healing is to perform these tasks automatically and proactively. This naturally requires introducing intelligence into networks, for which machine learning has been recognized as a powerful tool. Specifically, machine learning techniques can automatically generate inference and classification models by training on collected data, offering accurate results for reliable decision making. Different types of machine learning techniques (e.g., supervised learning, unsupervised learning, and reinforcement learning) have been leveraged for self-healing. For example, many learning algorithms have been devised to detect cell outage and to compensate for the degraded network performance of problematic cells. Some algorithms train classifiers for fault diagnosis that can discriminate between different faults.
The main contributions of this article are threefold. Firstly, we discuss the challenges in data-driven machine learning algorithms for self-healing. Secondly, we provide some potential solution directions for addressing these issues. Thirdly, we provide a case study of cost-sensitive fault detection with imbalanced data to illustrate the feasibility and effectiveness of the suggested solutions.

Materials and Methods
2.1. Challenges in Data-Driven Machine Learning. Although machine learning technologies facilitate the development of self-healing methods for cellular networks, several major challenges exist which can impact performance and practical implementation. In this article, we classify these challenges into the following five categories:
(i) Data Imbalance. In cellular networks, due to the occurrence of rare events (e.g., network failure), the collected data sets are usually imbalanced. These imbalanced data can significantly impact the performance of classifiers, which are likely to skew towards the majority class. However, existing schemes rarely take the issue of data imbalance into account.
(ii) Data Insufficiency. The insufficiency of high-quality data can result in severe overfitting of learning models (e.g., classifiers). First, the data sets obtained from high-fidelity network simulators may not fully represent the measurements in practical cellular networks. Second, the real data from network operators (e.g., log data) may not be well organized and labeled, which makes it difficult to extract effective information and build knowledge from these data.
(iii) Cost Insensitivity. Most of the existing schemes pursue a low detection error rate while ignoring the fact that different types of misclassification errors can cause different losses to the operators. In such cases, using accuracy as the only evaluation criterion is inadequate, and cost sensitivity should be considered.
(iv) Non-Real-Time Response. Most of the existing self-healing schemes do not meet real-time response requirements due to their reactive characteristics. Specifically, they are mainly based on postoperations (e.g., diagnosing after malfunctions occur). Designing proactive schemes to reduce the delay and enable real-time response is challenging.
(v) Multisource Data Fusion. Theoretically, data from varying levels such as the subscriber level, cell level, and core network level can be jointly exploited to achieve better performance [2]. However, multisource data bring difficulties to model construction. Therefore, performing multisource data fusion for self-healing is a challenging issue.
2.1.1. Data Imbalance. Data imbalance often occurs in machine learning and data mining when at least one class contains far more samples than the other classes. For convenience, we term the class containing numerous samples the majority class and the class containing a relatively small number of samples the minority class. The ratio of the number of samples in the minority class to that in the majority class is used to measure the degree of data imbalance. In general, when this ratio is close to 1, the data imbalance is negligible. On the other hand, when the ratio is significantly less than 1, the imbalance may hamper the performance of classifiers significantly.
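As a small illustration of the imbalance ratio defined above, the following sketch computes it for the class sizes reported later in the case study (117 fault samples and 3,363 fault-free samples). The function name `imbalance_ratio` is hypothetical and chosen only for this example.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return (minority class size) / (majority class size) for a label list."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes")
    return min(counts.values()) / max(counts.values())


# Class sizes taken from the case study in this article:
# class 0 = fault (117 samples), class 1 = fault-free (3,363 samples).
labels = [0] * 117 + [1] * 3363
ratio = imbalance_ratio(labels)
print(f"imbalance ratio = {ratio:.4f}")
```

A ratio this far below 1 corresponds to the strongly imbalanced regime discussed above.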
In self-healing, fault detection and diagnosis can be considered typical classification problems. Accordingly, existing machine learning-based classification methods can be applied, with measurement data collected during network operation used to train the corresponding classifiers. Note that a cellular network functions well during most of its running time, and service failure or degradation appears with a relatively low probability. In this case, the amount of collected normal status data overwhelms that of abnormal data, which produces imbalanced training data.
However, traditional classification algorithms are designed on the premise of a balanced data set. When they are applied to fault diagnosis in a cellular network with imbalanced data, the results are usually biased towards the majority class, and the classification accuracy for the minority class is unsatisfactory [3]. Accordingly, data imbalance poses a challenge for self-healing that is not considered in most of the existing works.
2.1.2. Data Insufficiency. Currently, the lack of high-quality data poses a big challenge for the application of machine learning-based self-healing mechanisms, which could even hinder development in this area. Machine learning algorithms usually require adequate training data to train a stable model. When the training samples are insufficient, the learners are likely to acquire knowledge from peculiar features rather than common features in the data sets, which may result in severe overfitting.
Data insufficiency arises mainly for the following reasons. First, for most researchers in universities and research institutions, acquiring sufficient data from network operators is not an easy task due to privacy and business issues. In the existing literature on self-healing, most works use data from high-fidelity network simulators (e.g., NS3, Vienna-LTE, and LTE-Sim). Though these simulators provide a good simulation environment, the data collected via simulations cannot fully represent real network scenarios. Also, although mobile network measurement data may be collected by means of third-party sniffers or applications in mobile devices, some measurement data (e.g., fault-indicating data) are difficult to collect. Second, network operators have a huge amount of operation data stored in system logs. However, these data may not be well organized and labeled. Accordingly, extracting effective information is difficult. Data insufficiency also arises due to limited labels. Compared to labeled data, unlabeled data are characterized by large volumes. Annotating these unlabeled data usually requires experienced engineers and is costly and time consuming, and in some cases it may not be feasible. Therefore, applications of machine learning algorithms for self-healing need to address the challenge of training a model from insufficient real-world data.

Cost Insensitivity.
In order to evaluate the performance of a machine learning method, metrics such as accuracy, generalization ability, interpretability, time and space complexity, and cost sensitivity need to be taken into account. However, the traditional machine learning methods for self-healing focus primarily on maximizing accuracy and ignore the cost involved in the classification process (i.e., they assume equal costs for different misclassification errors). In real-world scenarios, different misclassification errors often have varying costs. For example, in cell failure diagnosis for self-healing, the cost of mistakenly diagnosing a malfunction as a fault-free case is larger than that of identifying a fault-free case as a malfunction. Flagging a fault-free case as a malfunction at least attracts the attention of engineers and prompts them to check for a failure. In contrast, mistakenly diagnosing a fault as a normal case means that the network fault is neglected. Thus, in self-healing, it is unreasonable to assign equal costs to different misclassification errors.

Non-Real-Time Response.
Existing self-healing mechanisms cannot meet the real-time response requirements for future mobile networks due to their reactive characteristics. This is due to the fact that they depend on postoperations (e.g., detecting and diagnosing after malfunctioning occurs) which can lead to poor service quality for subscribers. For real-time response, the network needs to be fully aware of the changes in context, so that a timely response can be taken when network degradation or malfunction occurs.

Fusion of Multisource Data.
Currently, for self-healing, cell-level data are frequently utilized for detecting, diagnosing, and recovering from network faults, as well as performing compensation during a performance degradation period. Theoretically, data from different levels of sources such as the subscriber level, cell level, and core network level can be jointly exploited to achieve better performance [2]. For example, subscriber-level data (e.g., connection and drop rate, throughput, and delay) are collected from diverse user devices and reflect the communication quality at the user side as well as users' communication behavior/pattern. Operator-level data (e.g., Minimization of Drive Test (MDT) reports, received interference power, and channel quality indicator (CQI)) are collected by the Operation and Maintenance Center (OMC) to monitor the changes in network.
However, data from multiple sources are characterized by different modality and granularity. Also, there could be ambiguity and spuriousness. Accordingly, these multisourced data cannot be directly exploited by most of the existing machine learning algorithms for self-healing. How to process multisource data and adjust the corresponding algorithms to achieve the potential benefits of data fusion is a challenging problem.

Solution Approaches.
In this section, we will present several potential approaches for addressing the above challenges in application of machine learning techniques for self-healing. A concise description of the solutions is given in Table 1.
2.2.1. Solutions for Data Imbalance. We will present the following two types of solution approaches for handling the data imbalance problem: data preprocessing and algorithm modification.
(1) Data Preprocessing. It is aimed at converting imbalanced data to balanced ones through changing the distribution of target data sets before they are fed to the machine learning algorithms. The common preprocessing methods are undersampling and oversampling, which change the distribution of training samples. Specifically, undersampling is used to remove several majority class samples randomly and oversampling is used to duplicate the minority class samples till a balanced data set is produced. However, undersampling may result in some important information in the majority classes being lost, and oversampling may result in overfitting due to the duplicating operations of the minority class samples [3]. One method to mitigate this problem is to combine undersampling with oversampling to achieve a trade-off between less information loss in undersampling and less severe overfitting in oversampling. Another option is to enhance the diversity of samples through combining resampling and other techniques, such as K-nearest neighbor (KNN). One common method is synthetic minority oversampling technique (SMOTE), which produces more samples for the minority classes through computing and inserting new instances among randomly selected data and their K-nearest neighbors [3].
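The combination of undersampling and oversampling described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names and the target size `n_per_class` are hypothetical and do not come from the article.

```python
import random


def random_undersample(majority, n_keep, rng):
    """Keep n_keep majority samples chosen at random without replacement."""
    return rng.sample(majority, n_keep)


def random_oversample(minority, n_target, rng):
    """Duplicate randomly chosen minority samples until n_target is reached."""
    extra = [rng.choice(minority) for _ in range(n_target - len(minority))]
    return list(minority) + extra


def combined_resample(majority, minority, n_per_class, rng):
    """Shrink the majority class and grow the minority class to n_per_class."""
    if len(majority) > n_per_class:
        majority = random_undersample(majority, n_per_class, rng)
    if len(minority) < n_per_class:
        minority = random_oversample(minority, n_per_class, rng)
    return list(majority), list(minority)


rng = random.Random(0)
majority = [("normal", i) for i in range(100)]   # toy majority class samples
minority = [("fault", i) for i in range(5)]      # toy minority class samples
maj_bal, min_bal = combined_resample(majority, minority, 20, rng)
print(len(maj_bal), len(min_bal))
```

Choosing an intermediate `n_per_class` discards fewer majority samples than pure undersampling and creates fewer duplicates than pure oversampling, which is the trade-off mentioned in the text.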
(2) Algorithm Modification. The classification problems for imbalanced data sets can also be addressed by improving existing machine learning algorithms. Two kinds of solutions may be effective. One is to set reasonable classification boundaries for the minority class samples. A one-class classifier is a common method for this solution, which estimates a boundary that encompasses as many samples as possible in each class while minimizing classification errors caused by outliers [4]. The other is to combine learning algorithms with other technologies, such as resampling and cost-sensitive learning. For instance, combining traditional classifiers (e.g., support vector machine (SVM) and decision tree) with resampling technologies is a good way to improve the classification performance for imbalanced data [3].
2.2.2. Solutions for Data Insufficiency.
(1) Data Preprocessing. The issue of data insufficiency can also be addressed by generating more data artificially. In this context, some methods used to tackle data imbalance, such as random oversampling and SMOTE and its variants, are suitable for coping with the data insufficiency problem.
(2) Algorithm Modification. At the algorithm level, the common solutions combine data preprocessing with existing machine learning algorithms. Besides, transfer learning, which is based on the idea of acquiring knowledge from one problem or field (source domain) and adapting it to the learning tasks of a new problem or area (target domain), can be a promising approach to overcome the problem of data insufficiency [5]. In self-healing, some learning tasks are similar to those in other networks. For example, there could be enough data available from industrial or wireless sensor networks related to tasks such as error recovery and intrusion detection. These data could be used to train a learning model whose knowledge is then transferred to a machine learning model for self-healing.
(3) Learning with Sufficient Unlabeled Data. When there are limited labeled samples and numerous unlabeled data, the following three solution approaches can be used: active learning, unsupervised learning, and semisupervised learning. Specifically, unsupervised learning deals with the clustering of unlabeled data. Semisupervised learning promotes the stability of learning models by combining labeled and unlabeled data. Active learning employs learning and selecting engines to find the most useful unlabeled samples, which can then be manually annotated to obtain more labeled data [6].

Solutions for the Cost-Insensitivity Problem.
These solutions assign distinct costs to different classes or each sample within the processes of learning and decision-making, so that the classifiers pay more attention to costly classification results. These solutions can be classified into two types: cost-sensitive learning and introducing new evaluation metrics.
(1) Cost-Sensitive Learning. In general, the classification costs are assigned either for different categories or for each sample, which are known as class-dependent cost and example-dependent cost, respectively [7]. The class-dependent cost denotes that different classes have distinct costs while each sample in one class has an equal cost, whereas the example-dependent cost implies that each sample has a different cost even when these samples belong to the same class.
(i) Class-Dependent Cost. The methods based on class-dependent costs [7] allocate different penalties to different classes. These algorithms are mainly classified into two groups. One embeds this cost into common classifiers such as SVM to obtain cost-sensitive classifiers. The other is based on Bayesian decision theory and achieves minimum misclassification costs by minimizing the conditional risk.
(ii) Example-Dependent Cost. The approaches based on example-dependent costs aim to change cost-sensitive learning tasks into cost-insensitive ones by two types of methods. One alters the distribution of the original samples based on different weight values. The other optimizes the weight values of the weighted original samples to obtain the expected minimum classification costs [7].
(2) Introducing New Evaluation Metrics. In the presence of data imbalance or cost insensitivity, evaluating an algorithm using accuracy alone is not enough. Instead, more effective metrics should be adopted, such as the F-measure, receiver operating characteristic (ROC) curve, precision-recall curve, and cost curve [3]. These evaluation criteria use the numbers of true positives, false positives, true negatives, and false negatives to reflect system performance from different angles. For example, the ROC curve takes the true and false positive rates into account by placing them in the same coordinate system.
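The Bayesian minimum-risk rule mentioned under class-dependent cost can be sketched as follows. The sketch assumes a cost matrix `cost[i][j]` giving the cost of predicting class `j` when the true class is `i`, with class 0 as the fault class and class 1 as the fault-free class (the same convention as the case study); the function names and the example probabilities are hypothetical.

```python
def conditional_risk(posterior, cost, j):
    """Expected cost of predicting class j given class posteriors P(i | x)."""
    return sum(posterior[i] * cost[i][j] for i in range(len(posterior)))


def min_risk_decision(posterior, cost):
    """Return the class index whose conditional risk is smallest."""
    risks = [conditional_risk(posterior, cost, j) for j in range(len(cost[0]))]
    return min(range(len(risks)), key=risks.__getitem__)


# Illustrative posterior: 10% fault (class 0), 90% fault-free (class 1).
posterior = [0.1, 0.9]

equal_cost = [[0, 1], [1, 0]]    # both misclassification errors cost 1
skewed_cost = [[0, 20], [1, 0]]  # missing a fault costs 20 times more

print(min_risk_decision(posterior, equal_cost))   # predicts fault-free
print(min_risk_decision(posterior, skewed_cost))  # predicts fault
```

With equal costs the rule reduces to picking the most probable class; raising the cost of a missed fault moves the decision towards the fault class even when its posterior probability is low.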

Solutions for the Non-Real-Time Response Problem.
The main solution for the non-real-time response problem is to upgrade the existing self-healing approach from reactive to proactive. Toward this, in Figure 1, we introduce a proactive context-aware self-healing framework. The core idea of this framework is that the system predicts changes of near-term network performance by using real-time models to capture knowledge from historical and current context information. When possible faults in the near future are predicted, it can trigger self-healing to adjust the current network parameters so that possible loss caused by the faults is minimized. Different components of this self-healing framework are described below.
(i) Data Collection. This block is primarily used to gather enough context information, which is classified into three groups: network context, user context, and device context [8].
(ii) Data Preprocessing. The raw data gathered from different contexts cannot be used directly by prediction models because of characteristics such as redundancy and differing granularity. The goal of this step is to turn these imperfect raw data into usable ones through methods such as filtering, ranking, and fusion.
(iii) Context Prediction Model. The main task of this block is to build prediction models (e.g., regression models) with the processed data.
(iv) Self-Healing. This block is mainly used to analyze the predicted results from the prediction model and generate corresponding actions for recovery or compensation.
(v) Dynamic Response. This block is used to perform actions for self-healing by generating new parameters to be fed into the network to reconfigure it.
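To make the prediction and self-healing blocks more concrete, the sketch below fits a least-squares line to recent values of a single KPI, extrapolates one step ahead, and raises a flag when the predicted value falls below a threshold. It is only an illustration under strong assumptions: the linear model, the KPI values, the threshold, and all function names are hypothetical and are not taken from the framework in Figure 1.

```python
def fit_line(values):
    """Least-squares line through (0, v0), (1, v1), ...; returns (slope, intercept)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in range(n))
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t


def predict_next(values, steps_ahead=1):
    """Extrapolate the fitted line steps_ahead samples past the last observation."""
    slope, intercept = fit_line(values)
    return intercept + slope * (len(values) - 1 + steps_ahead)


def proactive_check(history, threshold, steps_ahead=1):
    """Return (predicted KPI, True if self-healing should be triggered)."""
    predicted = predict_next(history, steps_ahead)
    return predicted, predicted < threshold


# Hypothetical KPI history (e.g., a throughput-like indicator that is degrading).
history = [10.0, 9.0, 8.0, 7.0]
predicted, trigger = proactive_check(history, threshold=6.5)
print(predicted, trigger)
```

In a real deployment the prediction model would be trained on preprocessed multi-context data, and the trigger would feed the dynamic response block rather than print a flag.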

Solution for the Fusion Problem with Multisource Data.
Fusion of multisource data aims to obtain unified information by analyzing and reorganizing data that come from heterogeneous devices and different scenarios. According to [9], we classify the solutions for the data fusion problem into three types: probability-based methods, evidence-theory-based methods, and artificial intelligence-based approaches. Briefly, probabilistic techniques (e.g., Bayesian analysis and Markov chains) are frequently utilized to discover consistent information from random variables, events, or processes, which makes them suitable for dealing with uncertain and imprecise multisource data. Evidence-theory-based methods usually use symbolic variables and combination rules to infer consistent information from multisource uncertain data. Moreover, artificial intelligence approaches (e.g., machine learning, fuzzy logic, and genetic algorithms) are used for data fusion due to their strong ability to process large-scale complex data.
Figure 1: A proactive context-aware self-healing framework.

Case Study: Cost-Sensitive Fault Detection with Imbalanced Data.
We provide a case study on fault detection in order to illustrate the challenges and solutions related to data imbalance and cost-sensitivity issues in machine learning-based solutions for self-healing. We propose a mechanism which enables fault detection through discriminating fault and fault-free measurements, jointly considering data imbalance and cost sensitivity.
2.3.1. Classification from Imbalanced Data. To handle the problem of data imbalance in fault detection, resampling techniques can be used to preprocess the data. However, as mentioned before, undersampling may lose important information in the majority class because it removes samples, and oversampling may cause overfitting because it duplicates minority class samples. SMOTE, as an improved scheme, generates new minority class samples from neighboring samples, which are selected by the K-nearest neighbor algorithm. SMOTE can therefore mitigate the overfitting problem. With this method, a new minority sample is obtained as follows: for a sample $X_i$ in the minority class, find the K-nearest neighbors of $X_i$, randomly select a sample $X_j$ from these neighbors, calculate the difference $\mathrm{diff} = X_j - X_i$, and finally obtain a new sample $X_n = X_i + \mathrm{rand}(0, 1) \cdot (X_j - X_i)$, so that $X_n$ lies on the line segment between $X_i$ and $X_j$. The entire process of fault detection is shown in Figure 2. In this experiment, we employ the following two methods to demonstrate the necessity of considering data imbalance for machine learning algorithms in self-healing, and the experimental results are shown in Figure 3.
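The SMOTE procedure described above can be sketched in a few lines. This is a minimal standard-library illustration rather than the implementation used in the experiments; it interpolates from a minority sample towards one of its K-nearest minority neighbors, so each synthetic point lies on the segment between the two. The function names and the toy data are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_nearest(sample, pool, k):
    """Return the k points in pool closest to sample, excluding sample itself."""
    others = [p for p in pool if p is not sample]
    return sorted(others, key=lambda p: squared_distance(sample, p))[:k]


def smote(minority, n_new, k, rng):
    """Generate n_new synthetic minority samples by neighbor interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        x_i = rng.choice(minority)                     # sample to start from
        x_j = rng.choice(k_nearest(x_i, minority, k))  # one of its k neighbors
        gap = rng.random()                             # rand(0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x_i, x_j)))
    return synthetic


rng = random.Random(1)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # toy fault samples
new_samples = smote(minority, n_new=10, k=2, rng=rng)
print(len(new_samples))
```

Because the new points are interpolated rather than copied, the minority class gains diversity that plain duplication does not provide.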
(i) Method 1. We use SVM to classify the imbalanced data set directly.
(ii) Method 2. First, oversampling and SMOTE are each used to preprocess the imbalanced data set and convert it to balanced data. Next, SVM is used to classify the balanced data set.
2.3.2. An Example of Cost-Sensitive Learning in Self-Healing. We will explain the necessity of considering cost sensitivity for the existing machine learning algorithms in self-healing and show changes in classification results under different costs. We use $C_{ij}$ ($i, j \in \{0, 1\}$) as the cost of misclassifying true class $i$ as predicted class $j$, where we preset $C_{00} = C_{11} = 0$ and $C_{10} = 1$. Class 0 and class 1 represent the fault and fault-free classes, respectively. The cost ratio denotes the ratio of $C_{01}$ to $C_{10}$. We have done two tests, which are described as follows:
(i) Test 1. At first, we use SVM to train a model with a training set and utilize cost-sensitive SVM (CS-SVM) [10] to train different models based on varying cost ratios (i.e., changing $C_{01}$ from 1 to 30). Also, we combine CS-SVM with SMOTE. Next, the test set is utilized to validate the model, and in this process, misclassification costs are calculated by comparing the predicted labels with the test set labels. Finally, the total costs for the different cost ratios are obtained, and the experimental results are shown in Figure 4.
(ii) Test 2. We compare the classification results of CS-SVM under different cost ratios, which are shown in Figure 5.
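The total misclassification cost used in Test 1 can be computed as sketched below, using the convention that `cost[i][j]` is the cost of predicting class `j` for true class `i`, with class 0 as fault and class 1 as fault-free. The labels and predictions are made-up toy values, and `total_cost` is a hypothetical helper, not code from the experiments.

```python
def total_cost(y_true, y_pred, cost):
    """Sum cost[true][pred] over all test samples."""
    return sum(cost[t][p] for t, p in zip(y_true, y_pred))


# Toy test set: class 0 = fault, class 1 = fault-free.
y_true = [0, 0, 1, 1, 1]
y_pred = [1, 0, 1, 0, 1]   # one missed fault and one false alarm

for ratio in (1, 10, 30):
    cost = [[0, ratio], [1, 0]]   # C_00 = C_11 = 0, C_10 = 1, C_01 = ratio
    print(ratio, total_cost(y_true, y_pred, cost))
```

For a fixed set of predictions, the total cost grows linearly with the cost ratio, since every missed fault is charged $C_{01}$. A cost-sensitive classifier can counter this growth only by changing its predictions to miss fewer faults as the ratio increases.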

Results
We use the simulation scenario proposed in [11]. We only consider the binary classification problem in this article. An imbalanced data set is utilized, with 117 fault samples in class 0 and 3,363 fault-free samples in class 1. We split the entire data set into a training set (2,783 samples) and a testing set (696 samples), and each sample consists of seven key performance indicators (KPIs): retainability, handover success rate, reference signal received power (RSRP), reference signal received quality (RSRQ), signal-to-interference-plus-noise ratio (SINR), throughput, and distance. For performance evaluation, we show the results through ROC curves and use the area under the ROC curve (AUC) to compare different classification algorithms. The larger the AUC, the better the classification performance.
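For readers unfamiliar with the AUC, the sketch below computes it from classifier scores using its pairwise interpretation: the AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. Treating the fault class (class 0) as the positive class is an assumption made for this example, and the scores are made-up toy values.

```python
def auc(scores, labels, positive):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Count positive-negative pairs ranked correctly; ties count as 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy fault scores (higher = more likely a fault); class 0 = fault.
scores = [0.9, 0.4, 0.35, 0.8, 0.1]
labels = [0, 0, 1, 1, 1]
print(auc(scores, labels, positive=0))
```

Unlike accuracy, this quantity does not depend on the class proportions or on a single decision threshold, which is why it suits imbalanced data.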
For classification based on imbalanced fault data, the corresponding results are shown in Figure 3. As can be seen, compared to method 1, method 2 achieves a higher AUC. This demonstrates that, when the data are imbalanced, the performance of traditional classifiers is degraded, and preprocessing imbalanced data via oversampling or SMOTE is an effective way to improve it. In addition, comparing oversampling with SMOTE, the latter works better. This illustrates that SMOTE can improve on random oversampling to some extent.
With regard to the experiment related to cost sensitivity, as can be seen from Figure 4, as the cost ratio changes from 1 to 30, the total costs of traditional SVM increase linearly, while those of CS-SVM stop increasing once the cost ratio is larger than 20. Also, lower total costs can be achieved by adding SMOTE on top of CS-SVM. This illustrates that cost-sensitive algorithms can effectively control misclassification costs, and a hybrid of SMOTE and cost-sensitive learning can provide better results for the classification of imbalanced data. Also, we can see from Figure 5 that, when presetting a larger cost ratio for CS-SVM, higher classification performance is obtained. This indicates that network faults can be detected more easily when a larger cost ratio is set.

Discussion
As a key component of SON, self-healing will play a vital role in realizing intelligent operation in next generation cellular networks. It has been well recognized that data-driven machine learning techniques are useful for the development of self-healing mechanisms, and considerable research effort has been devoted to this topic. However, the application of machine learning techniques in this paradigm faces challenges such as data imbalance and insufficiency, cost insensitivity, non-real-time response, and the fusion of multisource data. In this article, we have concisely discussed these challenges and provided potential solutions. In addition, a case study of cost-sensitive fault detection has been presented to illustrate the effectiveness and feasibility of the suggested approaches.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon reasonable request.