AI@Edge — Neural Tomography: Predicting Network Characteristics Using Sparse End-to-end Observations
Authors: Liang Ma, Utpal Mangla, Satish Sadagopan, Mudhakar Srivatsa, Mathews Thomas and Dinesh Verma
In our first blog we provided an overview of the challenges in implementing AI/ML at the edge. In our second blog we introduced coresets — machine learning algorithms to succinctly summarize the data from the edge devices without losing data quality, while enabling real-time decisions with pre-trained AI models at the point of action, and at the speed of 5G. In our third blog, we discussed how, with Federated Learning, AI/ML models can learn from data residing across multiple edges, without sharing the raw data (thereby, offering higher levels of data privacy). In the fourth blog, we discussed NeuralFP, a technique for Model Fingerprinting, that allows us to select the right ML model for the dataset we are dealing with, based on the characteristics of the local/edge data.
In this blog, we will discuss Neural Tomography, which allows us to monitor network performance and dynamically configure the network for optimal performance. The 5G network facilitates the dynamic re-configuration or creation of network slices, which allows mobile entities to have the best available network quality along their mobility path or route.
Returning to our ambulance use case that was introduced in the previous blogs, the state of the network performance along the route of the ambulance is constantly monitored for anomalous conditions of degradation or outage, and a corrective action is applied in near real time to resolve the anomaly. This may involve changing the configuration parameters of a 5G network slice, routing the ambulance network traffic to a new slice with the required network bandwidth or Quality of Service (QoS), or making various other changes to the network. Determining what the anomaly is is not easy, as we will not know the precise details of the network topology, given all the far edge devices and near edge nodes that interact with the core network. In addition, security constraints, the inability to extract network performance data from desired nodes, and other issues prevent a clear understanding of the end-to-end network.
With Cloud Pak for Watson AIOps and Cloud Pak for Network Automation, network performance data, fault data, logs and other instrumented data are collected, and AI/ML models detect the anomaly. The anomaly is then analyzed by an AI Engine to find the best resolution steps to apply. With Neural Tomography, the best path for the ambulance can be determined where applicable, along with a corresponding network slice. Those resolution steps are orchestrated via the Network Automation Orchestrator for a closed loop automated resolution.
However, in reality, it is challenging to obtain an accurate network topology and to collect and analyze all the required data points. We will now discuss how, with the help of Neural Network Tomography, we can predict network characteristics and reconstruct network topologies with significantly higher accuracy and robustness using only limited measurement data. With only end-to-end path performance metrics of sampled node pairs, Neural Tomography uses a deep neural network and data augmentation to predict the unmeasured performance metrics by learning non-linear relationships between node pairs and the underlying unknown topological/routing properties. In addition, Neural Tomography can be employed to reconstruct the original network topology, which is critical to most network planning tasks. These techniques can contribute significantly to a better understanding of the network in edge computing, where it is often difficult to get the data required to analyze and respond to the network issues encountered at the edge.
In particular, Neural Tomography eliminates the need for full topology information and for specific assumptions around additive performance metrics (e.g., latency is additive, packet loss rate is log additive). The input to the neural tomography framework is only a set of end-to-end path measurements for some node pairs, and the output is the predicted path performance metrics for all unmeasured node pairs. For each input data point, the only available information is the starting and terminating nodes and their corresponding path performance metric. The neural tomography framework is based on a deep neural network model, which learns the non-linear relationship between the node pairs and their path performance metrics by maximizing the likelihood of nodes appearing in each path. Extensive experiments on both Rocketfuel and CAIDA network data show that, by measuring only 30% of the node pairs, Neural Tomography is able to accurately predict the path performance of the remaining 70% of node pairs, with a mean absolute percentage error (MAPE) as small as 2%.
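To make the evaluation setup concrete, here is a minimal Python sketch of how node pairs might be split into a measured 30% and an unmeasured 70%, and how MAPE could be computed on the unmeasured part. The function names, the uniform random sampling, and the toy numbers are our own assumptions for illustration, not the procedure used in the experiments.

```python
import itertools
import random

def split_node_pairs(n_nodes, measured_fraction=0.3, seed=0):
    """Split all unordered node pairs into measured and unmeasured sets.

    Assumption: pairs are sampled uniformly at random; the experiments
    described in the text may have sampled differently.
    """
    pairs = list(itertools.combinations(range(n_nodes), 2))
    rng = random.Random(seed)
    rng.shuffle(pairs)
    n_measured = round(measured_fraction * len(pairs))
    return pairs[:n_measured], pairs[n_measured:]

def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Assumes no actual value is zero.
    """
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)

# Toy usage: 21 nodes give 210 unordered pairs, of which 63 are measured.
measured, unmeasured = split_node_pairs(n_nodes=21)

# Hypothetical latencies (ms) for two unmeasured pairs and their predictions.
example_error = mape(actual=[100.0, 200.0], predicted=[98.0, 204.0])
```

In this toy case each prediction is off by 2% of the true value, so the resulting MAPE is 2%.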
The figure above shows the neural tomography architecture. The model consists of k fully connected hidden layers, each containing γ neurons. Here, γ is the estimated number of links in the network: a precise estimate improves model accuracy, but a conservative overestimate often suffices, although a tight estimate requires less training data. At the input layer, the node pair, say v₁ and v₂, is mapped to an n-dimensional “two-hot” vector v₀ (n is the total number of nodes). Next, as in typical fully connected neural networks, v₀ᵀ is multiplied by an n×γ matrix M₁ and a bias vector b₁ is added. The resulting v₀ᵀM₁+b₁ᵀ is passed to hidden layer 1 and taken as input by an activation function, i.e., hidden layer 1 outputs σ(v₀ᵀM₁+b₁ᵀ). Each subsequent hidden layer uses the same activation function σ(·) and operates in the same way. Finally, vₖᵀ generated by hidden layer k is multiplied by a γ×1 weight vector m, and the product vₖᵀm is output as the final path performance metric between the input node pair v₁ and v₂ (the output layer has only one neuron and no activation function or bias). In essence, each hidden layer captures the probability of a link appearing on the path between v₁ and v₂ (often referred to as the stochastic routing matrix), and the use of multiple hidden layers attempts to refine that probability. The network learns to mimic this stochastic routing matrix for both additive and non-additive metrics, using sigmoid as the activation function.
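The forward pass described above can be sketched in a few lines of NumPy. This is an illustrative, untrained sketch rather than the actual Neural Tomography implementation: the helper names, the random initialization, and the γ×γ shape of the weight matrices in hidden layers 2 to k are our assumptions (the text only says those layers operate in the same way as layer 1).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def two_hot(n, v1, v2):
    """n-dimensional input vector with ones at the two (distinct) node indices."""
    v0 = np.zeros(n)
    v0[[v1, v2]] = 1.0
    return v0

def init_params(n, gamma, k, seed=0):
    """Random, untrained parameters (hypothetical initialization)."""
    rng = np.random.default_rng(seed)
    # Layer 1 maps n inputs to gamma neurons; layers 2..k are assumed gamma x gamma.
    weights = [rng.normal(scale=0.1, size=(n, gamma))]
    weights += [rng.normal(scale=0.1, size=(gamma, gamma)) for _ in range(k - 1)]
    biases = [np.zeros(gamma) for _ in range(k)]
    m = rng.normal(scale=0.1, size=gamma)  # final gamma x 1 weight vector
    return weights, biases, m

def forward(v0, weights, biases, m):
    """Apply sigma(v^T M + b^T) for each hidden layer, then output v_k^T m."""
    v = v0
    for M, b in zip(weights, biases):
        v = sigmoid(v @ M + b)
    return float(v @ m)  # single output neuron, no bias or activation

# Toy usage: 6 nodes, an estimated 8 links, 3 hidden layers.
weights, biases, m = init_params(n=6, gamma=8, k=3)
prediction = forward(two_hot(6, 1, 3), weights, biases, m)
```

Because the two-hot encoding does not record which node is the source and which is the destination, this sketch returns the same value for the pair (v₁, v₂) and the pair (v₂, v₁).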
The figure below shows the effectiveness of neural tomography on both additive and non-additive performance metrics, as measured on the AS3257 topology. Neural tomography is able to accurately learn and predict end-to-end performance metrics from about 30% of the raw end-to-end measurements. It produces a significantly better match to ground truth measurements than NMF (Non-Negative Matrix Factorization) and NeuMF (Neural Matrix Factorization).
Further, Neural Tomography can be used to learn the topology by setting the performance metric to the minimum number of hops. While it is more challenging to predict the one-hop (immediate) neighbors of a given node with a low false negative rate, hop counts of three or higher could be determined with fairly low false positive and false negative rates of under 10%. The topology inference errors of other state-of-the-art techniques (such as NMF and NeuMF) were much worse, often exceeding 50% for both false positives and false negatives.
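As a rough illustration of how predicted hop counts could be turned into a topology, the sketch below declares a link between two nodes whenever the predicted hop count is closest to one, then scores the result against a known graph. The 1.5 threshold, the function names, the rate definitions, and the toy predictions are all assumptions made for this example; the text does not specify how neighbors were inferred or how the error rates were computed.

```python
def infer_edges(predicted_hops, threshold=1.5):
    """Treat a node pair as directly linked if its predicted hop count is below the threshold."""
    return {pair for pair, hops in predicted_hops.items() if hops < threshold}

def error_rates(inferred, true_edges, all_pairs):
    """Return (false positive rate, false negative rate) for the inferred links.

    Assumed definitions: false positives are counted against true non-links,
    false negatives against true links.
    """
    non_edges = set(all_pairs) - true_edges
    false_pos = len(inferred - true_edges)
    false_neg = len(true_edges - inferred)
    fp_rate = false_pos / len(non_edges) if non_edges else 0.0
    fn_rate = false_neg / len(true_edges) if true_edges else 0.0
    return fp_rate, fn_rate

# Toy ground truth: a 4-node line 0 - 1 - 2 - 3.
true_edges = {(0, 1), (1, 2), (2, 3)}

# Hypothetical model outputs (predicted minimum hop counts per node pair).
predicted_hops = {
    (0, 1): 1.1, (1, 2): 1.6, (2, 3): 0.9,
    (0, 2): 2.1, (1, 3): 1.4, (0, 3): 3.0,
}

inferred = infer_edges(predicted_hops)
fp_rate, fn_rate = error_rates(inferred, true_edges, predicted_hops.keys())
```

Here the link (1, 2) is missed and the non-link (1, 3) is wrongly inferred, so both rates come out to one in three.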
In conclusion, Neural Network Tomography can significantly improve edge computing by applying AI to predict network characteristics and reconstruct network topologies with significantly higher accuracy and robustness using only limited measurement data. With only end-to-end path performance metrics of sampled node pairs, Neural Tomography can predict the unmeasured performance metrics and be employed to reconstruct the original network topology. With better visibility into the actual topology, in the ambulance use case we are able to guide the ambulance along the route that offers the best network coverage and performance, for example through the best network slice, ensuring the best care for the patient en route as well as timely arrival at the destination.