Closed Loop Automation with AI@Edge

Edge AI @ IBM Research
Mar 26, 2021


Authors: Utpal Mangla, Satish Sadagopan, Mudhakar Srivatsa, Matthews Thomas, Dinesh Verma

With this blog, we close the AI@Edge blog series with a point of view on applying AIOps to anomaly detection and closed-loop automated resolution for managing telecom network and IT workloads from edge to core. The concept of AIOps, and its realization in IBM’s Watson AIOps, is described elegantly in Watson AIOps: AI for IT operations and Management, and this blog builds on that.

In this blog, we expand on that with the concept of distributed AIOps: applying those capabilities from the far edge to the core, at whichever point it makes the most sense to perform anomaly detection and automated resolution.

In the telecom network, unlike in centralized IT data centers, the network functions are decentralized and distributed across a hierarchy of data centers (or offices), as depicted below.

With such an architecture, if a solution such as AIOps were applied centrally, network monitoring data such as events, metrics and logs would have to be hauled from the edge to central data centers for analysis, diagnosis and triggering of actions. As can be expected, this brings all the shortcomings of hauling data from edge to core on the control/management plane: unnecessary bandwidth consumption, centralized analytics on large data sets, and latency both in transferring data from edge to core and in transferring insights and actions from core back to edge.

This leads to the concept of distributed AIOps. Anomaly detection and resolution are performed at the distributed data centers, within the span of data and control available at each location. Exceptions, and scenarios that need analytics and diagnosis across larger areas, are handled at the higher layers of the hierarchy as we move towards the core, as depicted below.

With this architecture, the AIOps stack at the edge data centers is trained to look for patterns in the data associated with the types of network and IT workloads deployed in that layer of the network. Similarly, the AIOps stack in the regional data centers is trained to detect patterns for the network and IT workloads deployed at that layer. At the core, the AIOps stack looks for patterns that are global across the entire network.

The AIOps stack will not only detect the patterns, but will also perform root cause analysis, determine the best resolution actions and apply those resolution steps via automation and orchestration tools within its own area of control.
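The hierarchy described above can be pictured as a chain of handlers. The following is only a minimal sketch under stated assumptions: the class `AIOpsNode`, its `known_fixes` table, and the anomaly and action names are all hypothetical illustrations, not part of Watson AIOps. Each node resolves what falls within its own scope of control and escalates everything else towards the core.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class AIOpsNode:
    """One AIOps stack in the hierarchy (edge, regional, or core). Hypothetical."""
    name: str
    known_fixes: Dict[str, str]           # anomaly type -> action within this node's scope
    parent: Optional["AIOpsNode"] = None  # next layer towards the core

    def handle(self, anomaly_type: str) -> str:
        # Resolve locally when the anomaly is within this node's scope of control.
        if anomaly_type in self.known_fixes:
            return f"{self.name}: {self.known_fixes[anomaly_type]}"
        # Otherwise escalate to the next layer of the hierarchy.
        if self.parent is not None:
            return self.parent.handle(anomaly_type)
        # The core has no parent: nothing in the hierarchy could resolve it.
        return f"{self.name}: unresolved, raise to a human operator"


# Illustrative three-layer hierarchy; the anomaly and action names are made up.
core = AIOpsNode("core", {"backbone_congestion": "reroute traffic"})
regional = AIOpsNode("regional", {"link_flap": "restart interface"}, parent=core)
edge = AIOpsNode("edge", {"cpu_saturation": "scale out workload"}, parent=regional)

print(edge.handle("cpu_saturation"))       # handled at the edge
print(edge.handle("backbone_congestion"))  # escalated to the core
```

The point of the design is that most events never leave the layer where they were observed; only the exceptions travel upward.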

This allows for AI models for anomaly detection that are more specific to the types of workloads. They are simpler (less complex) and nimbler (faster results), since they sit closer to the source and work on smaller but more relevant data sets. The automated resolution workflows that identify the best resolution and execute its steps are likewise simpler, faster and more relevant to the local scope of control and management.
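To make "simpler and nimbler" concrete, here is a minimal sketch of a detector light enough to run on a small edge node. It is an assumption for illustration only: the post does not say which models the AIOps stack uses, and the rolling z-score below is a generic textbook stand-in. The class name, window size and threshold are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class RollingZScoreDetector:
    """Flags a sample that deviates strongly from a short window of recent samples."""

    def __init__(self, window: int = 30, threshold: float = 3.0, warmup: int = 5):
        self.values = deque(maxlen=window)  # only recent, local data is kept
        self.threshold = threshold
        self.warmup = warmup

    def observe(self, x: float) -> bool:
        """Return True if x is anomalous relative to the recent window."""
        is_anomaly = False
        if len(self.values) >= self.warmup:
            mu = mean(self.values)
            sigma = pstdev(self.values)
            if sigma > 0 and abs(x - mu) / sigma > self.threshold:
                is_anomaly = True
        # Keep anomalous samples out of the baseline so they do not skew it.
        if not is_anomaly:
            self.values.append(x)
        return is_anomaly


detector = RollingZScoreDetector()
baseline_flags = [detector.observe(v) for v in [10, 11, 10, 9, 10, 11, 10, 9]]
spike_flag = detector.observe(50)   # far outside the recent range
normal_flag = detector.observe(10)  # back to normal
```

Because the window holds only a few dozen local samples, both memory use and decision latency stay small, which is the trade-off the paragraph above describes.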

Let us now return to the ambulance use case. As the ambulance travels to the hospital, it can encounter many issues, such as poor video transmission, delayed transfer of the patient’s vital signs, or incomplete patient data due to data loss. Resolving such an issue can be complex because there could be multiple causes, but it primarily boils down to two areas:

  • A network issue, but we need to identify where the issue lies. It could be at the edge node, in the core network or any of the other network nodes
  • An application issue where the application is using the wrong model to analyze the data or is overloaded with information or the right application is not running at the edge node

It is not sufficient to simply identify the issue; it must also be corrected as quickly as possible with minimal human intervention. The solution could be making a change to the network, deploying new applications or advising the ambulance driver to take a different route. In addition, it is important to ensure the same problem does not recur, so learning from each incident helps prevent future occurrences.

The solution is to integrate AIOps into the end-to-end network of the CSP, including the MEC and the other nodes that run the applications. AIOps will be integrated with the core primitives in the SDK described earlier in this series, such as Coresets, Federated Learning, Model Fingerprinting, and Neural Tomography. The full AIOps stack, or a subset of it, will run at the different nodes of the CSP’s network. The primary limitation on what can run will be the footprint of the node.

The AIOps system closest to the data will gather the information, including metrics, alerts, events, logs, application and infrastructure topology, and deployment configurations, as described in Watson AIOps: AI for IT operations and Management, referenced above. AIOps will use its event management, incident diagnosis, incident resolution, and insights delivery capabilities to resolve and correct the issue.

In some cases, the AIOps system at a particular node may not have sufficient compute or the right data to resolve the issue. If so, the relevant data will be sent to related AIOps systems in the Telco network hierarchy. Functionality available in the SDK will also be integrated into this process. For example, Coresets could be used to ensure only the relevant data is sent. The upstream systems can further analyze the data and use Federated Learning to enhance the models running on the AIOps systems closest to the data. If AIOps is able to resolve the issue, it will usually interact with an RPA system, which will then implement the appropriate changes to rectify the problem.
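As a rough illustration of how upstream systems might combine what several edge nodes have learned, the sketch below uses federated averaging (FedAvg), a widely used aggregation scheme. This is an assumption: the post does not state which aggregation method the Federated Learning SDK uses, and the function name and data layout here are hypothetical. Each node shares only model parameters and a sample count, never its raw monitoring data.

```python
from typing import List, Sequence, Tuple


def federated_average(updates: Sequence[Tuple[Sequence[float], int]]) -> List[float]:
    """Combine per-node model parameters, weighted by each node's sample count.

    updates: one (parameters, num_samples) pair per participating node.
    """
    total_samples = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [
        sum(params[i] * n for params, n in updates) / total_samples
        for i in range(dim)
    ]


# Two hypothetical edge nodes: the second trained on three times as much data,
# so its parameters carry three times the weight.
global_params = federated_average([
    ([1.0, 2.0], 1),
    ([3.0, 4.0], 3),
])
print(global_params)  # [2.5, 3.5]
```

The averaged parameters would then be pushed back down to the edge nodes, which is one way to read "enhance the models running on the AIOps systems" without moving raw data to the core.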

An example use case to correct network degradation when the ambulance is in motion:

  1. The AIOps system has been trained to resolve specific network related performance issues on the transport layer. The training consists of learnings from time series and log data, events, and documents. This AIOps system will usually run in the core network or main data center of the CSP.
  2. AIOps constantly monitors the log data and time series data at all the edge nodes, using predictive models to detect potential degradation of the network. These predictive models could run at the far edge or closer to the core network. Coresets will be used to ensure the right data is sent to the Predictive Insights system.
  3. Once the predictive models identify a potential issue, the AIOps system will use the pretrained models described in (1) to resolve it. If a resolution is found, AIOps will invoke the orchestration layer to correct the issue at the network layer by making the appropriate configuration changes in the SDN layer. This may also require changes at the application layer, so that the application is updated with the appropriate models using the Federated Learning SDK.
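The three steps above can be sketched as a single loop: predict, look up a resolution, and hand the change to an orchestrator. Everything in this sketch is an illustrative assumption. The naive trend extrapolation stands in for the predictive models, the `RESOLUTIONS` table stands in for the pretrained resolution models, and `apply_change` is a hypothetical callback standing in for the orchestration and SDN layers.

```python
from typing import Callable, Optional, Sequence

# Hypothetical mapping from a predicted issue to a remediation action.
RESOLUTIONS = {"latency_degradation": "increase_qos_priority"}


def predict_next(latency_ms: Sequence[float]) -> float:
    """Extrapolate the next sample from the average step between recent samples.

    Requires at least two samples.
    """
    steps = [b - a for a, b in zip(latency_ms, latency_ms[1:])]
    return latency_ms[-1] + sum(steps) / len(steps)


def closed_loop(
    latency_ms: Sequence[float],
    limit_ms: float,
    apply_change: Callable[[str], None],
) -> Optional[str]:
    """Run one pass of the loop and return the action taken, if any."""
    # Step 2: monitor and predict potential degradation.
    if predict_next(latency_ms) <= limit_ms:
        return None
    # Step 3: look for a known resolution, then invoke the orchestrator.
    action = RESOLUTIONS.get("latency_degradation")
    if action is not None:
        apply_change(action)
    return action


applied = []  # records what the stand-in orchestrator was asked to do
rising = closed_loop([20, 30, 40, 50], limit_ms=55, apply_change=applied.append)
steady = closed_loop([20, 20, 21, 20], limit_ms=55, apply_change=applied.append)
```

In the rising series the extrapolated next value exceeds the limit, so the action is applied before the degradation is actually observed; the steady series triggers nothing. That act-before-impact behaviour is what distinguishes the closed loop from plain alerting.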

This is the last blog in our series on AI@Edge. Emerging developments in Edge, 5G and AI bring many exciting opportunities to implement new use cases across multiple industries. They also bring new challenges in ensuring that the AI components run as effectively as possible at the edge. Federated Learning, Coresets, Model Fingerprinting, and Neural Tomography are some of the core primitives which, when integrated into the AIOps environment, can address some of these challenges. IBM continues to collaborate with our partners, academic institutions and government labs. We hope this series gave you some insight into the ongoing work, and we invite you to join us on the journey.
