AI@Edge — Coreset: Intelligent Data Sampling for AI/ML at the 5G Edge

Edge AI @ IBM Research
Nov 13, 2020

Authors: Wendy Chong, Utpal Mangla, Satish Sadagopan, Mudhakar Srivatsa, Mathews Thomas, Dinesh Verma, Shiqiang Wang

Many modern applications generate a huge volume of data at edge devices such as Internet of Things (IoT) sensors, cameras, and smartphones. With 5G, this volume is only expected to grow. It would be highly beneficial to use such data to train new machine learning (ML) models. Doing so is challenging, however, because of communication bandwidth and storage limitations, as well as privacy regulations that may prohibit sharing raw data with the cloud. To overcome this challenge, instead of sharing raw data, we can share a summary derived from the raw data, which is much smaller in size and can be constructed to exclude sensitive private information. When the data summary is appropriately constructed, it includes most of the information that is needed to train a good machine learning model.

In our ambulance use case, various types of data are collected for analysis, such as the vital statistics of the patient, the status and location of the ambulance, and the video feed from the ambulance to the hospital. This generates a lot of data that can help train models to better determine and predict the health of the patient, and to determine whether the ambulance is on the right route: one that is associated with the optimal network signal or network slice and that avoids disruptions, such as a tunnel that can interrupt the signal or a traffic jam that can delay the patient reaching the hospital. A key observation is that diverse, representative data that can truly enhance machine learning models is encountered rarely; the rest of the data is fairly predictable and redundant. Hence a summary of the most important data, constructed with data privacy in mind, can suffice to train the models. Sending the summary will also help reduce network traffic and prevent overloading the systems receiving the data, as only relevant data is sent.

The following diagram, which was introduced in the first blog, provides an overview of what is occurring:

  1. Data is created as the ambulance moves across different locations
  2. The data is transferred through the 5G network to the network edge
  3. A data summary is sent from the network edge to the central cloud

The primary focus of this article is on step 3: data summarization. We will discuss how coresets can be used to identify a small number of representative data points from a larger dataset. Coresets can be viewed as a special kind of sampling procedure that is aware of its objective, i.e., the machine learning model that needs to be trained on the sampled data. Often, however, the choice of machine learning model that will be trained on the sampled data is not known prior to data collection.

IBM Research, Pennsylvania State University, and Army Research Laboratory have pioneered a technique called “Robust Coresets”. Robust coresets can be used to sample a dataset without knowing the exact type of machine learning model that will be trained on the sampled data. Instead, a special property of the machine learning model (called the Lipschitz constant) determines the tradeoff between the size of the sampled data and the tolerance for error in the machine learning model (relative to training the model on the larger dataset). The technique can be used both to assess the error in the machine learning model for a given sample size and to determine the sample size for a given error tolerance.
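To make the idea of a weighted summary concrete, here is a minimal sketch in Python. It assumes a clustering-based construction: run k-means, keep each cluster center, and weight it by the number of points it represents. This is an illustrative stand-in and not necessarily the exact algorithm from the robust coreset work; the function name `kmeans_coreset`, the synthetic data, and the choice of k are all hypothetical. The loss used here (distance to a fixed point `theta`) is 1-Lipschitz, so the gap between the full-data loss and the coreset loss cannot exceed the average distance from each point to its assigned center.

```python
import math
import random


def kmeans_coreset(points, k, iters=20, seed=0):
    """Hypothetical helper: k-means centers weighted by cluster size."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize from k distinct data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[j].append(p)
        # Move each center to its cluster mean; keep it if the cluster is empty.
        centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    # Final assignment: each weight counts the points nearest to that center.
    weights = [0] * k
    for p in points:
        j = min(range(k), key=lambda c: math.dist(p, centers[c]))
        weights[j] += 1
    return centers, weights


def loss(x, theta):
    """Distance to theta: a 1-Lipschitz function of x (triangle inequality)."""
    return math.dist(x, theta)


# Synthetic 2-D data around three made-up locations (assumption for the demo).
rng = random.Random(1)
data = [
    (rng.gauss(cx, 0.3), rng.gauss(cy, 0.3))
    for cx, cy in [(0, 0), (3, 3), (0, 4)]
    for _ in range(200)
]

centers, weights = kmeans_coreset(data, k=12)
theta = (1.0, 2.0)

full_loss = sum(loss(x, theta) for x in data) / len(data)
coreset_loss = sum(w * loss(c, theta) for c, w in zip(centers, weights)) / len(data)
mean_dist = sum(min(math.dist(x, c) for c in centers) for x in data) / len(data)

print(f"full-data loss: {full_loss:.4f}")
print(f"coreset loss:   {coreset_loss:.4f} (from {len(centers)} weighted points)")
print(f"error bound:    {mean_dist:.4f} (Lipschitz constant 1 x mean distance)")
```

Increasing k shrinks the mean distance to the nearest center and therefore the bound, which is one simple way to see the tradeoff between summary size and error.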

What is a Lipschitz constant? The Lipschitz constant of a function measures how fast the function can change: it bounds the ratio between the change in the output and the change in the input. For intuition, consider a linear classifier with a squared loss. The loss is quadratic in the input, so its slope is bounded on a bounded input range; for example, x^2 on the interval [-1, 1] has a Lipschitz constant of two. A slowly changing loss function has a smaller Lipschitz constant, and thus a lower error (for a given sample size) or a smaller sample size (for a given error). A key realization here is that all the information about the machine learning model that is required for sampling can be captured in this single Lipschitz constant. For many statistical machine learning models, the Lipschitz constant can be easily determined; for many deep learning models, it can be determined empirically. Even with an approximate estimate of this constant, the sampling procedure tends to achieve good results in practice.
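As a rough illustration of determining the constant empirically, the sketch below samples random pairs of inputs and records the largest observed ratio of output change to input change. The function name `estimate_lipschitz` and the sampling scheme are assumptions made for this example. Note that such an estimate can only approach the true constant from below, so it is a lower bound rather than a guarantee.

```python
import random


def estimate_lipschitz(f, low, high, n_pairs=10000, seed=0):
    """Hypothetical helper: largest observed |f(x) - f(y)| / |x - y|."""
    rng = random.Random(seed)
    best = 0.0
    for _ in range(n_pairs):
        x, y = rng.uniform(low, high), rng.uniform(low, high)
        if x != y:
            best = max(best, abs(f(x) - f(y)) / abs(x - y))
    return best


def squared(x):
    return x * x


# For x^2 on [-1, 1] the ratio equals |x + y|, which never exceeds 2.
estimate = estimate_lipschitz(squared, -1.0, 1.0)
print(f"empirical Lipschitz estimate for x^2 on [-1, 1]: {estimate:.3f}")
```

For a deep learning model, `f` would be replaced by the loss evaluated on real inputs, and the pairs would be drawn from the data rather than from a uniform range.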

Coresets, besides being a method of data sampling at the network edge, can change how the value of information is computed. In a crowdsourced environment with multiple data sources, it is often challenging to assess the value of information from each data source. The central cloud can periodically publish a coreset based on the (sampled) data collected from multiple edges; this allows an edge to measure the novelty (value) of its newly collected data with respect to the data already available in the central cloud. This measure of novelty can be used by the edge to filter out redundant data (data with low novelty), and it can also be used to compensate the data source based on the true richness and diversity of its data.
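One simple way to picture such a novelty measure is the distance from a new data point to its nearest neighbor in the published coreset. The sketch below assumes exactly that; the metric, the threshold, and the names `novelty` and `filter_novel` are hypothetical choices for illustration, not the measure used in the cited work.

```python
import math


def novelty(point, coreset):
    """Assumed novelty score: distance to the nearest published coreset point."""
    return min(math.dist(point, c) for c in coreset)


def filter_novel(new_points, coreset, threshold):
    """Keep only the points whose novelty exceeds the threshold."""
    return [p for p in new_points if novelty(p, coreset) > threshold]


# Made-up coreset published by the central cloud, and new data seen at an edge.
published_coreset = [(0.0, 0.0), (3.0, 3.0), (0.0, 4.0)]
new_data = [(0.1, -0.1), (2.9, 3.2), (6.0, 1.0)]

to_send = filter_novel(new_data, published_coreset, threshold=1.0)
print(to_send)
```

Here only the third point is far from everything the cloud already holds, so it is the only one forwarded. The same score could feed a compensation scheme, with higher novelty earning more credit.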

In addition to being a method of sampling, coresets also serve as a bridge to data privacy. Differential privacy is a well-known technique for publicly sharing information about a dataset while withholding information about the individuals in it. Coreset-based techniques support injecting noise into the samples in order to obtain a differentially private coreset. This opens up the possibility of crowdsourcing machine learning data from multiple edges while protecting data privacy.
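The noise injection step can be sketched with the Laplace mechanism, a standard building block of differential privacy. The code below is a minimal illustration under strong assumptions: the `sensitivity` of the coreset to a single individual's data is taken as given, and the names `laplace_noise` and `privatize_coreset` are hypothetical. Whether the output carries a formal privacy guarantee depends on a sensitivity analysis of the coreset construction, which is not covered here.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize_coreset(coreset, sensitivity, epsilon, seed=0):
    """Add independent Laplace noise to every coordinate of every coreset point.

    `sensitivity` is an assumed input; a smaller `epsilon` means more noise.
    """
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    return [
        tuple(value + laplace_noise(scale, rng) for value in point)
        for point in coreset
    ]


# Made-up coreset points and parameters, for illustration only.
coreset = [(0.0, 0.0), (3.0, 3.0), (0.0, 4.0)]
noisy_coreset = privatize_coreset(coreset, sensitivity=0.1, epsilon=1.0)
for original, noisy in zip(coreset, noisy_coreset):
    print(original, "->", tuple(round(v, 3) for v in noisy))
```

The design tension is visible in the `scale` parameter: stronger privacy (smaller epsilon) adds more noise, which in turn degrades how well the coreset represents the original data.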

Returning to the ambulance use case, the coreset technique allows a small subset of the data to be used to train models that identify traffic jams, variations in the patient’s health condition, and so on. Coresets identify the data that needs to be transferred from the edge servers to central distribution points for further analysis or model training. Unnecessary data is discarded to reduce storage and computation at the edge, and the network is not flooded, as only essential data is transferred.

The widespread adoption of coresets as an edge AI function can accelerate data collection, training, and post-deployment monitoring of machine learning models across multiple 5G/edge scenarios. The edge introduces many other data-related issues. For example, how do we learn about a cohort of people with a similar condition at an edge node without sharing the raw data at that node? One option is for each edge node to share a coreset with the central server; an alternative is federated learning, which forms distributed cohorts and creates a global model without revealing raw personal data. Our next blog will examine federated learning algorithms that enable you to implement robust edge data management solutions and that can work together with coresets to benefit from both techniques.


H. Lu, M.-J. Li, T. He, S. Wang, V. Narayanan, K. Chan, “Robust Coreset Construction for Distributed Machine Learning,” IEEE Journal on Selected Areas in Communications, vol. 38, no. 10, Oct. 2020.