As deep learning models become increasingly popular in real-world business applications and training datasets grow ever larger, machine learning (ML) infrastructure is becoming a critical concern for many companies.
To help you stay on top of the latest research advances in ML infrastructure, we’ve summarized some of the most important recent papers in this area. These summaries let you learn from the experience of leading tech companies, including Google, Microsoft, and LinkedIn.
The papers we’ve selected cover data labeling and data validation frameworks, different approaches to distributed training of ML models, a novel approach to tracking ML model performance in production, and more.
If these accessible AI research analyses & summaries are useful for you, you can subscribe to receive our regular industry updates below.
If you’d like to skip around, here are the papers we’ve summarized:
- Snorkel: Rapid Training Data Creation with Weak Supervision
- Data Validation for Machine Learning
- Beyond Data and Model Parallelism for Deep Neural Networks
- Adaptive Communication Strategies to Achieve the Best Error-Runtime Trade-off in Local-Update SGD
- Chiron: Privacy-preserving Machine Learning as a Service
- Towards Federated Learning at Scale: System Design
- TonY: An Orchestrator for Distributed Machine Learning Jobs
- KnowledgeNet: Disaggregated and Distributed Training and Serving of Deep Neural Networks
- Deep Learning Inference Service at Microsoft
- MPP: Model Performance Predictor
ML Infrastructure Research Papers
1. Snorkel: Rapid Training Data Creation with Weak Supervision, by Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, Christopher Ré
Original Abstract
Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research labs. In a user study, subject matter experts build models 2.8x faster and increase predictive performance an average 45.5% versus seven hours of hand labeling. We study the modeling tradeoffs in this new setting and propose an optimizer for automating tradeoff decisions that gives up to 1.8x speedup per pipeline execution. In two collaborations, with the U.S. Department of Veterans Affairs and the U.S. Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides 132% average improvements to predictive performance over prior heuristic approaches and comes within an average 3.60% of the predictive performance of large hand-curated training sets.
Our Summary
The Stanford research team addresses the problem of creating large training datasets for supervised learning. To this end, they present Snorkel, an end-to-end system that enables users to write labeling functions to combine several weak supervision sources in a smart way instead of hand-labeling the training datasets. The experiments demonstrate that Snorkel significantly reduces time and costs spent on data labeling, while models trained with data that was labeled with Snorkel approach the performance of models trained with large, hand-labeled datasets.
What’s the core idea of this paper?
- Hand-labeling of datasets is time-consuming and costly, while datasets labeled through weak supervision often result in limited accuracy and coverage.
- Snorkel offers a better way to combine weak supervision sources by denoising their outputs without access to the ground truth.
- Snorkel’s workflow proceeds in three main stages (sketched in code after this list):
- Users write labeling functions to express various weak supervision sources such as patterns, heuristics, external knowledge bases, etc.
- Snorkel automatically learns a generative model over the labeling functions to estimate their accuracies and correlations.
- The system outputs a set of probabilistic labels.
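To make the workflow above concrete, here is a minimal toy sketch of Snorkel-style labeling functions and the generative label model, using the open-source snorkel package (API names follow recent library versions and may differ from the paper-era code; the spam-detection setup is invented for illustration).

```python
import pandas as pd
from snorkel.labeling import PandasLFApplier, labeling_function
from snorkel.labeling.model import LabelModel

SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

@labeling_function()
def lf_contains_link(x):
    # Heuristic: messages containing a URL are often spam.
    return SPAM if "http" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_short_message(x):
    # Heuristic: very short messages are usually not spam.
    return NOT_SPAM if len(x.text.split()) < 5 else ABSTAIN

df_train = pd.DataFrame(
    {"text": ["check out http://win-a-prize.example now!!!", "thanks, see you tomorrow"]}
)

# Apply the labeling functions to obtain a (noisy) label matrix.
L_train = PandasLFApplier([lf_contains_link, lf_short_message]).apply(df_train)

# The generative label model estimates the accuracies and correlations of the
# labeling functions and denoises their outputs into probabilistic labels.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=200, seed=123)
probabilistic_labels = label_model.predict_proba(L_train)
print(probabilistic_labels)
```

These probabilistic labels can then be used to train any discriminative end model, which is how Snorkel approaches the performance of large hand-labeled training sets.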
What’s the key achievement?
- Snorkel increases the productivity of subject matter experts, allowing them to build models much faster.
- Models trained using Snorkel outperform models trained via distant supervision by an average of 132% and approach the accuracy of models trained on hand-curated data.
What does the AI community think?
- The paper was presented at VLDB 2018, the 44th International Conference on Very Large Data Bases.
What are possible business applications?
- Snorkel is already being deployed for more efficient data labeling by industry, science, and government research groups.
- It is being applied in knowledge base construction, image analysis, bioinformatics, fraud detection, and more.
Where can you get implementation code?
- Snorkel’s implementation is open-source.
2. Data Validation for Machine Learning, by Eric Breck, Neoklis Polyzotis, Sudip Roy, Steven Euijong Whang, Martin Zinkevich
Original Abstract
Machine learning is a powerful tool for gleaning knowledge from massive amounts of data. While a great deal of machine learning research has focused on improving the accuracy and efficiency of training and inference algorithms, there is less attention in the equally important problem of monitoring the quality of data fed to machine learning. The importance of this problem is hard to dispute: errors in the input data can nullify any benefits on speed and accuracy for training and inference. This argument points to a data-centric approach to machine learning that treats training and serving data as an important production asset, on par with the algorithm and infrastructure used for learning.
In this paper, we tackle this problem and present a data validation system that is designed to detect anomalies specifically in data fed into machine learning pipelines. This system is deployed in production as an integral part of TFX – an end-to-end machine learning platform at Google. It is used by hundreds of product teams to continuously monitor and validate several petabytes of production data per day. We faced several challenges in developing our system, most notably around the ability of ML pipelines to soldier on in the face of unexpected patterns, schema-free data, or training/serving skew. We discuss these challenges, the techniques we used to address them, and the various design choices that we made in implementing the system. Finally, we present evidence from the system’s deployment in production that illustrate the tangible benefits of data validation in the context of ML: early detection of errors, model-quality wins from using better data, savings in engineering hours to debug problems, and a shift towards data-centric workflows in model development.
Our Summary
The Google Research team addresses the problem of data quality in machine learning. Specifically, they introduce a data-validation framework with three key components: the data analyzer computes summary statistics that can be used to validate the data, the data validator checks data properties against a defined schema, and the model unit tester uses synthetic input data to check the training code for errors that erroneous input data would otherwise trigger. The data validation system is fully integrated into TFX, an end-to-end machine learning platform at Google, and is used by hundreds of product teams to validate several petabytes of production data per day.
What’s the core idea of this paper?
- The importance of data quality is hard to overestimate, especially for production pipelines, where the serving data eventually becomes training data.
- The three components of Google’s data validation system, namely the data analyzer, data validator, and model unit tester, support the following types of data validation (see the sketch after this list):
- Single-batch validation to detect anomalies in a single batch of data.
- Inter-batch validation to track significant changes between the training and serving data or between successive batches of the training data.
- Model testing to detect any assumptions in the training code that are not reflected in the data (e.g., taking the logarithm of a feature that is a string).
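The techniques behind the data analyzer and data validator are open-sourced as TensorFlow Data Validation (TFDV). Below is a minimal sketch of single-batch validation against an inferred schema, using toy data that we made up for illustration; exact API details may differ between TFDV versions.

```python
import pandas as pd
import tensorflow_data_validation as tfdv

train_df = pd.DataFrame({"age": [23, 45, 31], "country": ["US", "DE", "US"]})
serving_df = pd.DataFrame({"age": [29, -1], "country": ["US", "FR"]})

# Data analyzer: compute summary statistics over each batch of data.
train_stats = tfdv.generate_statistics_from_dataframe(train_df)
serving_stats = tfdv.generate_statistics_from_dataframe(serving_df)

# Infer a schema from the training statistics (it can then be curated by hand).
schema = tfdv.infer_schema(train_stats)

# Data validator: check the serving batch against the schema and report anomalies
# (e.g., the previously unseen country value "FR").
anomalies = tfdv.validate_statistics(serving_stats, schema)
print(anomalies)
```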
What’s the key achievement?
- The data validation framework is deployed across more than 700 machine learning pipelines.
- Test logs over a period of one month showed that model unit tests were executed over 80K times, with 6% of these executions indicating failures due to incorrect assumptions in the training code or an underspecified schema.
- In one of the case studies described in the paper, the data validation framework made it possible to detect a data error and find the root cause in two days, compared to months in similar past instances.
What does the AI community think?
- The paper was presented at the SysML 2019 conference.
What are possible business applications?
- The data validation system can be integrated into different machine learning platforms to improve the quality of the data fed to ML models in production.
Where can you get implementation code?
- The Google Research team has open-sourced the libraries that implement the core techniques described in the paper.
3. Beyond Data and Model Parallelism for Deep Neural Networks, by Zhihao Jia, Matei Zaharia, Alex Aiken
Original Abstract
The computational requirements for training deep neural networks (DNNs) have grown to the point that it is now standard practice to parallelize training. Existing deep learning systems commonly use data or model parallelism, but unfortunately, these strategies often result in suboptimal parallelization performance.
In this paper, we define a more comprehensive search space of parallelization strategies for DNNs called SOAP, which includes strategies to parallelize a DNN in the Sample, Operation, Attribute, and Parameter dimensions. We also propose FlexFlow, a deep learning framework that uses guided randomized search of the SOAP space to find a fast parallelization strategy for a specific parallel machine. To accelerate this search, FlexFlow introduces a novel execution simulator that can accurately predict a parallelization strategy’s performance and is three orders of magnitude faster than prior approaches that have to execute each strategy. We evaluate FlexFlow with six real-world DNN benchmarks on two GPU clusters and show that FlexFlow can increase training throughput by up to 3.8x over state-of-the-art approaches, even when including its search time, and also improves scalability.
Our Summary
The research team from Stanford University proposes a novel approach to discovering the best parallelization strategies for training large deep learning models. First, they propose to parallelize deep neural networks along the sample, operation, attribute, and parameter (SOAP) dimensions. Next, the researchers introduce FlexFlow, a deep learning framework that leverages guided randomized search to find the best parallelization strategy over this broad search space. The search is accelerated by an execution simulator that quickly predicts the performance of a candidate parallelization strategy without actually executing it. The team demonstrates that FlexFlow increases deep neural network training throughput by up to 3.8× and reduces communication costs by up to 5× compared to state-of-the-art parallelization approaches.
What’s the core idea of this paper?
- The research team defines a new search space for finding the fastest parallelization strategy. This search space includes sample, operation, attribute, and parameter (SOAP) dimensions.
- Then, the authors introduce FlexFlow, a deep learning framework that can search for the best parallelization strategy throughout this entire SOAP space:
- Execution time is kept low by using an execution simulator, which predicts the performance of parallelization strategies without needing to execute them.
- An execution optimizer leverages the results from the execution simulator and iteratively proposes candidate parallelization strategies using a Markov chain Monte Carlo (MCMC) search algorithm (illustrated in the sketch below).
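For intuition, the optimizer can be pictured as a Metropolis-style MCMC loop driven by a fast cost model. The toy Python sketch below only illustrates this search pattern; it is not FlexFlow’s actual code, and the `simulate_runtime` cost model and the notion of a “strategy” here are invented for the example.

```python
import math
import random

def mcmc_search(initial_strategy, propose, simulate_runtime, steps=1000, beta=5.0):
    """Metropolis-style search: always accept a faster candidate strategy;
    accept a slower one with probability exp(-beta * runtime_increase)."""
    current, current_cost = initial_strategy, simulate_runtime(initial_strategy)
    best, best_cost = current, current_cost
    for _ in range(steps):
        candidate = propose(current)
        cost = simulate_runtime(candidate)
        if cost < current_cost or random.random() < math.exp(-beta * (cost - current_cost)):
            current, current_cost = candidate, cost
            if cost < best_cost:
                best, best_cost = candidate, cost
    return best, best_cost

# Toy usage: a "strategy" is a per-layer degree of data parallelism.
NUM_LAYERS = 4

def propose(strategy):
    s = list(strategy)
    s[random.randrange(NUM_LAYERS)] = random.choice([1, 2, 4, 8])
    return tuple(s)

def simulate_runtime(strategy):
    # Hypothetical cost model: compute time shrinks with parallelism,
    # while communication overhead grows with it.
    return sum(1.0 / d + 0.05 * d for d in strategy)

print(mcmc_search((1,) * NUM_LAYERS, propose, simulate_runtime))
```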
What’s the key achievement?
- FlexFlow outperforms the state-of-the-art parallelization approaches while improving scalability by:
- increasing training throughput by up to 3.8 times;
- reducing communication costs by up to 5 times.
What does the AI community think?
- The paper was presented at the SysML 2019 conference.
What are future research areas?
- Exploring the possibility of extending the introduced approach to applications in which execution time is data-dependent.
What are possible business applications?
- The suggested framework can significantly increase training throughput by finding the fastest parallelization strategies, and this can be applied to a variety of deep learning tasks including image and text classification, neural machine translation, and language modeling.
Where can you get implementation code?
- The FlexFlow implementation is available on GitHub.
4. Adaptive Communication Strategies to Achieve the Best Error-Runtime Trade-off in Local-Update SGD, by Jianyu Wang and Gauri Joshi
Original Abstract
Large-scale machine learning training, in particular distributed stochastic gradient descent, needs to be robust to inherent system variability such as node straggling and random communication delays. This work considers a distributed training framework where each worker node is allowed to perform local model updates and the resulting models are averaged periodically. We analyze the true speed of error convergence with respect to wall-clock time (instead of the number of iterations), and analyze how it is affected by the frequency of averaging. The main contribution is the design of AdaComm, an adaptive communication strategy that starts with infrequent averaging to save communication delay and improve convergence speed, and then increases the communication frequency in order to achieve a low error floor. Rigorous experiments on training deep neural networks show that AdaComm can take 3× less time than fully synchronous SGD, and still reach the same final training loss.
Our Summary
The research team from Carnegie Mellon University addresses the performance of distributed stochastic gradient descent (SGD) in real-world settings with inherent variability in the computing infrastructure. Specifically, the researchers consider a framework where every worker node performs local model updates, with periodic averaging of the resulting models. To make distributed SGD fast yet robust, the authors introduce AdaComm, an adaptive communication strategy that starts with infrequent averaging to speed up convergence and then increases the averaging frequency to reach a lower error floor. The experiments demonstrate that AdaComm can reduce training time by 3× compared to fully synchronous distributed SGD while achieving comparable error rates.
What’s the core idea of this paper?
- Distributed SGD increases the amount of data processed per iteration but exposes training to unpredictable node slowdowns and communication delays.
- To address this problem, the research team first quantified the runtime speed-up of periodic-averaging SGD over fully synchronous SGD and discovered that the periodic-averaging strategy reduces communication delays and mitigates synchronization delays.
- Following these findings, the researchers introduced AdaComm, an adaptive communication strategy that uses a modified form of periodic-averaging SGD to reduce runtime without raising error rates (a simplified sketch follows this list):
- AdaComm starts with infrequent averaging to maximize convergence speed.
- As the model gets closer to convergence, the frequency of averaging is increased to lower the error floor of the model.
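Here is a simplified PyTorch sketch of local-update SGD with periodic averaging and an AdaComm-style schedule that shrinks the communication period τ (i.e., averages more often) as training progresses. The schedule and helper names are illustrative assumptions, not the authors’ implementation, which adapts τ based on the training loss.

```python
import torch

def average_models(models):
    # Average parameters across worker replicas in place (simulating all-reduce).
    with torch.no_grad():
        for params in zip(*(m.parameters() for m in models)):
            mean = torch.mean(torch.stack([p.data for p in params]), dim=0)
            for p in params:
                p.data.copy_(mean)

def local_update_sgd(model_fn, loaders, loss_fn, rounds=10, tau0=16, lr=0.1):
    workers = [model_fn() for _ in loaders]
    average_models(workers)  # start all workers from a common initialization
    optimizers = [torch.optim.SGD(w.parameters(), lr=lr) for w in workers]
    for r in range(rounds):
        tau = max(1, tau0 // (r + 1))  # AdaComm idea: communicate more often over time
        for worker, opt, loader in zip(workers, optimizers, loaders):
            for step, (x, y) in enumerate(loader):
                if step >= tau:
                    break
                opt.zero_grad()
                loss_fn(worker(x), y).backward()
                opt.step()
        average_models(workers)  # periodic model averaging (one communication round)
    return workers[0]
```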
What’s the key achievement?
- The experiments demonstrate that AdaComm significantly speeds up convergence while achieving the same error floor as synchronous SGD:
- On VGG-16, AdaComm reaches a training loss of 4.5 × 10⁻² in 11.5 minutes, compared to 38 minutes for fully synchronous SGD.
- On ResNet-50, the adaptive communication strategy reaches a training loss of 3 × 10⁻² in 15 minutes, compared to 21.5 minutes for fully synchronous SGD.
What does the AI community think?
- The paper was presented at the SysML 2019 conference.
What are future research areas?
- Generalizing the adaptive communication strategy to other SGD frameworks, including elastic-averaging, decentralized SGD, and parameter server-based training.
What are possible business applications?
- The introduced approach to speeding up the model convergence can be applied to any use case where large deep neural networks are used.
5. Chiron: Privacy-preserving Machine Learning as a Service, by Tyler Hunt, Congzheng Song, Reza Shokri, Vitaly Shmatikov, Emmett Witchel
Original Abstract
Major cloud operators offer machine learning (ML) as a service, enabling customers who have the data but not ML expertise or infrastructure to train predictive models on this data. Existing ML-as-a-service platforms require users to reveal all training data to the service operator. We design, implement, and evaluate Chiron, a system for privacy-preserving machine learning as a service. First, Chiron conceals the training data from the service operator. Second, in keeping with how many existing ML-as-a-service platforms work, Chiron reveals neither the training algorithm nor the model structure to the user, providing only black-box access to the trained model. Chiron is implemented using SGX enclaves, but SGX alone does not achieve the dual goals of data privacy and model confidentiality. Chiron runs the standard ML training toolchain (including the popular Theano framework and C compiler) in an enclave, but the untrusted model-creation code from the service operator is further confined in a Ryoan sandbox to prevent it from leaking the training data outside the enclave. To support distributed training, Chiron executes multiple concurrent enclaves that exchange model parameters via a parameter server. We evaluate Chiron on popular deep learning models, focusing on benchmark image classification tasks such as CIFAR and ImageNet, and show that its training performance and accuracy of the resulting models are practical for common uses of ML-as-a-service.
Our Summary
Machine learning as a service is offered by Google, Amazon, Microsoft, and other companies as a way to let businesses without in-house machine learning expertise harness this technology for their data. However, cloud-based machine learning as a service raises data privacy concerns: the service operator may intentionally or unintentionally access clients’ data, or the data may be probed by malicious outsiders. In this paper, the authors propose Chiron, a system that lets clients use machine learning as a service without revealing their training data to the service provider, while the provider keeps the training algorithm and model structure confidential and, following common practice, exposes only black-box access to the trained model. Chiron protects the data using a confined Ryoan sandbox, which prevents the operator’s code from exfiltrating clients’ data while keeping the machine learning code itself unavailable to clients.
What’s the core idea of this paper?
- Machine learning as a service currently does not protect users’ data in the cloud from the machine learning operator or malicious third parties.
- This paper presents Chiron, a system that protects users’ data and operators’ machine learning code by preventing either party from accessing information they are not intended to access:
- The system uses a Ryoan sandbox built on a hardware-protected enclave such as Intel’s SGX. It enables the service provider’s code to access the user’s data for defining and training the model but prevents it from exfiltrating data.
- Users see that the enclave is executing a Ryoan sandbox extended with standard ML toolchain code, but don’t have access to the specifics of the model being trained.
What’s the key achievement?
- The researchers developed a system in which machine learning algorithms can be applied without disclosing the training data to the service provider, while the provider’s code remains hidden from data owners.
- The experiments demonstrate that the use of Chiron is practical and has only a moderate negative effect on the speed and accuracy of the resulting models:
- Chiron slows down ImageNetLite training by 16% but preserves the accuracy of the trained model.
- When the ideal parameter exchange policy is applied, Chiron lags the baseline on the CIFAR task by 4-20%.
What are possible business applications?
- Increased use of machine learning as a service in industries that require privacy by law (e.g., healthcare).
6. Towards Federated Learning at Scale: System Design, by Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub Konečný, Stefano Mazzocchi, H. Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, Jason Roselander
Original Abstract
Federated Learning is a distributed machine learning approach which enables model training on a large corpus of decentralized data. We have built a scalable production system for Federated Learning in the domain of mobile devices, based on TensorFlow. In this paper, we describe the resulting high-level design, sketch some of the challenges and their solutions, and touch upon the open problems and future directions.
Our Summary
The Google research team introduces a new scalable production system for Federated Learning (FL). This system allows a deep neural network to be trained on data stored on mobile phones. The weights from multiple devices are combined in the cloud to form a global model, which is then pushed back to the phones for inference. The communication protocol addresses numerous practical issues, including limited device storage and computational resources, unreliable connectivity, time-zone dependencies, and more. The system is currently mature enough to run in production across tens of millions of real-world devices, and the researchers anticipate that, once several remaining problems are solved, it will scale to billions of devices.
What’s the core idea of this paper?
- The paper presents a scalable production system for Federated Learning that allows a deep neural network to be trained, using TensorFlow, on data residing on millions of mobile devices.
- The communication protocol of this system enables devices to participate in the training of a global model through multiple rounds, where each round consists of three phases:
- Selection. The server selects a subset of connected devices that meet the eligibility criteria (e.g., charging and connected to an unmetered network). Typically, a few hundred devices participate in each round.
- Configuration. The server is configured according to the selected aggregation mechanism. Then the global model, together with the FL plan and an FL checkpoint, is sent to each of the devices.
- Reporting. The participating devices report their updates to the server, which aggregates them using Federated Averaging (sketched below). The global model is updated if enough devices report in time; otherwise, the round is abandoned.
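As an illustration of the aggregation step in the reporting phase, here is a minimal PyTorch sketch of Federated Averaging. It is a simplified, assumption-laden example, not Google’s production code.

```python
import torch

def federated_averaging(global_model, device_states, device_weights):
    """Weighted-average the model states reported by participating devices
    into the global model. `device_weights` would typically be the number of
    local training examples on each device."""
    total = float(sum(device_weights))
    averaged_state = {}
    for key in global_model.state_dict():
        averaged_state[key] = sum(
            (weight / total) * state[key].float()
            for state, weight in zip(device_states, device_weights)
        )
    global_model.load_state_dict(averaged_state)
    return global_model
```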
What’s the key achievement?
- Introducing a new scalable production system for Federated Learning.
- Identifying continuing challenges in Federated Learning and avenues for future work.
What does the AI community think?
- The paper was presented at the SysML 2019 conference.
What are future research areas?
- Reducing bias arising from the fact that devices are not equally likely to participate in each round (e.g., phones without access to unmetered networks and with memory below 2 GB do not participate).
- Reducing convergence time for Federated Learning.
- Generalizing Federated Learning to Federated Computation, which is not restricted to machine learning and TensorFlow, but can handle general MapReduce-like workloads.
What are possible business applications?
- Federated Learning is most applicable in situations where the on-device data is more relevant than data stored on servers – for example, when devices generate the data, or when data is privacy-sensitive.
- The authors suggest the following applications for their system:
- on-device item ranking;
- content suggestions for on-device keyboards;
- next word prediction.
7. TonY: An Orchestrator for Distributed Machine Learning Jobs, by Anthony Hsu, Keqiu Hu, Jonathan Hung, Arun Suresh, Zhe Zhang
Original Abstract
Training machine learning (ML) models on large datasets requires considerable computing power. To speed up training, it is typical to distribute training across several machines, often with specialized hardware like GPUs or TPUs. Managing a distributed training job is complex and requires dealing with resource contention, distributed configurations, monitoring, and fault tolerance. In this paper, we describe TonY, an open-source orchestrator for distributed ML jobs built at LinkedIn to address these challenges.
Our Summary
The LinkedIn research team addresses the key challenges of distributed model training: contention for the same memory, CPU, and GPU resources; tedious and error-prone configuration; and a lack of monitoring and fault tolerance. To overcome these challenges, the LinkedIn team built and open-sourced TonY, an orchestrator for distributed machine learning jobs. TonY consists of a client that lets users submit their jobs and an application that allocates resources, sets up configurations, and launches the ML jobs in a distributed fashion.
What’s the core idea of this paper?
- Reflecting the ML framework and scheduler most commonly used at LinkedIn, the initial implementation of TonY supports running distributed TensorFlow jobs on Hadoop YARN (Yet Another Resource Negotiator).
- TonY consists of the following components:
- A client for submitting jobs to a scheduler. In an XML file, users describe the resources their job requires and provide the path to their ML program, along with the virtual environment or Docker image in which the program should run on the cluster. They may also specify additional properties, including model-specific hyperparameters, input data, and output location.
- An application that runs in the scheduler. Here the ResourceManager allocates task containers to the TonY ApplicationMaster. Afterward, the TaskExecutor is launched in each task container.
What’s the key achievement?
- Introducing a tool that manages distributed training jobs, ensures fault tolerance, supports monitoring and visualization of training jobs, and collects metrics on task performance and resource utilization.
What are future research areas?
- Implementing new features in TonY, such as aggregation and analysis of task performance in a UI to suggest new settings for the machine learning jobs.
What are possible business applications?
- With TonY, ML engineers can avoid writing ad-hoc scripts for launching distributed ML jobs on a pool of machines without resource guarantees. This can dramatically improve the efficiency of distributed model training.
Where can you get implementation code?
- TonY is open-source and available on GitHub.
8. KnowledgeNet: Disaggregated and Distributed Training and Serving of Deep Neural Networks, by Saman Biookaghazadeh, Yitao Chen, Kaiqi Zhao, and Ming Zhao
Original Abstract
Deep Neural Networks (DNNs) have a significant impact on numerous applications, such as video processing, virtual/augmented reality, and text processing. The ever-changing environment forces the DNN models to evolve, accordingly. Also, the transition from the cloud-only to edge-cloud paradigm has made the deployment and training of these models challenging. Addressing these challenges requires new methods and systems for continuous training and distribution of these models in a heterogeneous environment.
In this paper, we propose KnowledgeNet (KN), which is a new architectural technique for a simple disaggregation and distribution of the neural networks for both training and serving. Using KN, DNNs can be partitioned into multiple small blocks and be deployed on a distributed set of computational nodes. Also, KN utilizes the knowledge transfer technique to provide small scale models with high accuracy in edge scenarios with limited resources. Preliminary results show that our new method can ensure a state-of-the-art accuracy for a DNN model while being disaggregated among multiple workers. Also, by using knowledge transfer technique, we can compress the model by 62% for deployment, while maintaining the same accuracy.
Our Summary
The researchers from Arizona State University propose KnowledgeNet (KN), a new architectural technique for disaggregated and distributed model training and serving. The approach suggests splitting a large deep neural network into several small models, with each of these models being deployed on an independent processing node. They then use the synthetic gradient method to generate the target gradient for each section, asynchronously. The authors also develop a new knowledge transfer technique, where a model deployed on the edge receives supervision from the oracle model on the cloud. The experiments demonstrate that the synthetic gradient approach presented in the paper achieves accuracy comparable to the conventional backpropagation approach, while the knowledge transfer technique results in accuracy that is slightly lower compared to the large-scale model.
Representation of the model in the KN setting: each dashed box is mapped onto a distinct processor, and the synthetic gradients are generated asynchronously using extra components (represented as M blocks).
What’s the core idea of this paper?
- The proposed KnowledgeNet technique utilizes two methods for enabling disaggregated and distributed model training and serving:
- When a large neural network is split into multiple small models, the conventional training approach suffers from the lack of communication between the layers of the model. To address this problem, the researchers use synthetic gradients to generate the target gradient for each section asynchronously.
- Next, to compress the model so that it can be deployed on edge devices with limited computational capacity, the researchers introduce a knowledge transfer technique that transforms the deep neural network into two equivalent models: (1) a large oracle model on the cloud and (2) a small counterpart model on the edge, with the small model receiving supervision from the oracle model (see the sketch after this list).
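The knowledge-transfer step is closely related to knowledge distillation. The sketch below is an illustrative assumption, not the authors’ code: a small “edge” student model is trained to match the temperature-softened predictions of the large “cloud” oracle model in addition to the ground-truth labels.

```python
import torch
import torch.nn.functional as F

def knowledge_transfer_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend hard-label cross-entropy with KL divergence to the oracle's
    temperature-softened output distribution (standard distillation loss)."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard + (1.0 - alpha) * soft
```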
What’s the key achievement?
- The experiments on the MNIST dataset with one convolutional layer and three fully-connected layers show that after training for 500K iterations the synthetic gradient approach achieves 97.7% accuracy, which is comparable to the 98.4% accuracy achieved by the backpropagation approach.
- Evaluation of the knowledge transfer technique demonstrates that an independent student model with 3.2M parameters that leverages the knowledge of the larger network achieves slightly lower performance (61.24%) than the teacher model VGG16 with 8.5M parameters (74.12%).
What are possible business applications?
- The introduced approach can be used for disaggregation and distribution of deep neural networks to enable continuous training of these models on edge devices with limited computational capacities.
9. Deep Learning Inference Service at Microsoft, by Jonathan Soifer, Jason Li, Mingqin Li, Jeffrey Zhu, Yingnan Li, Yuxiong He, Elton Zheng, Adi Oltean, Maya Mosyak, Chris Barnes, Thomas Liu, Junhua Wang
Original Abstract
This paper introduces the Deep Learning Inference Service, an online production service at Microsoft for ultra-low-latency deep neural network model inference. We present the system architecture and deep dive into core concepts such as intelligent model placement, heterogeneous resource management, resource isolation, and efficient routing. We also present production scale and performance numbers.
Our Summary
The Microsoft research team introduces their production system for ultra-low-latency deep neural network model inference. This Deep Learning Inference Service (DLIS) consists of Model Master (MM), a singleton orchestrator for intelligent provisioning of model containers onto one or more servers, and Model Servers (MS) for routing and model execution. Currently, DLIS is serving three million calls per second across tens of thousands of model instances.
DLIS architecture
What’s the core idea of this paper?
- Different models perform differently across hardware; for example, convolutional networks show better performance on GPUs, while recurrent neural networks perform better on FPGAs or CPUs. DLIS is able to understand the requirements of deep neural networks and place them intelligently onto machine hardware.
- To ensure that model instances do not interfere with each other, model servers isolate model instances in containers. Resource isolation is enforced in the form of processor affinity, hardware-supported affinity, and memory restrictions.
- Because traffic is often bursty, Microsoft needs to ensure efficient routing. The MS router supports backup requests when the first request is likely to miss the SLA. Specifically, the Microsoft team reports that sending backup requests at 2 ms with cross-server cancellation gives the best latency improvement for the least amount of extra computation (sketched below).
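The backup-request pattern can be sketched with asyncio: if the primary replica has not responded within a small delay (2 ms in the setting above), the same request is sent to a backup server and the slower of the two is cancelled. This is an illustrative sketch under those assumptions, not Microsoft’s implementation.

```python
import asyncio

async def call_with_backup(call_primary, call_backup, backup_delay=0.002):
    """Send the request to the primary replica; if it hasn't finished within
    `backup_delay` seconds, also send it to a backup replica and return the
    first result, cancelling the slower call (cross-server cancellation)."""
    primary = asyncio.ensure_future(call_primary())
    try:
        return await asyncio.wait_for(asyncio.shield(primary), timeout=backup_delay)
    except asyncio.TimeoutError:
        backup = asyncio.ensure_future(call_backup())
        done, pending = await asyncio.wait(
            {primary, backup}, return_when=asyncio.FIRST_COMPLETED
        )
        for task in pending:
            task.cancel()
        return done.pop().result()
```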
What’s the key achievement?
- The DLIS platform serves as the inference backend for many tasks across Microsoft, including web search, advertising, and Office intelligence.
- It handles several million inference calls per second, served from tens of thousands of model instances.
- MS is flexible, runs on both Windows and Linux, and supports orchestrators outside of MM, including YARN and Kubernetes.
What are possible business applications?
- The introduced system can be used to ensure the efficient deployment of deep neural networks on heterogeneous data-center hardware.
10. MPP: Model Performance Predictor, by Sindhu Ghanta, Sriram Subramanian, Lior Khermosh, Harshil Shah, Yakov Goldberg, Swaminathan Sundararaman, Drew Roselli, Nisha Talagala
Original Abstract
Operations is a key challenge in the domain of machine learning pipeline deployments involving monitoring and management of real-time prediction quality. Typically, metrics like accuracy, RMSE etc., are used to track the performance of models in deployment. However, these metrics cannot be calculated in production due to the absence of labels. We propose using an ML algorithm, Model Performance Predictor (MPP), to track the performance of the models in deployment. We argue that an ensemble of such metrics can be used to create a score representing the prediction quality in production. This in turn facilitates formulation and customization of ML alerts, that can be escalated by an operations team to the data science team. Such a score automates monitoring and enables ML deployments at scale.
Our Summary
The usual metrics like accuracy and RMSE cannot be applied in production because of the absence of labels. To track the performance of models in deployment, the ParallelM research team introduces Model Performance Predictor (MPP), an ML algorithm that predicts the error rate of a model in production. The algorithm is trained on the error dataset created from prediction errors made by the primary algorithm. The experiments on classification and regression tasks demonstrate that the MPP algorithm is able to track the performance of the primary algorithm in most cases.
What’s the core idea of this paper?
- A model’s performance in production depends on the particular data it receives, and this data can significantly vary with external factors. Thus, it’s extremely important to track the performance of models in deployment.
- While the primary model focuses on the main prediction task, the MPP algorithm, introduced in this paper, tries to predict the performance of the primary model.
- The suggested algorithm is trained on the error dataset which comes from the prediction errors of the primary algorithm.
- MPP is essentially a binary classifier that predicts whether a prediction of the primary model is correct (1) or incorrect (0), as illustrated in the sketch after this list:
- For classification tasks, the categorization into correct and incorrect predictions is straightforward.
- For regression tasks, the prediction is assumed to be correct as long as the error is within ±ɛ of the true value. The value of ɛ is derived from the knee of the REC curve.
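A minimal scikit-learn sketch of this idea: a secondary classifier is trained on the “error dataset” (was the primary model right or wrong on each held-out example?) and then used to estimate accuracy on unlabeled production data. The data and model choices below are hypothetical, for illustration only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Hypothetical primary task: binary classification on synthetic features.
X = rng.rand(2000, 10)
y = (X[:, 0] + X[:, 1] + 0.3 * rng.randn(2000) > 1.0).astype(int)
X_train, X_held, y_train, y_held = train_test_split(X, y, test_size=0.5, random_state=0)

primary_model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Build the error dataset: 1 if the primary prediction was correct, else 0.
correct = (primary_model.predict(X_held) == y_held).astype(int)

# MPP: a binary classifier that predicts whether the primary model is correct.
mpp = GradientBoostingClassifier(random_state=0).fit(X_held, correct)

# In production (no labels), estimate accuracy from MPP's predictions.
X_production = rng.rand(500, 10)
estimated_accuracy = mpp.predict(X_production).mean()
print(f"Estimated production accuracy: {estimated_accuracy:.2f}")
```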
What’s the key achievement?
- The experiments on five classification tasks and five regression tasks demonstrate that in most cases, the MPP prediction is very close to the actual error of the primary algorithm. Specifically, in 6 out of 10 tasks, the absolute difference between the predicted error and the actual error was within 0.03.
What are possible business applications?
- The MPP algorithm can assist operations teams in monitoring and managing deployed machine learning models by preventing catastrophic predictions.
Enjoy this article? Sign up for more AI research updates.
We’ll let you know when we release more summary articles like this one.