Journal of Computer Science and Technology, 2020, Vol. 35, Issue (1): 209-220. doi: 10.1007/s11390-020-9732-x

• Special Section on Applications

A Case for Adaptive Resource Management in Alibaba Datacenter Using Neural Networks

Sa Wang (1,2,3), Member, CCF, ACM; Yan-Hai Zhu (4,*); Shan-Pei Chen (4); Tian-Ze Wu (1,2), Member, CCF, IEEE; Wen-Jie Li (1,2); Xu-Sheng Zhan (1,2); Hai-Yang Ding (4); Wei-Song Shi (5), Fellow, IEEE; Yun-Gang Bao (1,2,3), Senior Member, CCF, Member, ACM, IEEE

  1 State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
  2 University of Chinese Academy of Sciences, Beijing 100049, China;
  3 Peng Cheng Laboratory, Shenzhen 518055, China;
  4 Alibaba Inc., Hangzhou 311121, China;
  5 Department of Computer Science, Wayne State University, Michigan, MI 48202, U.S.A.
  • Received: 2019-05-22  Revised: 2019-10-14  Online: 2020-01-05  Published: 2020-01-14
  • Contact: Yan-Hai Zhu
  • About author: Sa Wang received his B.S. degree from the University of Science and Technology of China, Hefei, in 2009 and his Ph.D. degree in computer science from the Chinese Academy of Sciences (CAS), Beijing, in 2016. He is an associate professor at the Institute of Computing Technology (ICT), CAS. His current research interests include operating systems, system performance evaluation and optimization, and distributed systems. He is a member of CCF and ACM.
  • Supported by:
    This work is supported in part by the National Key Research and Development Program of China under Grant No. 2016YFB1000201, the National Natural Science Foundation of China under Grant Nos. 61420106013 and 61702480, and the Youth Innovation Promotion Association of Chinese Academy of Sciences and Alibaba Innovative Research (AIR) Program.

Resource efficiency and application QoS have both long been major concerns of datacenter operators, yet the two remain difficult to reconcile. High resource utilization increases the risk of resource contention between co-located workloads, which makes latency-critical (LC) applications suffer unpredictable, and even unacceptable, performance. Much prior work has been devoted to finding effective mechanisms that protect the QoS of LC applications while improving resource efficiency. In this paper, we propose MAGI, a resource management runtime that leverages neural networks to monitor performance interference and pinpoint its root cause, and then adjusts the resource shares of the corresponding applications to ensure the QoS of LC applications. MAGI is a practical effort in the Alibaba datacenter to provide on-demand resource adjustment for applications using neural networks. The experimental results show that MAGI can reduce the performance degradation of an LC application by up to 87.3% when it is co-located with antagonist applications.

Key words: resource management, neural network, resource efficiency, tail latency
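
The abstract describes a monitor, diagnose, and adjust loop but gives no implementation details. The following is only an illustrative sketch of one step of such a loop, not MAGI's actual design: the `App` fields, the `pressure` score, the `step` and `floor` parameters, and the max-score rule standing in for the neural-network diagnosis are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    latency_critical: bool
    cpu_share: float  # hypothetical: fraction of the machine's CPU
    pressure: float   # hypothetical contention score in [0, 1]


def pick_antagonist(apps):
    """Stand-in for the diagnosis step: the abstract uses neural networks here;
    this sketch simply picks the non-LC app with the highest pressure score."""
    candidates = [a for a in apps if not a.latency_critical]
    return max(candidates, key=lambda a: a.pressure) if candidates else None


def adjust(apps, lc_latency_ms, slo_ms, step=0.1, floor=0.1):
    """One control step: if the LC app misses its latency target,
    shrink the suspected antagonist's CPU share (never below `floor`)."""
    if lc_latency_ms <= slo_ms:
        return None  # QoS is met, so nothing is changed
    antagonist = pick_antagonist(apps)
    if antagonist is None:
        return None  # no co-located non-LC app to throttle
    antagonist.cpu_share = max(floor, round(antagonist.cpu_share - step, 2))
    return antagonist.name


apps = [
    App("search", True, 0.5, 0.2),
    App("batch-a", False, 0.3, 0.9),
    App("batch-b", False, 0.2, 0.4),
]
throttled = adjust(apps, lc_latency_ms=12.0, slo_ms=10.0)
```

In a real runtime the adjustment would be applied through an isolation mechanism and re-evaluated every monitoring interval; both are outside the scope of this sketch.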

ISSN 1000-9000(Print)

CN 11-2296/TP

Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
  Copyright ©2015 JCST, All Rights Reserved