Journal of Computer Science and Technology, 2020, Vol. 35, Issue (1): 209-220. doi: 10.1007/s11390-020-9732-x

• Special Section on Applications •

A Case for Adaptive Resource Management in Alibaba Datacenter Using Neural Networks

Sa Wang1,2,3, Member, CCF, ACM, Yan-Hai Zhu4,*, Shan-Pei Chen4, Tian-Ze Wu1,2, Member, CCF, IEEE, Wen-Jie Li1,2, Xu-Sheng Zhan1,2, Hai-Yang Ding4, Wei-Song Shi5, Fellow, IEEE, Yun-Gang Bao1,2,3, Senior Member, CCF, Member, ACM, IEEE   

    1 State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
    2 University of Chinese Academy of Sciences, Beijing 100049, China;
    3 Peng Cheng Laboratory, Shenzhen 518055, China;
    4 Alibaba Inc., Hangzhou 311121, China;
    5 Department of Computer Science, Wayne State University, Michigan, MI 48202, U.S.A.
  • Received: 2019-05-22 Revised: 2019-10-14 Online: 2020-01-05 Published: 2020-01-14
  • Contact: Yan-Hai Zhu
  • About author: Sa Wang received his B.S. degree from the University of Science and Technology of China, Hefei, in 2009 and his Ph.D. degree in computer science from the Chinese Academy of Sciences (CAS), Beijing, in 2016. He is an associate professor at the Institute of Computing Technology (ICT), CAS. His current research interests include operating systems, system performance evaluation and optimization, and distributed systems. He is a member of CCF and ACM.
  • Supported by:
    This work is supported in part by the National Key Research and Development Program of China under Grant No. 2016YFB1000201, the National Natural Science Foundation of China under Grant Nos. 61420106013 and 61702480, and the Youth Innovation Promotion Association of Chinese Academy of Sciences and Alibaba Innovative Research (AIR) Program.

Resource efficiency and application quality of service (QoS) have long been major concerns for datacenter operators, yet they remain irreconcilable. High resource utilization increases the risk of resource contention among co-located workloads, which causes latency-critical (LC) applications to suffer unpredictable, and even unacceptable, performance. Much prior work has been devoted to effective mechanisms that protect the QoS of LC applications while improving resource efficiency. In this paper, we propose MAGI, a resource management runtime that leverages neural networks to monitor and pinpoint the root cause of performance interference, and then adjusts the resource shares of the corresponding applications to ensure the QoS of LC applications. MAGI is a practice in the Alibaba datacenter of providing on-demand resource adjustment for applications using neural networks. Experimental results show that MAGI can reduce the performance degradation of an LC application by up to 87.3% when it is co-located with antagonist applications.

Key words: resource management, neural network, resource efficiency, tail latency

