Journal of Computer Science and Technology, 2020, Vol. 35, Issue (1): 221-230. doi: 10.1007/s11390-019-1951-7

• Special Section on Applications

AquaSee: Predict Load and Cooling System Faults of Supercomputers Using Chilled Water Data

Yu-Qi Li1, Li-Quan Xiao2, Jing-Hua Feng1,2, Bin Xu1, Jian Zhang1   

  1 National Supercomputer Center in Tianjin, Tianjin 300450, China;
  2 College of Computer, National University of Defense Technology, Changsha 410073, China
  • Received: 2019-05-21  Revised: 2019-08-19  Online: 2020-01-05  Published: 2020-01-14
  • About author: Yu-Qi Li received his Bachelor's degree in computer science from Nanchang University, Nanchang, in 2012, and his Master's degree in software engineering from Nankai University, Tianjin, in 2017. He has worked as an engineer at NSCC (National Supercomputer Center in Tianjin), Tianjin, for six years. His main research interests are high performance computing (HPC), machine learning, and supercomputer R&D (research and development) and monitoring.
  • Supported by:
    The work was supported by the National Key Research and Development Program of China under Grant No. 2016YFB0201800.

An analysis of real-world operational data from the Tianhe-1A (TH-1A) supercomputer system shows that chilled water data not only reflect the status of the chiller system but are also related to supercomputer load. This study proposes AquaSee, a method that predicts the load and cooling system faults of supercomputers from chilled water pressure and temperature data. The method is validated on real-world operational data of the TH-1A supercomputer system at the National Supercomputer Center in Tianjin. Prediction models are constructed from datasets of various compositions and with different prediction sequence lengths. Experimental results show that a model using both pressure and temperature data performs better than one using pressure or temperature data alone, and that the best inference sequence length is two points. Furthermore, an anomaly monitoring system based on chilled water data is set up to help engineers detect chiller system anomalies.
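The abstract does not specify the model or the data layout, so the following is only a minimal sketch of the two ideas it names: building inputs from a short sequence of combined pressure and temperature readings, and flagging readings that deviate from a prediction. The function names `make_windows` and `flag_anomalies`, the choice of the next load value as the target, the fixed threshold, and the toy values are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def make_windows(
    pressure: Sequence[float],
    temperature: Sequence[float],
    load: Sequence[float],
    seq_len: int = 2,
) -> Tuple[List[List[float]], List[float]]:
    """Pair each run of `seq_len` (pressure, temperature) readings
    with the load value that immediately follows it (an assumed target)."""
    if not (len(pressure) == len(temperature) == len(load)):
        raise ValueError("all series must have the same length")
    if seq_len < 1:
        raise ValueError("seq_len must be at least 1")
    features: List[List[float]] = []
    targets: List[float] = []
    for end in range(seq_len, len(load)):
        window: List[float] = []
        for i in range(end - seq_len, end):
            # Combine both signals, as the abstract reports works best.
            window.extend([pressure[i], temperature[i]])
        features.append(window)
        targets.append(load[end])
    return features, targets


def flag_anomalies(
    observed: Sequence[float],
    predicted: Sequence[float],
    threshold: float,
) -> List[int]:
    """Return indices where the absolute prediction error exceeds `threshold`.
    A fixed threshold is an illustrative assumption, not the paper's rule."""
    return [
        i
        for i, (obs, pred) in enumerate(zip(observed, predicted))
        if abs(obs - pred) > threshold
    ]


# Toy values, invented purely for illustration.
pressure = [0.30, 0.31, 0.33, 0.32]
temperature = [12.0, 12.4, 12.9, 12.6]
load = [0.50, 0.55, 0.62, 0.58]

X, y = make_windows(pressure, temperature, load, seq_len=2)
suspicious = flag_anomalies([12.0, 12.1, 15.0], [12.0, 12.0, 12.2], threshold=1.0)
```

Any regression or sequence model could be trained on `X` and `y`; with `seq_len=2`, each input row holds four values (two pressure and two temperature readings).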

Key words: supercomputer, chilled water data, sensor network, load prediction

ISSN 1000-9000 (Print), 1860-4749 (Online)
CN 11-2296/TP
