A Large-Scale Study of Failures on Petascale Supercomputers
Journal of Computer Science and Technology
Journal of Computer Science and Technology, 2018, Vol. 33, Issue 1: 24-41. DOI: 10.1007/s11390-018-1806-7
Computer Architecture and Systems
Rui-Tao Liu (1), Zuo-Ning Chen (2), Fellow, CCF
(1) State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi 214215, China
(2) National Research Center of Parallel Computer Engineering and Technology, Beijing 100190, China

Abstract: As supercomputers develop rapidly, their scale and complexity keep increasing, and their reliability and resilience face greater challenges. Fault tolerance relies on several important technologies, such as proactive failure avoidance based on fault prediction, reactive fault tolerance based on checkpointing, and scheduling techniques that improve reliability. Both qualitative and quantitative descriptions of the characteristics of system faults are critical for these technologies. This study analyzes the sources of failures on two typical petascale supercomputers, Sunway BlueLight (based on multi-core CPUs) and Sunway TaihuLight (based on heterogeneous manycore CPUs). It uncovers some interesting fault characteristics and finds previously unknown correlations among the faults of the main components. Finally, the paper analyzes the failure times of the two supercomputers at various resource granularities and over different time spans, and builds a uniform multi-dimensional failure time model for petascale supercomputers.
Keywords: petascale supercomputer, fault characteristic, correlation relationship, multi-dimension, failure time model
Received: 2017-07-29
Fund: The work is supported by the National Key Research and Development Program of China under Grant No. 2016YFB0200502.

About the author: Rui-Tao Liu received his Bachelor's degree in computer science and technology from the National University of Defense Technology (NUDT), Changsha, in 2000. He then received his Master's degree in computer software and theory from the Jiangnan Institute of Computing Technology, Wuxi, in 2004. He is currently an engineer and Ph.D. candidate at the State Key Laboratory of Mathematical Engineering and Advanced Computing (MEAC), Wuxi. His research interests include high-performance computing, parallel operating systems, fault tolerance, and big data.
Cite this article:
Rui-Tao Liu, Zuo-Ning Chen. A Large-Scale Study of Failures on Petascale Supercomputers[J]. Journal of Computer Science and Technology, 2018, 33(1): 24-41
URL: http://jcst.ict.ac.cn:8080/jcst/EN/10.1007/s11390-018-1806-7