A Large-Scale Study of Failures on Petascale Supercomputers
Journal of Computer Science and Technology
Indexed by   SCIE, EI ...
Bimonthly    Since 1986
Journal of Computer Science and Technology, 2018, Vol. 33, Issue 1: 24-41. DOI: 10.1007/s11390-018-1806-7
Computer Architecture and Systems
A Large-Scale Study of Failures on Petascale Supercomputers
Rui-Tao Liu1, Zuo-Ning Chen2, Fellow, CCF
1 State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi 214215, China;
2 National Research Center of Parallel Computer Engineering and Technology, Beijing 100190, China

Abstract: With the rapid development of supercomputers, system scale and complexity keep increasing, and reliability and resilience face growing challenges. Fault tolerance relies on several important technologies, such as proactive failure avoidance based on fault prediction, reactive fault tolerance based on checkpointing, and scheduling techniques that improve reliability. All of them depend on precise qualitative and quantitative descriptions of system fault characteristics. This study analyzes the sources of failures on two typical petascale supercomputers, Sunway BlueLight (based on multi-core CPUs) and Sunway TaihuLight (based on heterogeneous manycore CPUs). It uncovers previously unknown occurrence characteristics of faults in the main components and correlations among those faults. Finally, the paper analyzes the failure times of the two supercomputers at various resource granularities and over different time spans, and builds a unified multi-dimensional failure time model for petascale supercomputers.
Keywords: petascale supercomputer; fault characteristic; correlation relationship; multi-dimension; failure time model
Received: 2017-07-29
Funding:

This work is supported by the National Key Research and Development Program of China under Grant No. 2016YFB0200502.

About the author: Rui-Tao Liu received his Bachelor's degree in computer science and technology from the National University of Defense Technology (NUDT), Changsha, in 2000. He then received his Master's degree in computer software and theory from the Jiangnan Institute of Computing Technology, Wuxi, in 2004. He is currently an engineer and Ph.D. candidate at the State Key Laboratory of Mathematical Engineering and Advanced Computing (MEAC), Wuxi. His research interests include high-performance computing, parallel operating systems, fault tolerance, and big data.
Cite this article:
Rui-Tao Liu, Zuo-Ning Chen. A Large-Scale Study of Failures on Petascale Supercomputers [J]. Journal of Computer Science and Technology, 2018, 33(1): 24-41
Link to this article:
http://jcst.ict.ac.cn:8080/jcst/CN/10.1007/s11390-018-1806-7