A Large-Scale Study of Failures on Petascale Supercomputers

doi:10.1007/s11390-018-1806-7

Abstract

As supercomputers develop rapidly, their scale and complexity keep increasing, and system reliability and resilience face ever greater challenges. Fault-tolerance techniques, whether proactive failure avoidance based on fault prediction, reactive fault tolerance based on checkpointing, or scheduling techniques that improve reliability, all depend on precise qualitative and quantitative descriptions of system fault characteristics. This study analyzes the sources of failures on two typical petascale supercomputers, Sunway BlueLight (based on multi-core CPUs) and Sunway TaihuLight (based on heterogeneous many-core CPUs). It uncovers fault characteristics of the main components and previously unknown correlations among their faults. Finally, the paper analyzes the failure times of the two supercomputers at various resource granularities and over different time spans, and builds a unified multi-dimensional failure time model for petascale supercomputers.
