As supercomputers develop rapidly, their scale and complexity keep increasing, and their reliability and resilience face greater challenges. Fault tolerance relies on several important technologies, such as proactive failure avoidance based on fault prediction, reactive fault tolerance based on checkpointing, and scheduling techniques that improve reliability. Both qualitative and quantitative descriptions of the characteristics of system faults are critical for these technologies. This study analyzes the sources of failures on two typical petascale supercomputers, Sunway BlueLight (based on multi-core CPUs) and Sunway TaihuLight (based on heterogeneous manycore CPUs). It uncovers some interesting fault characteristics and finds previously unknown correlations among the faults of the main components. Finally, the paper analyzes the failure time of the two supercomputers at various resource granularities and over different time spans, and builds a uniform multi-dimensional failure time model for petascale supercomputers.
This work was supported by the National Key Research and Development Program of China under Grant No. 2016YFB0200502.
About the author: Rui-Tao Liu received his Bachelor's degree in computer science and technology from the National University of Defense Technology (NUDT), Changsha, in 2000. He then received his Master's degree in computer software and theory from the Jiangnan Institute of Computing Technology, Wuxi, in 2004. He is currently an engineer and Ph.D. candidate at the State Key Laboratory of Mathematical Engineering and Advanced Computing (MEAC), Wuxi. His research interests include high-performance computing, parallel operating systems, fault tolerance, and big data.
Rui-Tao Liu, Zuo-Ning Chen. A Large-Scale Study of Failures on Petascale Supercomputers[J]. Journal of Computer Science and Technology, 2018, 33(1): 24-41.