[1] Cappello F. Resilience:One of the main challenges for exascale computing. Technical Report of the INRIA-Illinois Joint Laboratory, 2011.[2] Kusnezov D, Binkley s, Harrod B, Meisner B. DOE exascale initiative. Technical Report of US Department of Energy (DOE), 2013. https://energy.gov/downloads/doe-exascaleinitiative, Dec. 2017.[3] Kogge P, Bergman K, Borkar S et al. Exascale computing study:Technology challenges in achieving exascale systems. 2008. http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf, Dec. 2017.[4] Schroeder B, Gibson G A. A large-scale study of failures in high-performance computing systems. IEEE Transactions on Dependable and Secure Computing, 20107(4):337-350[5] Liang Y, Zhang Y, Jette M, Sivasubramaniam A, Sahoo R. BlueGene/L failure analysis and prediction models. In Proc. the 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), June 2006, pp.425-434.[6] Zheng Z, Lan Z, Park B H et al. System log pre-processing to improve failure prediction. In Proc. IEEE/IFIP International Conference Dependable Systems and Networks, June 29-July 2, 2009.[7] Zheng Z, Yu L, Tang W et al. Co-analysis of RAS log and job log on Blue Gene/P. In Proc. the 2011 IEEE International Parallel & Distributed Processing Symposium, May 2011 pp.840-851.[8] Zheng Z, Lan Z. Reliability-aware scalability models for high performance computing. In Proc. IEEE International Conference Cluster Computing and Workshops, Aug. 31-Sept. 4, 2009.[9] Heien E, LaPine D, Kondo D et al. Modeling and tolerating heterogeneous failures in large parallel systems. In Proc. the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, Nov. 2011, Article No. 45.[10] Nie B, Tiwari D, Gupta S et al. A large-scale study of softerrors on GPUs in the field. In Proc. the 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), March 2016, pp.519-530.[11] Schroeder B, Pinheiro E, Weber W. DRAM errors in the wild:A large-scale field study. In Proc. the 11th International Joint Conference on Measurement and Modeling of Computer Systems, June 2009, pp.193-204.[12] Pinheiro E, Weber W, Barroso L A. Failure trends in a large disk drive population. In Proc. the 5th USENIX Conference on File and Storage Technologies, February 2007, pp.17-28.[13] Gunawi H S, Hao M, Suminto R O et al. Why does the cloud stop computing?:Lessons from hundreds of service outages. In Proc. the 7th ACM Symposium on Cloud Computing, October 2016, pp.1-16.[14] Gunawi H S, Hao M, Leesatapornwongsa T et al. What bugs live in the cloud? A study of 3000+ issues in cloud systems. In Proc. the ACM Symposium on Cloud Computing, November 2014, pp.1-14.[15] Huang P, Guo C, Zhou L et al. Gray failure:The Achilles' heel of cloud-scale systems. In Proc. the 16th Workshop on Hot Topics in Operating Systems, May 2017, pp.150-155.[16] Zheng Z, Lan Z, Gupta R et al. A practical failure prediction with location and lead time for Blue Gene/P. In Proc. the 2010 International Conference Dependable Systems and Networks Workshops (DSN-W), June 28-July 1, 2010.[17] Sahoo R K, Oliner A J, Rish I et al. Critical event prediction for proactive management in large-scale computer clusters. In Proc. the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2003, pp.426-435.[18] Gu J, Zheng Z, Lan Z et al. Dynamic meta-learning for failure prediction in large-scale systems:A case study. In Proc. the International Conference on Parallel Processing, Sept. 2008.[19] Gainaru A, Cappello F, Snir M et al. Fault prediction under the microscope:A closer look into HPC systems. In Proc. the International Conference on High Performance Computing, Networking, Storage and Analysis, November 2012, Article No. 77.[20] Lu X, Wang H Q, Zhou R J et al. Autonomic failure prediction based on manifold learning for large-scale distributed systems. The Journal of China Universities of Posts and Telecommunications, 2010, 17(4):116-124.[21] Srikant R, Agrawal R. Mining sequential patterns:Generalizations and performance improvements. In Lecture Notes in Computer Science 1057, Apers P, Bouzeghoub M, Gardarin G (eds.), June 2005.[22] Mannila H, Toivonen H, Verkamo A I. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1997, 1(3):259-289.[23] Joshi M, Karypis G, Kumar V. A universal formulation of sequential patterns. Technical Report, No.99-021, University of Minnesota. https://www.cs.umn.edu/research/technicalreports/view/99-021, Dec. 2017.[24] Fournier-Viger P, Wu C W, Tseng V S et al. Mining sequential rules common to several sequences with the window size constraint. In Proc. the 25th Conference on Advances in Artificial Intelligence, May 2012, pp.299-304.[25] Fournier-Viger P, Wu C W, Tseng V S et al. Mining partially-ordered sequential rules common to multiple sequences. IEEE Transactions on Knowledge and Data Engineering, 27(8):2203-2216.[26] Zhang Z. Reliability Theory and Engineering Application. Beijing:Science Press, 2012. (in Chinese) |