2012, Issue (2): 240-255. doi: 10.1007/s11390-012-1220-5

Special Topic: Computer Architecture and Systems





PartialRC: A Partial Recomputing Method for Efficient Fault Recovery on GPGPUs

Xin-Hai Xu1 (徐新海), Student Member, CCF, ACM; Xue-Jun Yang1 (杨学军), Senior Member, CCF, Member, ACM, IEEE; Jing-Ling Xue2 (薛京灵), Senior Member, IEEE, Member, ACM; Yu-Fei Lin1 (林宇斐), Student Member, CCF, ACM; and Yi-Song Lin1 (林一松)

  1. National Laboratory for Parallel and Distributed Processing, School of Computer, National University of Defense Technology, Changsha 410073, China
  2. Programming Languages and Compilers Group, School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
  • Received: 2011-06-13; Revised: 2012-01-06; Online: 2012-03-05; Published: 2012-03-05
  • Supported by:

    This work was supported by the National Natural Science Foundation of China under Grant Nos. 60921062, 61003087, 61120106005 and 61170049.

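Abstract (Chinese version, in translation): In computer systems built from CPUs and GPUs, GPGPUs are increasingly used as accelerators for high-performance computing applications; one example is the TianHe-1A system, built last year at the National University of Defense Technology, which ranked first in the TOP500 list at the time. However, despite their performance advantages, GPGPUs provide no built-in fault-tolerance mechanisms for improving reliability, which high-performance computing applications require. By analyzing the SIMT characteristics of programs running on GPGPUs, we developed PartialRC, a new checkpoint-based, compiler-directed partial recomputing method that exploits the very high performance of GPGPUs to achieve efficient fault recovery. In this paper, we present PartialRC, which partially recomputes a code region once a computation error has been detected in it; describe a checkpoint-based fault-tolerance framework built on PartialRC; and discuss its implementation on the CUDA platform. Tests of several typical CUDA programs on NVIDIA GPGPUs show that, compared with FullRC (a traditional checkpoint-rollback-restart fault recovery method for CPUs), PartialRC significantly reduces fault recovery overhead: by 73.5% on average for errors that occur early and by 74.6% on average for errors that occur late. In addition, without adding overhead to fault-free execution, PartialRC also reduces the error detection overhead caused by full recomputing.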

Abstract: GPGPUs are increasingly being used as performance accelerators for HPC (High Performance Computing) applications in CPU/GPU heterogeneous computing systems, including TianHe-1A, the world's fastest supercomputer in the TOP500 list, built at NUDT (National University of Defense Technology) last year. However, despite their performance advantages, GPGPUs do not provide built-in fault-tolerant mechanisms to offer the reliability guarantees required by many HPC applications. By analyzing the SIMT (single-instruction, multiple-thread) characteristics of programs running on GPGPUs, we have developed PartialRC, a new checkpoint-based, compiler-directed partial recomputing method that achieves efficient fault recovery by leveraging the computing power of GPGPUs. In this paper, we introduce our PartialRC method, which recovers from errors detected in a code region by partially recomputing the region; describe a checkpoint-based fault-tolerance framework developed on PartialRC; and discuss an implementation on the CUDA platform. Validation using a range of representative CUDA programs on NVIDIA GPGPUs against FullRC (a traditional full-recomputing Checkpoint-Rollback-Restart fault recovery method for CPUs) shows that PartialRC significantly reduces the fault recovery overheads incurred by FullRC: by 73.5% on average when errors occur earlier during execution and by 74.6% on average when errors occur later. In addition, PartialRC reduces the error detection overheads incurred by FullRC during fault recovery, while incurring negligible performance overheads when no fault happens.
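The contrast between full and partial recomputing can be illustrated with a small simulation. The following is a minimal Python sketch, not the paper's implementation. It assumes, purely for illustration, that a kernel is split into independent thread blocks writing disjoint output slices, that the input is left unmodified so a block can safely be rerun, and that a detector can localize an error to a single block. All names (`run_block`, `block_ok`, `recover_full`, `recover_partial`, `demo`) and the block size are hypothetical.

```python
from typing import List

BLOCK_SIZE = 4  # elements per simulated thread block (assumed value)


def num_blocks(n: int) -> int:
    """Number of simulated thread blocks needed to cover n elements."""
    return (n + BLOCK_SIZE - 1) // BLOCK_SIZE


def block_range(block: int, n: int) -> range:
    """Indices of the output slice owned by one simulated block."""
    start = block * BLOCK_SIZE
    return range(start, min(start + BLOCK_SIZE, n))


def run_block(data: List[int], out: List[int], block: int) -> None:
    """Simulate one thread block: square its own slice of the input."""
    for i in block_range(block, len(data)):
        out[i] = data[i] * data[i]


def block_ok(data: List[int], out: List[int], block: int) -> bool:
    """Hypothetical per-block error detector (stand-in for a real checker)."""
    return all(out[i] == data[i] * data[i] for i in block_range(block, len(data)))


def recover_full(data: List[int], out: List[int]) -> int:
    """Full recomputing: rerun every block; return the number rerun."""
    blocks = num_blocks(len(data))
    for b in range(blocks):
        run_block(data, out, b)
    return blocks


def recover_partial(data: List[int], out: List[int]) -> int:
    """Partial recomputing: rerun only blocks whose check failed."""
    faulty = [b for b in range(num_blocks(len(data))) if not block_ok(data, out, b)]
    for b in faulty:
        run_block(data, out, b)
    return len(faulty)


def demo():
    data = list(range(16))
    out = [0] * len(data)
    for b in range(num_blocks(len(data))):
        run_block(data, out, b)
    out[5] = -1  # inject a simulated fault into block 1
    out_full, out_partial = out.copy(), out.copy()
    full = recover_full(data, out_full)
    partial = recover_partial(data, out_partial)
    # Both strategies must reach the same corrected result.
    return full, partial, out_full == out_partial


if __name__ == "__main__":
    print(demo())
```

In this toy setting both strategies restore the same output, but the partial strategy reruns only the block flagged by the detector. A real GPGPU implementation would also have to handle dependences between threads and inputs that the kernel overwrites, which this sketch deliberately ignores.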

Copyright © Editorial Office of the Journal of Computer Science and Technology