›› 2012,Vol. ›› Issue (2): 273-280.doi: 10.1007/s11390-012-1222-3

• • 上一篇    下一篇

基于任务完成概率的双模冗余检查点管理

Seong Woo Kwak1, Kwan-Ho You2, and Jung-Min Yang3   

  • 收稿日期:2011-04-12 修回日期:2011-09-27 出版日期:2012-03-05 发布日期:2012-03-05

Checkpoint Management with Double Modular Redundancy Based on the Probability of Task Completion

Seong Woo Kwak1, Kwan-Ho You2, and Jung-Min Yang3   

  1. 1. Department of Electronic Engineering, Keimyung University, Daegu 704-701, Korea;
    2. School of Information & Communication Engineering, Sungkyunkwan University, Suwon 440-746, Korea;
    3. Department of Electrical Engineering, Catholic University of Daegu, Daegu 712-702, Korea
  • Received:2011-04-12 Revised:2011-09-27 Online:2012-03-05 Published:2012-03-05

在实时系统中,容错性是一项很重要且基本的需求,而检查点和双模冗余机制是两个典型的解决方案。检查点用来保存任务的中间状态,当错误被系统检测到时,任务会被回滚到前一个状态。双模冗余则是分配两个处理器,用来并行地处理相同的任务,并将它们的结果进行比较,若发现结果不一致则证明有错误发生。本文提出了一个容错策略,将这两个机制融合在一起,不需要内置的出错检测模块和另外的处理器,便可让实时系统从暂时性或永久性的错误中恢复。此外,系统的容错性能也能够获得提高。据作者所了解,在双模冗余(DMR)框架下,在不使用多余处理器的条件下,类似的容错机制尚未被实现过。同时本文还提出了一个优化过程来选择检查点的数量:当假设错误的发生概率符合泊松分布时,该过程便为最优,可使得任务的完成概率最大,并提高整个系统的可靠性。本文提出方案的实现方法与传统的DMR方案不同:传统的DMR仅采用两个存储器来保存一个处理器的前后状态;而本文方案则分配多一个另外的存储器,来保存一个处理器最后一次成功执行后的状态,还有一个标志位用来记录当前状态是否与上一状态相同。根据这些新增加的被记录下来的状态,作者提出了一个双层比较的DMR检查点算法(包括同一处理器前后状态的比较以及两个处理器的同一状态点的比较),并证明他们的算法在不同的情景下(无论是暂时性的或永久性的错误发生),都能够适用。此外,为了建立马尔可夫模型,作者定义了四个状态,用来表示两个处理器将要执行的下一个时间间隔时的不同状态。完成建模后,作者通过数学方法,计算出不同情景下的任务完成率。最后通过对数个例子仿真,并对所得数据进行分析,获取不同情景下的最佳检查点数。本文提出的方案新颖,双模冗余和检查点的结合,提高了系统的容错性能,并避免了系统在相同的错误不断产生时执行回滚。此外,当有一个处理器发生永久性错误时,它也能够以双倍的执行时间为代价来解决该问题。此外,通过建立马尔可夫模型,本文提出的算法还能够获取检查点的最优数目,使得任务的完成概率被最大化。对于1)没有内置的错误检查模块和剩余的处理器 2)永久性错误的发生概率不可忽略 的实时系统,可以考虑使用本文提出的DMR与检查点方案去解决容错问题。而最优检查点数的选择,也可被用于提高系统的工作性能,减少执行回滚时的损失时间。

Abstract: This paper proposes a checkpoint rollback strategy for real-time systems with double modular redundancy. Without built-in fault-detection and spare processors, our scheme is able to recover from both transient and permanent faults. Two comparisons are conducted at each checkpoint. First, the states stored in two consecutive checkpoints of one processor are compared for checking integrity of the processor. The states of two processors are also compared for detecting faults and the system rolls back to the previous checkpoint whenever required by logic of the proposed scheme. A Markov model is induced by the fault recovery scheme and analyzed to provide the probability of task completion within its deadline. The optimal number of checkpoints is selected so as to maximize the probability of task completion.

[1] Young J W. A first order approximation to the optimal check-point intervals. Commun. the ACM, 1974, 17(9): 530-531.

[2] Naruse K, Umemura S, Nakagawa, S. Optimal checkpoint-ing interval for two-level recovery schemes. Computers andMathematics with Applications, 2006, 51(2): 371-376.

[3] Ziv A, Bruck J. Performance optimization of checkpointingschemes with task duplication. IEEE Transactions on Com-puters, 1997, 46(12): 1381-1386.

[4] Nakagawa S, Fukumoto S, Ishii N. Optimal checkpointingintervals for a double modular redundancy with signatures.Comput. and Math. with Applicat., 2003, 46(7): 1089-1094.

[5] Krishina C M, Shin K G. Real-Time Systems. McGraw-Hill,1997.

[6] Pradhan D K, Vaidya N H. Roll-forward checkpointingscheme: A novel fault-tolerant architecture. IEEE Tran-sactions on Computers, 1994, 43(10): 1163-1174.

[7] Ziv A, Bruck J. Analysis of checkpointing schemes with taskduplication. IEEE Trans. Computers, 1998, 47(2): 222-227.

[8] Pradhan D K, Vaidya N H. Roll-forward and rollback recov-ery: Performance-reliability trade-off. IEEE Transactions onComputers, 1997, 46(3): 372-378.

[9] Tiwari A, Tomko K A. Enhanced reliability of finite-state ma-chines in FPGA through efficient fault detection and correc-tion. IEEE Transactions on Reliability, 2005, 54(3): 459-467.

[10] Yang J M, Kwak S W. A checkpoint scheme with task du-plication considering transient and permanent fault. In Proc.IEEE Int. Conf. Industrial Engineering and EngineeringManagement (IEEM2010), Dec. 2010, pp.606-610.

[11] Karpovsky M, Su S Y H. Detection and location of input andfeedback bridging faults among input and output lines. IEEETransactions on Computers, 1980, C-29(6): 523-527.

[12] Hashizume M, Yotsuyanagi H, Tamesada T. Identification offeedback bridging faults with oscillation. In Proc. the 8thAsian Test Symposium, Nov. 1999, pp.25-30.

[13] Konuk H, Ferguson F J. Oscillation and sequential behaviorcaused by opens in the routing in digital CMOS circuits.IEEE Transactions on Computer-Aided Design of IntegratedCircuits and Systems, 1998, 17(11): 1200-1210.

[14] Berdjag D, Zolghadri A, Cieslak J, Goupil P. Fault detectionand isolation for redundant aircraft sensors. In Proc. SysTol2010, Oct. 2010, pp.137-142.

[15] Kwak S W, Choi B J, Kim B K. Optimal checkpointing strat-egy for real-time control systems under faults with exponen-tial duration. IEEE Trans. Reliability, 2001, 50(3): 293-301.
No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 李卫东; 魏道政;. Test Derivation Through Critical Path Transitions[J]. , 1992, 7(1): 12 -18 .
[2] Harald E. Otto;. UNDO, An Aid for Explorative Learning?[J]. , 1992, 7(3): 226 -236 .
[3] 顾君忠;. Modelling Enterprises with Object-Oriented Paradigm[J]. , 1993, 8(3): 80 -89 .
[4] 王晖; 刘大有; 王亚飞;. Sequential Back-Propagation[J]. , 1994, 9(3): 252 -260 .
[5] 蔡家楣;. The Sequence Modeling Method Based on ECCin Developing Program Specifications[J]. , 1999, 14(4): 337 -348 .
[6] . 基于原子 Voronoi图构建原子结构Connolly曲面的方法[J]. , 2006, 21(2): 255 -260 .
[7] . 暂缺[J]. , 2006, 21(5): 823 -837 .
[8] . 暂缺[J]. , 2006, 21(5): 708 -722 .
[9] . 对两种高速缓存错失同时作最小化[J]. , 2007, 22(4): 497 -504 .
[10] . 基于有理DMS样条体的自由变形[J]. , 2008, 23(5 ): 862 -873 .
版权所有 © 《计算机科学技术学报》编辑部
本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn
总访问量: