Partial-TMR: A New Method for Protecting Register Files Against Soft Error Based on Lifetime Analysis

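  • Abstract: 1. Research background (Context):
    As integrated circuit technology advances, semiconductor feature sizes keep shrinking while processor clock frequencies and on-chip density keep rising. Processors therefore become more susceptible to soft errors caused by high-energy particles in the space environment. The register file, as the temporary data storage node between the cache and the processor pipeline, often holds data for long periods and is read frequently. Compared with other processor modules, the register file is more prone to soft errors, and its corrupted data propagates easily. Triple modular redundancy (TMR) is a traditional fault tolerance method that effectively improves the register file's resistance to soft errors, but a TMR design incurs large area and power overheads.
    2. Objective:
    This paper aims to design an efficient fault tolerance method for register files. The method protects only the registers that affect processor reliability and leaves the others unprotected. It thereby preserves the register file's resistance to soft errors while keeping the area and power overheads of fault tolerance as low as possible and improving the utilization of fault tolerance resources.
    3. Method:
    Based on register lifetime analysis, this paper refines the classification of register states and assigns a different protection priority to each state type. Building on these priorities, we propose a selection and replacement strategy for partial protection of the register file, which ensures that registers whose corruption would cause a processor failure are protected first. We then combine the selection and replacement strategy with a TMR design to obtain Partial-TMR, an efficient fault tolerance method for register files that provides high soft error correction capability while also reducing fault tolerance overhead.
    4. Results (Result & Findings):
    We evaluated Partial-TMR in two sets of experiments: soft error coverage tests and a fault tolerance overhead comparison. Soft error coverage tests: (1) For the integer register file, the soft error coverage of this method reaches 96.3%. This is 29% higher than the baseline system without fault tolerance and about 3% lower than the full TMR system. (2) For the floating-point register file, the soft error coverage of this method reaches 99%. This is 5% higher than the baseline system without fault tolerance and about 1% lower than the full TMR system. (3) When Partial-TMR is applied to four systems with different configurations (1-way/5-stage, 2-way/5-stage, 2-way/7-stage, and 4-way/7-stage), its soft error coverage is 99%, 96.6%, 96.2%, and 93.6%, respectively.
    Fault tolerance overhead comparison: (1) The area overhead of this method is about 28.4% of that of full TMR, and its power overhead is about 35.1% of that of full TMR. (2) With a 1 GHz reference clock, the critical path delay of the Partial-TMR system is about 0.03 ns longer than that of the baseline system.
    5. Conclusions:
    This paper proposes Partial-TMR, an efficient partial triple modular redundancy method for register files. The experimental results show that the method improves the register file's resistance to soft errors, with a soft error correction capability close to that of traditional full TMR, while reducing the area and power overheads of fault tolerance. Partial-TMR is easy to implement in practice, and applying it to the register file effectively improves the fault tolerance of the processor. The soft error correction capability of Partial-TMR correlates with the number of TMR storage entries. Determining the optimal number of TMR storage entries for a fixed processor configuration (that is, the fewest entries that achieve the highest soft error correction rate) is the direction of future work.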
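The selection and replacement strategy described under Method can be illustrated with a minimal sketch. The abstract does not name the register state types, their priorities, or the exact replacement rule, so everything below is an assumption made for illustration: the state names in `PRIORITY`, the `PartialTMRBuffer` class, and the rule "evict the lowest-priority entry only if the newcomer ranks higher" are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical lifetime states and priorities (not taken from the paper).
# A higher number means the register is assumed to matter more for reliability.
PRIORITY = {"dead": 0, "written_unread": 1, "live_read_pending": 2}


@dataclass
class PartialTMRBuffer:
    """Toy model of a small TMR-protected store that holds a subset of registers."""
    capacity: int
    entries: dict = field(default_factory=dict)  # register id -> priority

    def request_protection(self, reg: int, state: str) -> bool:
        """Try to protect `reg`; return True if it ends up in the buffer."""
        prio = PRIORITY[state]
        # Already protected, or a free entry exists: just record the priority.
        if reg in self.entries or len(self.entries) < self.capacity:
            self.entries[reg] = prio
            return True
        # Buffer is full: find the lowest-priority entry as the eviction candidate.
        victim = min(self.entries, key=self.entries.get)
        if self.entries[victim] < prio:
            del self.entries[victim]
            self.entries[reg] = prio
            return True
        return False  # newcomer does not outrank anything currently protected


buf = PartialTMRBuffer(capacity=2)
buf.request_protection(1, "dead")
buf.request_protection(2, "written_unread")
evicted_low = buf.request_protection(3, "live_read_pending")  # replaces register 1
rejected = buf.request_protection(4, "dead")                  # outranks nothing
```

In this toy model, the number of entries (`capacity`) plays the role of the TMR storage entry count that the conclusions identify as a tuning parameter.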

     

    Abstract: High-energy particles in space can easily cause soft errors in the register file (RF). As a critical structure in a processor, the RF often stores data for long periods of time and is read frequently, resulting in a higher probability of spreading corrupted data to other parts of the processor. Triple modular redundancy (TMR) is a common and effective fault tolerance method that enables multi-bit error correction. However, designing full TMR for all the registers could cause excessive area and power overheads, and some registers in the RF have less impact on processor reliability, so there is no need to design TMR for them. This paper designs an efficient strategy that rates the registers in the RF based on their vulnerability. Based on the proposed strategy, a new RF fault tolerance method named Partial-TMR is formulated in this paper, which selectively protects the more vulnerable registers against multi-bit errors and improves fault tolerance efficiency. For the integer RF, Partial-TMR improves the soft error correction capability by 24.5% relative to the baseline system and by 3% relative to ParShield, while for the floating-point RF, the improvements are 5.17% and 0.58%, respectively. The soft error correction capability of Partial-TMR is lower than that of full TMR by 1% to 3%, but Partial-TMR significantly cuts the area and power overheads. Compared with full TMR, Partial-TMR decreases the area and power overheads by 71.6% and 64.9%, respectively. It also has little impact on performance. Partial-TMR is a more cost-effective fault tolerance method than ParShield and full TMR.
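To make the multi-bit correction property of TMR concrete, here is a minimal sketch of a bitwise majority voter over three copies of a register value. This is the generic textbook voter, not the circuit from the paper; the function name `tmr_vote` and the sample values are illustrative assumptions.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority: each output bit is the value held by at least two copies."""
    return (a & b) | (a & c) | (b & c)


stored = 0b1011_0010
# Assume a particle strike flips three bits in one of the three copies.
corrupted = stored ^ 0b0100_0101

recovered = tmr_vote(stored, corrupted, stored)
```

Because the vote is taken per bit, any number of flipped bits in a single copy is corrected. The voter gives a wrong bit only when the same bit position is corrupted in two copies.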

     
