Journal of Computer Science and Technology, 2021, Vol. 36, Issue (5): 1089-1101. doi: 10.1007/s11390-021-0852-8

Special Issue: Computer Architecture and Systems

• Special Section of 2020 CCF Integrated Circuit Design and Automation Conference •

Partial-TMR: A New Method for Protecting Register Files Against Soft Error Based on Lifetime Analysis

Xian-Geng Liang, Ying-Ke Gao*, Member, CCF, and Geng-Xin Hua, Member, CCF        

  1. Beijing Institute of Control Engineering, China Academy of Space Technology, Beijing 100090, China
  • Received: 2020-08-05 Revised: 2021-08-05 Online: 2021-09-30 Published: 2021-09-30
  • About author: Xian-Geng Liang received his B.E. degree and M.A. degree in control science and engineering from Beihang University, Beijing, in 2012 and 2015, respectively, and his Ph.D. degree in computer science from China Academy of Space Technology, Beijing, in 2020. He is currently working as an engineer at Beijing Institute of Control Engineering, China Academy of Space Technology, Beijing. His research interests include computer architecture and reliability.

High-energy particles in space can easily cause soft errors in the register file (RF). As a critical structure in a processor, the RF often stores data for long periods and is read frequently, which raises the probability that corrupted data spreads to other parts of the processor. Triple modular redundancy (TMR) is a common and effective fault tolerance method that enables multi-bit error correction. However, applying full TMR to all registers can cause excessive area and power overheads, and some registers in the RF have less impact on processor reliability, so there is no need to apply TMR to them. This paper designs an efficient strategy that rates the registers in the RF by their vulnerability. Based on this strategy, the paper proposes a new RF fault tolerance method named Partial-TMR, which selectively protects the more vulnerable registers against multi-bit errors and improves fault tolerance efficiency. For the integer RF, Partial-TMR improves soft error correction capability by 24.5% relative to the baseline system and by 3% relative to ParShield; for the floating-point RF, the improvements are 5.17% and 0.58%, respectively. The soft error correction capability of Partial-TMR is 1% to 3% lower than that of full TMR, but Partial-TMR significantly cuts the area and power overheads: compared with full TMR, it decreases them by 71.6% and 64.9%, respectively. It also has little impact on performance. Overall, Partial-TMR is a more cost-effective fault tolerance method than ParShield and full TMR.
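
As a rough illustration of the two ideas in the abstract, the sketch below shows a bitwise majority voter of the kind TMR relies on, and a simple selection of the most vulnerable registers under a protection budget. It is not the paper's implementation: the function names `tmr_vote` and `select_protected`, the `budget` parameter, and the vulnerability scores are all hypothetical, and the paper's actual lifetime-based rating strategy is not reproduced here.

```python
from __future__ import annotations


def tmr_vote(a: int, b: int, c: int) -> int:
    """Return the bitwise majority of three copies of a register value."""
    # Each output bit is 1 only if at least two of the three copies agree on 1.
    return (a & b) | (a & c) | (b & c)


def select_protected(vulnerability: dict[str, float], budget: int) -> list[str]:
    """Pick the `budget` registers with the highest vulnerability score.

    The scores are assumed inputs; how they are derived is outside this sketch.
    """
    ranked = sorted(vulnerability, key=lambda r: vulnerability[r], reverse=True)
    return ranked[:budget]


# One copy suffers a multi-bit upset; the voter still recovers the value
# because the other two copies agree on every bit.
good = 0b1011_0010
corrupted = good ^ 0b0110_0001
recovered = tmr_vote(good, corrupted, good)

# Hypothetical scores (for example, the fraction of cycles a register
# holds data that will still be read).
scores = {"r1": 0.82, "r2": 0.10, "r3": 0.55, "r4": 0.03}
protected = select_protected(scores, budget=2)
```

Under these assumed scores, only `r1` and `r3` would be triplicated, which is the kind of trade-off between coverage and overhead that the abstract describes.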

Key words: register file; soft error; lifetime analysis; selective protection; triple modular redundancy (TMR)

