
Research on Accurate and Effective AVF Prediction for High-Efficiency Fault-Tolerant Cache Design

Accurate and Simplified Prediction of AVF for Delay and Energy Efficient Cache Design

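  • Abstract: 1. Innovations of this paper
    This paper improves the method for computing the architectural vulnerability factor (AVF) of storage structures and integrates the improved method into a general-purpose architecture-level simulator, yielding a general and accurate AVF evaluation framework, SS-SERA (Soft Error Reliability Analysis based on SimpleScalar). Using SS-SERA, we analyze the dynamic behavior and predictability of the L1 data cache (L1D) AVF both qualitatively and quantitatively, and find that during program execution the dynamic variation of L1D AVF correlates to some degree with several key performance metrics. We therefore propose using Bayesian Additive Regression Trees (BART) to build a more accurate predictive model of the dynamic behavior of L1D AVF, and compare it with two typical prediction methods, Boosted Regression Trees (BRT) and linear regression. Experimental results show that BART outperforms BRT and linear regression, to varying degrees, in both accuracy and robustness. In addition, we propose a simpler, faster and effective L1D AVF prediction strategy based on the bump hunting technique. Building on this fast and effective AVF prediction, we finally propose a dynamic fault-tolerance management strategy driven by AVF prediction, called AVF-aware ECC. The strategy compares the real-time AVF prediction against a reliability threshold and dynamically decides whether to protect the cache with ECC. Experimental results show that, compared with the traditional ECC protection strategy, AVF-aware ECC effectively reduces the latency and power overhead of ECC protection in the cache while still meeting the reliability requirement.
    2. Implementation method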
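    This paper studies AVF prediction for high-efficiency fault-tolerant cache design at the architecture level. We implement the improved AVF evaluation framework on the general-purpose architecture-level simulator SimpleScalar and use it to collect a large number of offline measurements of cache AVF and performance metrics. These offline measurements are divided into a training set and a test set. Using the training-set measurements, we build statistical models of the relationship between L1D AVF and key performance metrics with BART, BRT and linear regression respectively, apply the resulting models to predict the test set, and evaluate their accuracy and robustness through comparative analysis. We further use the bump hunting technique to build a simple and effective L1D AVF predictor. Finally, we modify the simulator appropriately to simulate and evaluate the power and latency overhead of the AVF-aware ECC strategy.
    3. Conclusions and open problems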
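    The cache is the component on a microprocessor most sensitive to soft errors, so it must be protected by fault-tolerance techniques. Traditional cache protection mechanisms (such as ECC) protect the cache throughout the entire lifetime of program execution, which incurs large performance and power overhead.
    To reduce the overhead of traditional cache fault-tolerance techniques, this paper proposes AVF-aware dynamic fault-tolerance management, which dynamically decides whether to protect the cache according to how the cache AVF varies. After comparing the accuracy and robustness of BART, BRT and linear regression models, we further propose using the bump hunting technique to build a simpler, faster and effective L1D AVF predictor, which lays the foundation for an AVF-aware, low-overhead dynamic fault-tolerance management mechanism. Finally, we implement the AVF-aware ECC strategy and quantitatively evaluate the overhead of the dynamic management mechanism based on AVF prediction. Experimental results show that, compared with traditional ECC protection, AVF-aware ECC reduces latency overhead by 35% and power overhead by 14%. In future work, we will further study low-overhead dynamic fault-tolerance techniques based on AVF prediction from a combined hardware and software perspective.
    4. Practical value and application prospects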
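    As integrated circuit technology advances, microprocessors face increasingly serious soft error problems. Developing low-overhead soft error protection techniques at the architecture level has become a hot topic in soft error research in recent years. The architectural vulnerability factor (AVF) is one of the most commonly used reliability metrics. Researchers have found that dynamically deciding whether to protect a structure according to how its AVF varies while a program runs can effectively reduce the power and performance overhead of fault tolerance. Accurate, real-time and feasible AVF evaluation is a prerequisite for AVF-aware, low-overhead dynamic fault-tolerance techniques. With the AVF evaluation framework built in this paper, together with the fast and effective AVF prediction methods based on BART and bump hunting, AVF-aware low-overhead dynamic fault-tolerant design can be applied to more structures.
    Today, the evaluation of a microprocessor is no longer limited to performance alone. Microprocessor design must consider multiple factors, including power consumption, reliability and chip area. A simple and effective AVF predictor can be integrated well into a dynamic fault-tolerance management mechanism to provide low-overhead protection. AVF-aware dynamic fault-tolerance management based on AVF prediction can be applied to the architectural design of soft-error-tolerant microprocessors, enabling a better tradeoff among reliability, performance and power.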
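    The threshold comparison behind AVF-aware ECC can be illustrated with a minimal sketch. It assumes that ECC is enabled for an execution interval only when the predicted L1D AVF exceeds the reliability threshold; the function names, the example predictions and the threshold value are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IntervalDecision:
    predicted_avf: float  # predicted L1D AVF for the interval, in [0, 1]
    ecc_enabled: bool     # whether ECC protection is switched on for the interval


def avf_aware_ecc(predicted_avfs: List[float], threshold: float) -> List[IntervalDecision]:
    """Enable ECC only for intervals whose predicted AVF exceeds the threshold.

    Assumption: "high AVF" means strictly greater than the threshold.
    """
    return [IntervalDecision(avf, avf > threshold) for avf in predicted_avfs]


def protected_fraction(decisions: List[IntervalDecision]) -> float:
    """Fraction of intervals that pay the ECC latency and power overhead."""
    if not decisions:
        return 0.0
    return sum(d.ecc_enabled for d in decisions) / len(decisions)


# Hypothetical per-interval AVF predictions and threshold, for illustration only.
example_avfs = [0.05, 0.42, 0.10, 0.35, 0.08]
example_decisions = avf_aware_ecc(example_avfs, threshold=0.30)
print([d.ecc_enabled for d in example_decisions])
print(protected_fraction(example_decisions))
```

    In this toy trace only two of the five intervals are protected, which is the source of the overhead saving relative to always-on ECC; the real saving depends on the workload and on the threshold chosen.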

     

    Abstract: With continuous technology scaling, on-chip structures are becoming more and more susceptible to soft errors. The architectural vulnerability factor (AVF) has been introduced to quantify the architectural vulnerability of on-chip structures to soft errors. Recent studies have found that designing soft error protection techniques with awareness of AVF helps greatly in achieving a tradeoff between performance and reliability for several structures (e.g., the issue queue and the reorder buffer). The cache is one of the components most susceptible to soft errors and is commonly protected with error correcting codes (ECC). However, protecting caches close to the processor (e.g., the L1 data cache (L1D)) with ECC can result in high overhead, and protecting caches without accurate knowledge of their vulnerability characteristics may lead to over-protection. Designing AVF-aware ECC is therefore attractive for designers who need to balance performance, power and reliability for the cache, especially at an early design stage. In this paper, we improve the methodology of cache AVF computation and develop a new AVF estimation framework, soft error reliability analysis based on SimpleScalar (SS-SERA). We then characterize the dynamic vulnerability behavior of the L1D and detect the correlations between L1D AVF and various performance metrics. We propose to employ Bayesian additive regression trees to accurately model the variation of L1D AVF and to quantitatively explain the important effects of several key performance metrics on L1D AVF. Next, we employ the bump hunting technique to reduce the complexity of L1D AVF prediction and extract some simple selection rules based on several key performance metrics, thus enabling a simplified and fast estimation of L1D AVF. With this estimation, intervals of high L1D AVF can be identified online, which enables us to develop the AVF-aware ECC technique to reduce the overhead of ECC. Experimental results show that, compared with the traditional ECC technique, which provides complete ECC protection throughout the entire lifetime of a program, the AVF-aware ECC technique reduces the L1D access latency by 35% and power consumption by 14% on average across the SPEC2K benchmarks.
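    As a rough illustration of what a simple selection rule might look like, the sketch below represents a rule as a "box": a conjunction of ranges over performance metrics, which is the usual form of bump hunting output. The metric names (`ipc`, `l1d_miss_rate`) and the bounds are hypothetical placeholders chosen arbitrarily; the abstract does not state which metrics or ranges the paper's rules use.

```python
from typing import Dict, List, Tuple

# A box rule: for each metric, an inclusive (low, high) range.
# Metric names and bounds are hypothetical, not taken from the paper.
Box = Dict[str, Tuple[float, float]]

HIGH_AVF_BOX: Box = {
    "ipc": (0.0, 0.8),
    "l1d_miss_rate": (0.0, 0.02),
}


def in_box(metrics: Dict[str, float], box: Box) -> bool:
    """Return True if every metric named in the box lies within its range."""
    return all(low <= metrics[name] <= high for name, (low, high) in box.items())


def flag_high_avf(intervals: List[Dict[str, float]], box: Box = HIGH_AVF_BOX) -> List[bool]:
    """Flag each execution interval that the box rule classifies as high AVF."""
    return [in_box(m, box) for m in intervals]


# Hypothetical per-interval measurements, for illustration only.
samples = [
    {"ipc": 0.5, "l1d_miss_rate": 0.01},
    {"ipc": 1.5, "l1d_miss_rate": 0.01},
    {"ipc": 0.5, "l1d_miss_rate": 0.10},
]
print(flag_high_avf(samples))
```

    Evaluating such a rule needs only a few comparisons per interval, which is why a rule of this shape is cheap enough to apply online compared with evaluating a full regression model.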

     
