2013, Vol. 28, Issue (6): 1045-1053. doi: 10.1007/s11390-013-1396-3

Topics: Computer Architecture and Systems; Computer Networks and Distributed Computing

• Special Section on Selected Paper from NPC 2011 •


RevivePath:Resilient Network-on-Chip Design Through Data Path Salvaging of Router

Yin-He Han1, 2 (韩银和), Senior Member, CCF, IEEE, Member, ACM, Cheng Liu1, 2 (刘成), Hang Lu1, 2 (路航), Student Member, CCF, IEEE, Wen-Bo Li1, 2 (李文博), Lei Zhang1, 2 (张磊), Member, CCF, IEEE, and Xiao-Wei Li1, 2 (李晓维), Senior Member, CCF, IEEE   

  1. State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  2. University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2012-10-16; Revised: 2013-08-15; Online: 2013-11-05; Published: 2013-11-05
  • About the author: Yin-He Han received the B.Eng. degree from Nanjing University of Aeronautics and Astronautics, China, in 2001, and the M.Eng. and Ph.D. degrees in computer science from the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS), Beijing, in 2003 and 2006, respectively. He is currently an associate professor at ICT, CAS. His research interests include VLSI architecture design and test, especially fault-tolerant and low-power architectures. Dr. Han received the Best Paper Award at the Asian Test Symposium (ATS) 2003. He is a member of IEEE, ACM, CCF, and IEICE. He is the program chair of ATS 2014, finance chair of HPCA 2013, and program co-chair of WRTLT 2009, and has served on the technical program committees of several IEEE and ACM conferences, including HPCA 2013, ASPDAC 2013, Cool Chip 2013, ATS 2008-2010, and GVLSI 2009-2010.
  • Supported by:

    The work was supported in part by the National Basic Research 973 Program of China under Grant No. 2011CB302503, and the National Natural Science Foundation of China under Grant Nos. 61076037, 60906018, 60921002.

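Abstract (Chinese edition): Network-on-Chip (NoC), with its good scalability and high bandwidth, is regarded as a highly promising interconnect technology for future large-scale systems-on-chip. However, as feature sizes shrink and integration density grows, NoCs become less reliable. Moreover, a fault in any single node may break the connectivity of the whole network and bring it down. Redundancy is a commonly used technique for improving reliability, but in prior redundancy designs, coarse-grained redundant components give insufficient reliability, while fine-grained ones incur excessive area overhead. This paper sidesteps that trade-off. We first observe that the data-path components of an on-chip router, such as links, buffers, and crossbars, can each be divided into multiple homogeneous sub-components, which can serve as inherent redundancy. Exploiting this inherent redundancy, we propose RevivePath, a technique that keeps the whole router functional as long as any one sub-component works correctly. The control parts of the router, such as the switch arbiter and the routing computation unit, are protected with direct redundancy. Experimental results show that the method provides high reliability and achieves graceful degradation of network performance even under high fault rates.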

Abstract: Network-on-Chip (NoC), with its excellent scalability and high bandwidth, is considered the most promising communication architecture for complex integrated systems. However, NoC reliability is increasingly challenged by shrinking semiconductor feature sizes and growing integration density. Moreover, a single node failure in an NoC might destroy network connectivity and bring down the entire system. Introducing redundancy is an effective way to construct a resilient communication path. However, prior redundancy-based work either provides limited reliability with coarse-grained protection or incurs large hardware overhead with fine-grained protection. In this paper, we observe that the data path of an NoC router, such as links, buffers, and crossbars, can be divided into multiple identical parallel slices, which can be used as inherent redundancy to enhance reliability. As long as one fault-free slice remains available, the proposed salvaging scheme, named RevivePath, keeps the overall data path functional. Furthermore, RevivePath uses direct redundancy to protect the control path, such as the switch arbiter and routing computation, providing a fully fault-tolerant scheme for the whole router. Experimental results show that it achieves high reliability with graceful performance degradation even under high fault rates.
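The slice-based idea in the abstract can be illustrated with a small model. The sketch below is not the paper's implementation: it assumes, purely for illustration, that a flit is cut into as many equal chunks as the data path has slices, and that when some slices are faulty the chunks are time-multiplexed over the remaining fault-free slices across several cycles. The names `send_flit`, `num_slices`, and `faulty` are hypothetical.

```python
def send_flit(flit, num_slices, faulty):
    """Move one flit (a list of bits) across a sliced data path.

    Illustrative assumption: chunks that cannot use their own slice are
    serialized over the fault-free slices in later cycles.
    Returns (received_flit, cycles_used).
    """
    if len(flit) % num_slices != 0:
        raise ValueError("flit width must be a multiple of the slice count")

    # Slices that are still usable.
    healthy = [s for s in range(num_slices) if s not in faulty]
    if not healthy:
        raise RuntimeError("no fault-free slice left; the data path is unusable")

    # Cut the flit into one chunk per slice.
    width = len(flit) // num_slices
    chunks = [flit[i * width:(i + 1) * width] for i in range(num_slices)]

    received = [None] * num_slices
    cycles = 0
    # Each cycle carries at most len(healthy) chunks, one per healthy slice.
    for start in range(0, num_slices, len(healthy)):
        batch = range(start, min(start + len(healthy), num_slices))
        for _lane, chunk_id in zip(healthy, batch):
            received[chunk_id] = chunks[chunk_id]
        cycles += 1

    # Reassemble the chunks in their original order.
    return [bit for chunk in received for bit in chunk], cycles
```

In this model, a fault-free 4-slice path delivers a flit in one cycle, a path with two faulty slices needs two cycles, and a path with a single surviving slice needs four. That matches the graceful degradation described above: throughput drops as slices fail, but the flit still arrives intact until the last slice is lost.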

Copyright © Editorial Office of the Journal of Computer Science and Technology