2015, Vol. 30, Issue (1): 74-83. doi: 10.1007/s11390-015-1505-6

Topic: Computer Architecture and Systems

• Special Section on Selected Paper from NPC 2011 •

Exploring Heterogeneous NoC Design Space in Heterogeneous GPU-CPU Architectures

Juan Fang1(方娟), Member, CCF, IEEE, Zhen-Yu Leng1(冷镇宇), Si-Tong Liu1(刘思彤), Zhi-Cheng Yao2(姚治成), Member, CCF, IEEE, Xiu-Feng Sui2(隋秀峰), Member, CCF, IEEE   

  1 College of Computer Science, Beijing University of Technology, Beijing 100124, China;
  2 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2014-07-15 Revised: 2014-11-12 Online: 2015-01-05 Published: 2015-01-05
  • About author: Juan Fang received her Ph.D. degree in computer application technology from Beijing University of Technology in 2005. Currently, she is an associate professor in the College of Computer Science, Beijing University of Technology. Her research interests include multi-core computing and its application technology, and cloud computing.
  • Supported by: This work was supported by the National Natural Science Foundation of China under Grant Nos. 61202076 and 61202062.

Abstract: Computer architecture is transitioning from the multicore era into the heterogeneous era. Heterogeneous architectures use on-chip networks to access shared resources, so how a network is configured is likely to have a significant impact on overall performance and power consumption. Recently, heterogeneous networks on chip (NoCs) have been proposed to achieve performance comparable to that of NoCs with buffered routers while reducing buffer cost and energy consumption. However, heterogeneous NoC design for heterogeneous GPU-CPU architectures has not been studied in depth. This paper first evaluates the performance and power consumption of a variety of static hot-potato-based heterogeneous NoCs with different placements of buffered and bufferless routers, which helps explore the design space for heterogeneous GPU-CPU interconnection. It then proposes Unidirectional Flow Control (UFC), a simple credit-based flow control mechanism that controls network congestion in heterogeneous NoCs for GPU-CPU architectures. UFC guarantees that buffered routers always have unoccupied entries to receive flits coming from adjacent bufferless routers. Our evaluations show that, compared with hot-potato routing, UFC improves performance by 14.1% on average while increasing energy consumption by only 5.3% on average.
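
The abstract does not spell out how UFC works internally, so the following is only a minimal sketch of generic credit-based flow control between an upstream sender (for example, a bufferless router) and one input port of a buffered router. It is not the authors' implementation. All names (`BufferedRouterPort`, `UpstreamSender`, `simulate`) and parameters (`depth`, `drain_every`) are hypothetical and chosen for illustration. The idea shown is that the sender holds one credit per free downstream buffer entry, spends a credit per flit sent, and regains a credit when the downstream port frees an entry, so a flit is never delivered to a full buffer.

```python
from collections import deque


class BufferedRouterPort:
    """Hypothetical input port of a buffered router with a fixed buffer depth."""

    def __init__(self, depth):
        self.depth = depth
        self.buffer = deque()

    def accept(self, flit):
        # With correct credit accounting this assertion can never fire.
        assert len(self.buffer) < self.depth, "buffer overflow"
        self.buffer.append(flit)

    def drain(self):
        """Forward one flit downstream; return it, or None if the buffer is empty."""
        return self.buffer.popleft() if self.buffer else None


class UpstreamSender:
    """Hypothetical upstream side holding credits for one downstream port."""

    def __init__(self, port):
        self.port = port
        self.credits = port.depth  # one credit per free downstream entry

    def try_send(self, flit):
        if self.credits == 0:
            return False  # no free entry: the caller must deflect or retry
        self.credits -= 1
        self.port.accept(flit)
        return True

    def return_credit(self):
        # Called when the downstream port frees a buffer entry.
        self.credits += 1


def simulate(depth, flits, drain_every):
    """Offer one flit per cycle; the port drains one flit every `drain_every` cycles.

    Returns (sent, refused) counts. Assumption: credits return instantly.
    """
    port = BufferedRouterPort(depth)
    sender = UpstreamSender(port)
    sent = refused = 0
    for cycle, flit in enumerate(flits):
        if sender.try_send(flit):
            sent += 1
        else:
            refused += 1
        if (cycle + 1) % drain_every == 0 and port.drain() is not None:
            sender.return_credit()
    return sent, refused


if __name__ == "__main__":
    print(simulate(depth=2, flits=range(6), drain_every=2))
```

In this toy setting, a port that drains only every second cycle refuses some flits once its two entries fill, while a port that drains every cycle accepts all of them. A real NoC would also model credit-return latency and what happens to a refused flit (in a hot-potato network, deflection), both of which are omitted here.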

© Editorial Office of the Journal of Computer Science and Technology