2015, Vol. 30, Issue 1: 74-83. doi: 10.1007/s11390-015-1505-6

Special Issue: Computer Architecture and Systems

Special Section on Computer Architecture and Systems for Big Data

Exploring Heterogeneous NoC Design Space in Heterogeneous GPU-CPU Architectures

Juan Fang1(方娟), Member, CCF, IEEE, Zhen-Yu Leng1(冷镇宇), Si-Tong Liu1(刘思彤), Zhi-Cheng Yao2(姚治成), Member, CCF, IEEE, Xiu-Feng Sui2(隋秀峰), Member, CCF, IEEE   

1 College of Computer Science, Beijing University of Technology, Beijing 100124, China;
2 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
Received: 2014-07-15; Revised: 2014-11-12; Online: 2015-01-05; Published: 2015-01-05
About the author: Juan Fang received her Ph.D. degree in computer application technology from Beijing University of Technology in 2005. She is currently an associate professor in the College of Computer Science, Beijing University of Technology. Her research interests include multi-core computing and its applications, and cloud computing.
Supported by: This work was supported by the National Natural Science Foundation of China under Grant Nos. 61202076 and 61202062.

Computer architecture is transitioning from the multicore era into the heterogeneous era. Heterogeneous architectures use on-chip networks to access shared resources, so how the network is configured is likely to have a significant impact on overall performance and power consumption. Recently, heterogeneous networks on chip (NoCs) have been proposed to achieve performance comparable to that of NoCs built from buffered routers while also reducing buffer cost and energy consumption. However, heterogeneous NoC design for heterogeneous GPU-CPU architectures has not been studied in depth. This paper first evaluates the performance and power consumption of a variety of static hot-potato-based heterogeneous NoCs with different placements of buffered and bufferless routers, which helps explore the design space for heterogeneous GPU-CPU interconnects. It then proposes Unidirectional Flow Control (UFC), a simple credit-based flow control mechanism that controls network congestion in heterogeneous NoCs for GPU-CPU architectures. UFC guarantees that buffered routers always have unoccupied entries to receive flits arriving from adjacent bufferless routers. Our evaluations show that, compared with hot-potato routing, UFC improves performance by 14.1% on average while increasing energy consumption by only 5.3% on average.
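The abstract describes UFC only at a high level: it is credit-based, and it keeps a free entry available in a buffered router for every flit that an adjacent bufferless router forwards into it. The Python sketch below illustrates generic credit-based flow control on one such bufferless-to-buffered link under stated assumptions; it is not the authors' UFC design. The class `CreditLink`, its methods `try_send` and `drain`, the fixed `buffer_depth`, and the instantaneous credit return are all hypothetical simplifications.

```python
from collections import deque


class CreditLink:
    """Hypothetical link from a bufferless router into one input port of a
    buffered router, governed by generic credit-based flow control."""

    def __init__(self, buffer_depth: int) -> None:
        # The upstream (bufferless) side starts with one credit per free
        # buffer entry at the downstream (buffered) input port.
        self.depth = buffer_depth
        self.credits = buffer_depth
        self.buffer: deque = deque()

    def try_send(self, flit) -> bool:
        """Forward a flit only if a free downstream entry is guaranteed.

        Returns False when no credit is held; in this sketch the bufferless
        router would then deflect the flit to another output (hot-potato)."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.buffer.append(flit)
        # Invariant: held credits plus occupied entries equal the buffer depth,
        # so an arriving flit never finds the buffer full.
        assert self.credits + len(self.buffer) == self.depth
        return True

    def drain(self):
        """Buffered router forwards one flit and returns a credit upstream."""
        if not self.buffer:
            return None
        flit = self.buffer.popleft()
        self.credits += 1  # assumption: credit returns with zero latency
        return flit


if __name__ == "__main__":
    link = CreditLink(buffer_depth=2)
    print([link.try_send(f) for f in ("f0", "f1", "f2")])  # [True, True, False]
    link.drain()                 # frees one entry and returns one credit
    print(link.try_send("f2"))   # True
```

In this model, a bufferless router that fails `try_send` deflects the flit instead of dropping it, which matches the hot-potato behavior of the surrounding network. Real routers receive credits back only after a round-trip delay, so the buffer depth would have to cover that latency to sustain full link throughput.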


ISSN 1000-9000(Print)

CN 11-2296/TP

Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
E-mail: jcst@ict.ac.cn
  Copyright ©2015 JCST, All Rights Reserved