|  Ma K, Li X, Chen W et al. Green GPU: A holistic approach to energy efficiency in GPU-CPU heterogeneous architectures. In Proc. the 41st Int. Conf. Parallel Processing, September 2012, pp.48-57. Lee J, Samadi M, Park Y et al. Transparent CPU-GPU collaboration for data-parallel kernels on heterogeneous systems. In Proc. the 22nd Int. Conf. Parallel Architectures and Compilation Techniques, Sept. 2013, pp.245-255. Lee J, Kim H. TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture. In Proc. the 18th Int. Symp. High Performance Computer Architecture, February 2012, pp.91-102. Borkar S. Thousand core chips: A technology perspective. In Proc. the 44th Conf. Design Automation, June 2007, pp.746-749. Hoskote Y, Vangal S, Singh A et al. A 5-GHz mesh interconnect for a teraflops processor. IEEE Micro, 2007, 27(5): 51-61. Owens J D, Dally W J, Ho R et al. Research challenges for on-chip interconnection networks. IEEE Micro, 2007, 27(5): 96-108. Wentzlaff D, Griffin P, Hoffmann H et al. On-chip interconnection architecture of the tile processor. IEEE Micro, 2007, 27(5): 15-31. Taylor M B, Lee W, Miller J et al. Evaluation of the raw microprocessor: An exposed-wire-delay architecture for ILP and streams. ACM SIGARCH Computer Architecture News, 2004, 32(2): 2-13. Moscibroda T, Mutlu O. A case for bufferless routing in on-chip networks. ACM SIGARCH Computer Architecture News, 2009, 37(3): 196-207. Michelogiannakis G, Sanchez D, Dallv W J et al. Evaluating bufferless flow control for on-chip networks. In Proc. the 4th Int. Symp. Networks-on-Chip, May 2010, pp.9-16. Jafri S A R, Hong Y J, Thottethodi M et al. Adaptive flow control for robust performance and energy. In Proc. the 43rd Int. Symp. Microarchitecture, December 2010, pp.433-444. Nychis G P, Fallin C, Moscibroda T et al. On-chip networks from a networking perspective: Congestion and scalability in many-core interconnects. ACM SIGCOMM Computer Communication Review, 2012, 42(4): 407-418. Fallin C, Craik C, Mutlu O. CHIPPER: A low-complexity bufferless deflection router. In Proc. the 17th Int. Symp. High Performance Computer Architecture, February 2011, pp.144-155. Zhao H, Kandemir M, Ding W et al. Exploring heterogeneous NoC design space. In Proc. Int. Conf. ComputerAided Design, November 2011, pp.787-793. Nilsson E. Design and implementation of a hot-potato switch in a network on chip [Master Thesis]. Royal Institute of Technology, Sweden, 2002. Lee J, Li S, Kim H et al. Adaptive virtual channel partitioning for network-on-chip in heterogeneous architectures. ACM Trans. Design Automation of Electronic Systems, 2013, 18(4): 48:1-48:28. Kim H, Kim Y, Kim J. Clumsy flow control for highthroughput bufferless on-chip networks. IEEE Computer Architecture Letters, 2013, 12(2): 47-50. Kahng A B, Li B, Peh L S et al. ORION 2.0: A power-area simulator for interconnection networks. IEEE Trans. Very Large Scale Integration Systems, 2012, 20(1): 191-196. Henning J L. SPEC CPU2006 benchmark descriptions. ACM SIGARCH Computer Architecture News, 2006, 34(4): 1-17. Che S, Boyer M, Meng J et al. Rodinia: A benchmark suite for heterogeneous computing. In Proc. Int. Symp. Workload Characterization, October 2009, pp.44-54. Patil H, Cohn R, Charnev M et al. Pinpointing representative portions of large Intel® Itanium® programs with dynamic instrumentation. In Proc. the 37th Int. Symp. Microarchitecture, December 2004, pp.81-92. Grot B, Hestness J, Keckler S W, Multu O. Express cube topologies for on-chip interconnects. In Proc. the 15th Int. Symp. High Performance Computer Architecture, February 2009, pp.163-174. Balfour J, Dally W J, Black-Schaffer D et al. An energyefficient processor architecture for embedded systems. IEEE Computer Architecture Letters, 2008, 7(1):29-32.