Journal of Computer Science and Technology, 2017, Vol. 32, Issue (1): 11-25. doi: 10.1007/s11390-017-1703-5

Special Issue: Computer Architecture and Systems; Computer Networks and Distributed Computing

• Special Section on Dataflow Architecture •

An Efficient Network-on-Chip Router for Dataflow Architecture

Xiao-Wei Shen1,2(申小伟), Student Member, CCF, Xiao-Chun Ye1,*(叶笑春), Member, CCF, Xu Tan1,2(谭旭), Student Member, CCF, Da Wang1(王达), Member, CCF, Lunkai Zhang3(张轮凯), Wen-Ming Li1(李文明), Member, CCF, Zhi-Min Zhang1(张志敏), Senior Member, CCF, Dong-Rui Fan1(范东睿), Senior Member, CCF, and Ning-Hui Sun1(孙凝晖), Fellow, CCF   

  1 State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
  2 School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing 100049, China;
  3 Department of Computer Science, The University of Chicago, IL 60637, U.S.A.
  • Received: 2016-09-02; Revised: 2016-12-13; Online: 2017-01-05; Published: 2017-01-05
  • Contact: Xiao-Chun Ye, E-mail: yexiaochun@ict.ac.cn
  • About author: Xiao-Wei Shen received his Bachelor's degree in computer science and technology from the School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, in 2010. He is currently a Ph.D. candidate at the University of Chinese Academy of Sciences, Beijing. His main research interests include processor micro-architecture and high-performance computer systems.
  • Supported by:

    This work was supported by the National High Technology Research and Development 863 Program of China under Grant No. 2015AA01A301, the National Natural Science Foundation of China under Grant No. 61332009, the National HeGaoJi Project of China under Grant No. 2013ZX0102-8001-001-001, and the Beijing Municipal Science and Technology Commission under Grant Nos. Z15010101009 and Z151100003615006.

Dataflow architecture has shown advantages in many high-performance computing cases. In dataflow computing, large amounts of data are frequently transferred among processing elements through the network-on-chip (NoC), so the router design has a significant impact on the performance of dataflow architecture. Common routers are designed for control-flow multi-core architectures, and we find that they are not suitable for dataflow architecture. In this work, we analyze and extract three features of data transfers in the NoCs of dataflow architecture: multiple destinations, high injection rate, and performance that is sensitive to delay. Based on these three features, we propose a novel and efficient NoC router for dataflow architecture. The proposed router supports multi-destination transfers, so data with multiple destinations can be sent in a single transfer. Moreover, the router adopts output buffering to maximize throughput and non-flit packets to minimize transfer delay. Experimental results show that the proposed router can improve the performance of dataflow architecture by 3.6x over a state-of-the-art router.
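
To illustrate why multi-destination support can matter, the following minimal sketch compares the number of link traversals needed to send the same data to several destinations as separate single-destination transfers with the number needed by one transfer that forks only where the paths diverge. It assumes a 2D mesh with dimension-ordered (XY) routing. The topology, the routing function, and all names (`xy_path`, `unicast_hops`, `multicast_hops`) are illustrative assumptions and are not taken from the proposed router design.

```python
from typing import List, Set, Tuple

Node = Tuple[int, int]    # (x, y) coordinate of a router in a hypothetical 2D mesh
Link = Tuple[Node, Node]  # directed link between two adjacent routers


def xy_path(src: Node, dst: Node) -> List[Link]:
    """Links traversed under XY routing: travel along x first, then along y."""
    links: List[Link] = []
    x, y = src
    while x != dst[0]:
        next_x = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (next_x, y)))
        x = next_x
    while y != dst[1]:
        next_y = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, next_y)))
        y = next_y
    return links


def unicast_hops(src: Node, dsts: List[Node]) -> int:
    """Total link traversals when each destination gets its own transfer."""
    return sum(len(xy_path(src, d)) for d in dsts)


def multicast_hops(src: Node, dsts: List[Node]) -> int:
    """Link traversals when one transfer is shared and forks where paths split.

    XY routing is deterministic, so the paths from one source form a tree;
    each link in the union of the paths is then traversed exactly once.
    """
    shared: Set[Link] = set()
    for d in dsts:
        shared.update(xy_path(src, d))
    return len(shared)


if __name__ == "__main__":
    source = (0, 0)
    destinations = [(3, 0), (3, 1), (3, 2)]
    print("separate transfers:", unicast_hops(source, destinations))    # 12
    print("single forked transfer:", multicast_hops(source, destinations))  # 5
```

In this toy case the three separate transfers cross 12 links in total, while the shared transfer crosses 5, because the common path along the x dimension is traversed once instead of three times. The sketch counts links only and does not model buffering, contention, or timing.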


ISSN 1000-9000 (Print), 1860-4749 (Online)
CN 11-2296/TP

Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
Tel.:86-10-62610746
E-mail: jcst@ict.ac.cn
 