Xiao-Wei Shen, Xiao-Chun Ye, Xu Tan, Da Wang, Lunkai Zhang, Wen-Ming Li, Zhi-Min Zhang, Dong-Rui Fan, Ning-Hui Sun. An Efficient Network-on-Chip Router for Dataflow Architecture[J]. Journal of Computer Science and Technology, 2017, 32(1): 11-25. DOI: 10.1007/s11390-017-1703-5

An Efficient Network-on-Chip Router for Dataflow Architecture

Funds: This work was supported by the National High Technology Research and Development 863 Program of China under Grant No. 2015AA01A301, the National Natural Science Foundation of China under Grant No. 61332009, the National HeGaoJi Project of China under Grant No. 2013ZX0102-8001-001-001, and the Beijing Municipal Science and Technology Commission under Grant Nos. Z15010101009 and Z151100003615006.
  • Author Bio:

    Xiao-Wei Shen received his Bachelor's degree in computer science and technology from the School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, in 2010. He is currently a Ph.D. candidate at the University of Chinese Academy of Sciences, Beijing. His main research interests include processor micro-architecture and high-performance computer systems.

  • Corresponding author:

    Xiao-Chun Ye E-mail: yexiaochun@ict.ac.cn

  • Received Date: September 01, 2016
  • Revised Date: December 12, 2016
  • Published Date: January 04, 2017
  • Abstract: Dataflow architecture has shown its advantages in many high-performance computing cases. In dataflow computing, a large amount of data is frequently transferred among processing elements through the network-on-chip (NoC), so the router design has a significant impact on the performance of dataflow architecture. Common routers are designed for control-flow multi-core architectures, and we find that they are not suitable for dataflow architecture. In this work, we analyze and extract the features of data transfers in NoCs of dataflow architecture: multiple destinations, a high injection rate, and performance that is sensitive to delay. Based on these three features, we propose a novel and efficient NoC router for dataflow architecture. The proposed router supports multi-destination packets, so it can deliver data to multiple destinations in a single transfer. Moreover, the router adopts output buffers to maximize throughput and non-flit packets to minimize transfer delay. Experimental results show that the proposed router can improve the performance of dataflow architecture by 3.6x over a state-of-the-art router.
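
The benefit of delivering data to multiple destinations in a single transfer can be illustrated with a small counting exercise. The sketch below is not the paper's router: it assumes a 2D mesh and XY (dimension-order) routing, neither of which is stated in the abstract, and every function name is hypothetical. It compares how many link traversals are needed when one packet is sent per destination versus when a single packet is replicated only where the routes diverge.

```python
# Hypothetical sketch: 2D mesh with XY (dimension-order) routing.
# The topology, routing rule, and all names here are assumptions for
# illustration; they are not taken from the paper.

def xy_next_hop(cur, dst):
    """Return the neighbor of `cur` on the XY route toward `dst`."""
    (cx, cy), (dx, dy) = cur, dst
    if cx != dx:                      # travel along X first
        return (cx + (1 if dx > cx else -1), cy)
    return (cx, cy + (1 if dy > cy else -1))  # then along Y


def unicast_link_traversals(src, dests):
    """One separate packet per destination: sum of Manhattan distances."""
    return sum(abs(src[0] - x) + abs(src[1] - y) for x, y in dests)


def multicast_link_traversals(src, dests):
    """One multi-destination packet: each shared link is used only once."""
    links = set()
    for dst in dests:
        cur = src
        while cur != dst:
            nxt = xy_next_hop(cur, dst)
            links.add((cur, nxt))     # a link shared by several routes counts once
            cur = nxt
    return len(links)


src = (0, 0)
dests = [(3, 0), (3, 1), (3, 2)]
print(unicast_link_traversals(src, dests))    # 3 + 4 + 5 = 12
print(multicast_link_traversals(src, dests))  # 3 shared hops + 2 extra = 5
```

Under these assumptions the three destinations share the first three hops, so the single multi-destination packet crosses 5 links where three unicast packets cross 12. The actual savings in the proposed router depend on its topology, routing, and buffering details, which the abstract does not give.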