Journal of Computer Science and Technology, 2018, Vol. 33, Issue (1): 145-157. doi: 10.1007/s11390-017-1747-6

Topic: Computer Architecture and Systems

• Special Section on Selected Paper from NPC 2011 •

A Non-Stop Double Buffering Mechanism for Dataflow Architecture

Xu Tan (1,2), Student Member, CCF; Xiao-Wei Shen (1,2); Xiao-Chun Ye (1,3), Member, CCF; Da Wang (1), Member, CCF; Dong-Rui Fan (1,2,*), Senior Member, CCF; Lunkai Zhang (4); Wen-Ming Li (1), Member, CCF; Zhi-Min Zhang (1), Senior Member, CCF; Zhi-Min Tang (1), Distinguished Member, CCF

  1 State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
  2 School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing 100049, China;
  3 State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi 214125, China;
  4 Department of Computer Science, The University of Chicago, Chicago, IL 60637, U.S.A.
  • Received: 2016-09-02; Revised: 2017-03-13; Online: 2018-01-05; Published: 2018-01-05
  • Contact: Dong-Rui Fan, E-mail: fandr@ict.ac.cn
  • About the author: Xu Tan received his Bachelor's degree in computer science and technology from Capital Normal University, Beijing, in 2012. He is currently a Ph.D. candidate at the Institute of Computing Technology, Chinese Academy of Sciences, Beijing. His main research interests include dataflow architecture and high-performance computer systems.
  • Supported by:

    This work was supported by the National Key Research and Development Program of China under Grant No. 2016YFB0200501, the National Natural Science Foundation of China under Grant Nos. 61332009 and 61521092, the Open Project Program of State Key Laboratory of Mathematical Engineering and Advanced Computing under Grant No. 2016A04, and the Beijing Municipal Science and Technology Commission under Grant No. Z15010101009.

Abstract: Double buffering is an effective mechanism for hiding the latency of data transfers between on-chip and off-chip memory. In a dataflow architecture, however, swapping the two buffers during the execution of many tiles degrades performance, because the dataflow accelerator must be filled and drained repeatedly. In this work, we propose a non-stop double buffering mechanism for dataflow architecture. By optimizing the control logic of the dataflow architecture, the mechanism assigns tiles to the processing element array without stopping the execution of the processing elements. We also propose a work-flow program that cooperates with the non-stop double buffering mechanism. With both the control logic and the work-flow program optimized, the array needs to be filled and drained only once across the execution of all tiles belonging to the same dataflow graph. Experimental results show that the proposed double buffering mechanism achieves a 16.2% average efficiency improvement over double buffering without the optimization.
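
The effect described in the abstract can be illustrated with a rough cycle-count sketch. This is a toy model, not the paper's evaluation method: it assumes off-chip transfers are fully hidden by double buffering, and the function names and the cycle counts used in the example (`steady`, `fill`, `drain`) are hypothetical values chosen only for illustration.

```python
def conventional_cycles(num_tiles, steady, fill, drain):
    """Toy model: every tile fills the array, computes, then drains it."""
    return num_tiles * (fill + steady + drain)


def non_stop_cycles(num_tiles, steady, fill, drain):
    """Toy model: fill once, stream all tiles back to back, drain once."""
    if num_tiles == 0:
        return 0
    return fill + num_tiles * steady + drain


def utilization(num_tiles, steady, total_cycles):
    """Fraction of cycles spent in steady-state computation."""
    return num_tiles * steady / total_cycles if total_cycles else 0.0


if __name__ == "__main__":
    # Hypothetical numbers: 8 tiles, 100 steady-state cycles per tile,
    # 10 cycles to fill and 10 cycles to drain the array.
    tiles, steady, fill, drain = 8, 100, 10, 10
    for name, model in (("conventional", conventional_cycles),
                        ("non-stop", non_stop_cycles)):
        total = model(tiles, steady, fill, drain)
        print(f"{name}: {total} cycles, "
              f"utilization {utilization(tiles, steady, total):.3f}")
```

In this model the two schemes cost the same for a single tile, and the gap grows with the number of tiles that share one dataflow graph, because the per-tile fill and drain overhead is paid only once in the non-stop case.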
