2018, Vol. 33, Issue (1): 116-130. doi: 10.1007/s11390-017-1748-5

Topic: Computer Architecture and Systems

• Special Section on Selected Paper from NPC 2011 •

A Pipelining Loop Optimization Method for Dataflow Architecture

Xu Tan (1,2), Student Member, CCF; Xiao-Chun Ye (1,3), Member, CCF; Xiao-Wei Shen (1,2); Yuan-Chao Xu (1,4,*), Member, CCF; Da Wang (1), Member, CCF; Lunkai Zhang (5); Wen-Ming Li (1), Member, CCF; Dong-Rui Fan (1,2), Senior Member, CCF; Zhi-Min Tang (1), Distinguished Member, CCF

  1 State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
  2 School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing 100049, China;
  3 State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi 214125, China;
  4 College of Information Engineering, Capital Normal University, Beijing 100048, China;
  5 Department of Computer Science, The University of Chicago, Chicago, IL 60637, U.S.A.
  • Received: 2016-09-04; Revised: 2017-04-17; Online: 2018-01-05; Published: 2018-01-05
  • Contact: Yuan-Chao Xu, E-mail: xuyuanchao@cnu.edu.cn
  • About the author: Xu Tan received his Bachelor's degree in computer science and technology from Capital Normal University, Beijing, in 2012. He is currently a Ph.D. candidate at the Institute of Computing Technology, Chinese Academy of Sciences, Beijing. His main research interests include dataflow architecture and high-performance computer systems.
  • Supported by:

    This work was supported by the National Key Research and Development Program of China under Grant No. 2016YFB0200501, the National Natural Science Foundation of China under Grant Nos. 61332009 and 61521092, the Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing under Grant No. 2016A04, the Beijing Municipal Science and Technology Commission under Grant No. Z15010101009, the Open Project Program of the State Key Laboratory of Computer Architecture under Grant No. CARCH201503, the China Scholarship Council, and the Beijing Advanced Innovation Center for Imaging Technology.

Abstract: With the arrival of the exascale supercomputing era, power efficiency has become the most important obstacle to building an exascale system. Dataflow architecture has a native advantage in achieving high power efficiency for scientific applications. However, state-of-the-art dataflow architectures fail to exploit high parallelism in loop processing. To address this issue, we propose a pipelining loop optimization method (PLO), which lets the iterations of a loop flow through the processing element (PE) array of a dataflow accelerator. The method consists of two techniques: architecture-assisted hardware iteration and instruction-assisted software iteration. In the hardware iteration execution model, an on-chip loop controller generates the loop indexes, reducing the complexity of the computing kernel and laying a good foundation for pipelined execution. In the software iteration execution model, additional loop instructions are introduced to solve the iteration dependency problem. With these two techniques, the average number of instructions ready to execute per cycle increases, which keeps the floating-point units busy. Simulation results show that the proposed method outperforms the static and dynamic loop execution models in floating-point efficiency by 2.45x and 1.1x on average, respectively, while the hardware cost of the two techniques is acceptable.
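The benefit of letting iterations flow through the PE array can be illustrated with a toy model. The sketch below is not the paper's simulator or execution model: it assumes a loop body mapped onto a linear chain of PEs, one cycle per stage, and no dependency between iterations. The function name `simulate` and its parameters are hypothetical, and the resulting figures are unrelated to the 2.45x and 1.1x reported in the abstract.

```python
def simulate(num_iterations, num_stages, pipelined):
    """Return (total_cycles, utilization) for a loop on a chain of PEs.

    Toy assumptions (not taken from the paper): each stage takes exactly
    one cycle, and stage s of iteration i depends only on stage s-1 of
    the same iteration.
    """
    if pipelined:
        # A hypothetical loop controller injects one new index per cycle.
        start = [i for i in range(num_iterations)]
    else:
        # The next iteration waits until the previous one has left the array.
        start = [i * num_stages for i in range(num_iterations)]

    total_cycles = start[-1] + num_stages

    # Count PE-cycles in which some iteration occupies a stage.
    busy = 0
    for cycle in range(total_cycles):
        for i in range(num_iterations):
            stage = cycle - start[i]
            if 0 <= stage < num_stages:
                busy += 1

    utilization = busy / (total_cycles * num_stages)
    return total_cycles, utilization


if __name__ == "__main__":
    for mode in (False, True):
        cycles, util = simulate(num_iterations=8, num_stages=4, pipelined=mode)
        print(f"pipelined={mode}: cycles={cycles}, PE utilization={util:.2f}")
```

Under these assumptions, 8 iterations over 4 stages take 32 cycles when iterations run one at a time and 11 cycles when a new iteration enters every cycle. Real loops with cross-iteration dependencies, which the abstract's software iteration technique targets, would not pipeline this cleanly.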

Copyright © Editorial Office of the Journal of Computer Science and Technology