2015, Vol. 30, Issue (1): 145-162. doi: 10.1007/s11390-015-1510-9

Topic: Computer Architecture and Systems

• Special Section on Selected Paper from NPC 2011 •

Cooperative Computing Techniques for a Deeply Fused and Heterogeneous Many-Core Processor Architecture

Fang Zheng(郑方), Member, CCF, Hong-Liang Li(李宏亮), Member, CCF, Hui Lv(吕晖), Member, CCF, Feng Guo(过锋), Member, CCF, Xiao-Hong Xu(许晓红), Member, CCF, Xiang-Hui Xie(谢向辉), Senior Member, CCF   

  1. State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi 214125, China
  • Received: 2013-11-13 Revised: 2014-10-07 Online: 2015-01-05 Published: 2015-01-05
  • About author: Fang Zheng received his M.S. degree in computer science from the National Research Center of Parallel Computer Engineering and Technology, Beijing. He is currently a Ph.D. candidate in computer science at the State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi. His research interests include high performance computing and processor architecture.
  • Supported by:

    The work was supported by the National High Technology Research and Development 863 Program of China under Grant No. 2014AA01A300 and the National Science and Technology Major Project of HeGaoJi under Grant No. 2013ZX0102-8001-001-001.

Abstract: Advances in semiconductor technology have made many-core processors common in high performance computing. However, many applications still run inefficiently on them because of the memory wall, which has become a bottleneck for many-core processors. In this paper, we present a novel heterogeneous many-core processor architecture for high performance computing systems, named deeply fused many-core (DFMC). DFMC integrates management processing elements (MPEs) and computing processing elements (CPEs), heterogeneous processor cores that target different application features while sharing a unified instruction set architecture (ISA), a unified execution model, and cache-coherent shared memory. DFMC alleviates the memory wall problem through a set of cooperative computing techniques among CPEs: multi-pattern data stream transfer, an efficient register-level communication mechanism, and fast hardware synchronization. These techniques improve on-chip data reuse and optimize memory access performance. We implement a full-system prototype on FPGA with four MPEs and 256 CPEs. Experimental results show that the cooperative computing techniques are effective: DGEMM (double-precision matrix multiplication) achieves an efficiency of 94%, FFT (fast Fourier transform) reaches 207 GFLOPS, and FDTD (finite-difference time-domain) reaches 27 GFLOPS.

Copyright © Editorial Office of Journal of Computer Science and Technology