2015, Vol. 30, Issue (1): 145-162. doi: 10.1007/s11390-015-1510-9

Special Issue: Computer Architecture and Systems

Cooperative Computing Techniques for a Deeply Fused and Heterogeneous Many-Core Processor Architecture

Fang Zheng(郑方), Member, CCF, Hong-Liang Li(李宏亮), Member, CCF, Hui Lv(吕晖), Member, CCF, Feng Guo(过锋), Member, CCF, Xiao-Hong Xu(许晓红), Member, CCF, Xiang-Hui Xie(谢向辉), Senior Member, CCF   

  1. State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi 214125, China
  • Received: 2013-11-13  Revised: 2014-10-07  Online: 2015-01-05  Published: 2015-01-05
  • About author: Fang Zheng received his M.S. degree in computer science from the National Research Center of Parallel Computer Engineering and Technology, Beijing. He is currently a Ph.D. candidate in computer science at the State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi. His research interests include high performance computing and processor architecture.
  • Supported by:

    The work was supported by the National High Technology Research and Development 863 Program of China under Grant No. 2014AA01A300 and the National Science and Technology Major Project of HeGaoJi under Grant No. 2013ZX0102-8001-001-001.

Owing to advances in semiconductor technology, many-core processors are now widely used in high performance computing. However, many applications still cannot run efficiently because of the memory wall, which has become a bottleneck in many-core processors. In this paper, we present a novel heterogeneous many-core processor architecture named deeply fused many-core (DFMC) for high performance computing systems. DFMC integrates management processing elements (MPEs) and computing processing elements (CPEs), heterogeneous processor cores that target different application features while sharing a unified ISA (instruction set architecture), a unified execution model, and shared memory that supports cache coherence. The DFMC processor alleviates the memory wall problem by combining a series of cooperative computing techniques among CPEs, such as multi-pattern data stream transfer, an efficient register-level communication mechanism, and a fast hardware synchronization technique. These techniques improve on-chip data reuse and optimize memory access performance. This paper describes an implementation of a full system prototype based on FPGA with four MPEs and 256 CPEs. Our experimental results show that the cooperative computing techniques of CPEs have a significant effect: DGEMM (double-precision matrix multiplication) achieves an efficiency of 94%, FFT (fast Fourier transform) reaches 207 GFLOPS, and FDTD (finite-difference time-domain) reaches 27 GFLOPS.


ISSN 1000-9000(Print)

CN 11-2296/TP

Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
E-mail: jcst@ict.ac.cn
  Copyright ©2015 JCST, All Rights Reserved