2011, Vol. 26, Issue (5): 854-865. doi: 10.1007/s11390-011-0184-1

Special Issue: Computer Architecture and Systems

• Architecture and High Performance Computer Systems •

Optimizing Linpack Benchmark on GPU-Accelerated Petascale Supercomputer

Feng Wang (王锋) Member, CCF, ACM, Can-Qun Yang (杨灿群), Yun-Fei Du (杜云飞), Juan Chen (陈娟), Hui-Zhan Yi (易会战), and Wei-Xia Xu (徐炜遐)   

  1. School of Computer Science, National University of Defense Technology, Changsha 410073, China
  • Received: 2010-11-24 Revised: 2011-06-15 Online: 2011-09-05 Published: 2011-09-05
  • Contact: Feng Wang E-mail: fengwang@nudt.edu.cn, canqun@nudt.edu.cn, duyunfei@nudt.edu.cn, juanchen@nudt.edu.cn, huizhanyi@nudt.edu.cn, xuwx@nudt.edu.cn
  • About author: Feng Wang received his Bachelor's and Master's degrees both in computer science from the National University of Defense Technology, China, in 2000 and 2002 respectively. He is currently an assistant professor and pursuing his Ph.D. degree at the university. His research interests include compiler techniques for high performance, compiler optimization and verification for embedded systems, and parallel programming. He is a member of CCF and ACM.
    Can-Qun Yang received the M.S. and Ph.D. degrees both in computer science from the National University of Defense Technology (NUDT), China, in 1995 and 2008, respectively. Currently he is a professor at the university. His research interests include programming languages and compiler implementation. He is the major designer dealing with the compiler system of the TianHe Supercomputer.
    Yun-Fei Du received the B.S. degree from the Beijing Institute of Technology in 2001 and the Ph.D. degree from the National University of Defense Technology (NUDT), China, in 2008. He is currently an assistant professor at NUDT. His research interests focus on parallel and distributed systems, fault tolerance, and scientific computing.
    Juan Chen received the Ph.D. degree from the National University of Defense Technology, China. Currently she is an assistant professor at the university, and her research interests include large-scale parallel computing, low-power compilation, and GPU computing.
    Hui-Zhan Yi received the Ph.D. degree from the National University of Defense Technology, China. Currently he is an assistant professor at the university. His research interests include low-power compilation optimization and parallel programming languages.
    Wei-Xia Xu is a professor at the National University of Defense Technology, China. His research focuses on computer architecture.
  • Supported by:

    Supported by the National High Technology Research and Development 863 Program of China under Grant No. 2009AA01A128, the Major Science and Technology Project of China under Grant No. 2009ZX01036-001-003-001, and the National Natural Science Foundation of China under Grant Nos. 61003087, 60903044, 60903059, 60970033, and 60673150.

In this paper we present the programming of the Linpack benchmark on the TianHe-1 system, the first petascale supercomputer system of China and the largest GPU-accelerated heterogeneous system ever attempted. A hybrid programming model consisting of MPI, OpenMP, and streaming computing is described to exploit the task, thread, and data parallelism of Linpack. We explain how we optimized the load distribution across the CPUs and GPUs using a two-level adaptive method and describe the implementation in detail. To overcome the low bandwidth of CPU-GPU communication, we present a software pipelining technique that hides the communication overhead. Combined with other traditional optimizations, the Linpack we developed achieved 196.7 GFLOPS on a single compute element of TianHe-1. This result is 70.1% of the peak compute capability and 3.3 times faster than the result obtained using the vendor's library. On the full configuration of TianHe-1, our optimizations resulted in a Linpack performance of 0.563 PFLOPS, which made TianHe-1 the 5th fastest supercomputer on the Top500 list in November 2009.
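The abstract does not spell out the two-level adaptive method, so the following is only a minimal, single-level sketch of the general idea behind adaptive CPU/GPU load distribution: measure how fast each device processed its share, then pick the split that would make both finish at the same time. The function names (`update_gpu_fraction`, `simulate`) and the throughput figures are hypothetical and chosen for illustration; they are not taken from the paper.

```python
def update_gpu_fraction(work_gpu, time_gpu, work_cpu, time_cpu):
    """Return the GPU share that equalizes finish times at the measured rates.

    Hypothetical helper: a generic proportional split, not the paper's method.
    """
    rate_gpu = work_gpu / time_gpu  # measured GPU throughput (work units per second)
    rate_cpu = work_cpu / time_cpu  # measured CPU throughput (work units per second)
    return rate_gpu / (rate_gpu + rate_cpu)


def simulate(total_work, true_gpu_rate, true_cpu_rate, steps=5, gpu_fraction=0.5):
    """Repeatedly split the work, 'measure' both devices, and re-balance.

    The device rates are assumed constant here; on real hardware they vary
    with the block size, which is why the update is repeated every step.
    """
    history = [gpu_fraction]
    for _ in range(steps):
        work_gpu = total_work * gpu_fraction
        work_cpu = total_work - work_gpu
        time_gpu = work_gpu / true_gpu_rate  # simulated measurement
        time_cpu = work_cpu / true_cpu_rate  # simulated measurement
        gpu_fraction = update_gpu_fraction(work_gpu, time_gpu, work_cpu, time_cpu)
        history.append(gpu_fraction)
    return history


if __name__ == "__main__":
    # Illustrative rates only: a GPU five times faster than the CPU cores.
    print(simulate(total_work=1000.0, true_gpu_rate=200.0, true_cpu_rate=40.0))
```

With constant rates the split settles after one update at the ratio of GPU throughput to combined throughput. A second adaptation level, as the abstract's "two-level" wording suggests, would need additional information that the abstract does not provide.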


ISSN 1000-9000(Print)

CN 11-2296/TP

Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
E-mail: jcst@ict.ac.cn
  Copyright ©2015 JCST, All Rights Reserved