2011, Vol. 26, Issue (3): 352-362. doi: 10.1007/s11390-011-1138-3

• Special Section on Advanced Computing Technology in China •

Dawning Nebulae: A PetaFLOPS Supercomputer with a Heterogeneous Structure

Ning-Hui Sun1 (孙凝辉), Member, CCF, IEEE, Jing Xing2,3 (邢晶), Zhi-Gang Huo2 (霍志刚), Member, CCF, ACM, Guang-Ming Tan1 (谭光明), Member, CCF, ACM, Jin Xiong2 (熊劲), Member, ACM, IEEE, Bo Li2,3 (李波), and Can Ma2,3 (马灿)   

    1. Key Laboratory of Computer System and Architecture, Chinese Academy of Sciences, Beijing 100190, China;
    2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
    3. Graduate University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2011-01-31 Revised: 2011-03-09 Online: 2011-05-05 Published: 2011-05-05
  • Supported by:

    This work is supported by the National Hi-Tech Research and Development 863 Program of China under Grant No. 2009AA01A129, the National Natural Science Foundation of China under Grant Nos. 60633040, 60803030, and 61033009, the National Basic Research 973 Program of China under Grant No. 2011CB302500, the National Natural Science Foundation for Distinguished Young Scholars of China under Grant No. 60925009, and the Foundation for Innovative Research Groups of the National Natural Science Foundation of China under Grant No. 60921002.

Dawning Nebulae is a heterogeneous system composed of 9280 multi-core x86 CPUs and 4640 NVIDIA Fermi GPUs. With a Linpack performance of 1.271 petaFLOPS, it ranked second in the TOP500 list released in June 2010. This paper introduces the key issues in the system design of Dawning Nebulae. It presents the system tuning methodologies aimed at a petaFLOPS Linpack result, including algorithmic optimization and communication improvement, and describes the design of the file I/O subsystem, including HVFS and the underlying DCFS3. Performance evaluations show that the Linpack efficiency of each node reaches 69.89%, and that the 1024-node aggregate read and write bandwidths exceed 100 GB/s and 70 GB/s, respectively. The success of Dawning Nebulae demonstrates the viability of the CPU/GPU heterogeneous structure for future supercomputer designs.


ISSN 1000-9000 (Print), 1860-4749 (Online)
CN 11-2296/TP

Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences