Ning-Hui Sun, Jing Xing, Zhi-Gang Huo, Guang-Ming Tan, Jin Xiong, Bo Li, Can Ma. Dawning Nebulae: A PetaFLOPS Supercomputer with a Heterogeneous Structure[J]. Journal of Computer Science and Technology, 2011, 26(3): 352-362. DOI: 10.1007/s11390-011-1138-3

Dawning Nebulae: A PetaFLOPS Supercomputer with a Heterogeneous Structure

Funds: This work is supported by the National Hi-Tech Research and Development 863 Program of China under Grant No. 2009AA01A129, the National Natural Science Foundation of China under Grant Nos. 60633040, 60803030, and 61033009, the National Basic Research 973 Program of China under Grant No. 2011CB302500, the National Natural Science Foundation for Distinguished Young Scholars of China under Grant No. 60925009, and the Foundation for Innovative Research Groups of the National Natural Science Foundation of China under Grant No. 60921002.
More Information
  • Received Date: January 30, 2011
  • Revised Date: March 08, 2011
  • Published Date: May 04, 2011
  • Dawning Nebulae is a heterogeneous system composed of 9280 multi-core x86 CPUs and 4640 NVIDIA Fermi GPUs. With a Linpack performance of 1.271 petaFLOPS, it ranked second on the TOP500 list released in June 2010. This paper introduces key issues in the system design of Dawning Nebulae. It presents the system tuning methodologies aimed at a petaFLOPS Linpack result, including algorithmic optimization and communication improvement. It also describes the design of the file I/O subsystem, including HVFS and the underlying DCFS3. Performance evaluations show that the Linpack efficiency of each node reaches 69.89%, and that 1024-node aggregate read and write bandwidths exceed 100 GB/s and 70 GB/s, respectively. The success of Dawning Nebulae has demonstrated the viability of a CPU/GPU heterogeneous structure for future supercomputer designs.