Journal of Computer Science and Technology ›› 2020, Vol. 35 ›› Issue (1): 61-71. doi: 10.1007/s11390-020-9803-z

Special Issue: Computer Architecture and Systems

• Special Section on Selected I/O Technologies for High-Performance Computing and Data Analytics •

Gfarm/BB—Gfarm File System for Node-Local Burst Buffer

Osamu Tatebe1,*, Member, ACM, Shukuko Moriwake2, Yoshihiro Oyama3        

  1. Center for Computational Sciences, University of Tsukuba, Ibaraki 3058577, Japan;
  2. SURIGIKEN Co., Ltd., Tokyo 1000004, Japan;
  3. Faculty of Engineering, Information and Systems, University of Tsukuba, Ibaraki 3058573, Japan
  • Received: 2019-07-01 Revised: 2019-11-03 Online: 2020-01-05 Published: 2020-01-14
  • Contact: Osamu Tatebe E-mail:tatebe@cs.tsukuba.ac.jp
  • About author: Osamu Tatebe received his Ph.D. degree in computer science from the University of Tokyo, Tokyo, in 1997. He worked at the Electrotechnical Laboratory (ETL) and the National Institute of Advanced Industrial Science and Technology (AIST) until 2006. He is currently a professor in the Center for Computational Sciences at the University of Tsukuba, Ibaraki. His research areas are high-performance computing, data-intensive computing, and parallel and distributed system software. He is a member of ACM, IPSJ, and JSIAM.
  • Supported by:
    This work is partially supported by the JSPS KAKENHI Grant No. 17H01748, JST CREST Grant No. JPMJCR1414, New Energy and Industrial Technology Development Organization (NEDO), and Fujitsu Laboratories.

Burst buffers have become a major component in meeting the I/O performance requirements of bursty HPC traffic. This paper proposes Gfarm/BB, a file system for burst buffers that efficiently exploits node-local storage. Although node-local storage improves storage performance, it is available only during the job allocation. Gfarm/BB therefore needs to deliver better access and metadata performance while being constructed on demand before job execution. To improve read and write performance, it exploits file descriptor passing and remote direct memory access (RDMA). It improves metadata performance by omitting persistency and redundancy, since it is a temporary file system. With RDMA, write and read bandwidths improve by 1.7x and 2.2x, respectively, compared with IP over InfiniBand (IPoIB). Directory creation reaches 14 700 operations per second, which is 13.4x faster than the fully persistent and redundant case. Constructing Gfarm/BB takes 0.31 seconds using 2 nodes. The IOR benchmark and the ARGOT-IO application I/O benchmark show scalable performance improvement from exploiting the locality of node-local storage. Compared with BeeOND, Gfarm/BB shows 2.6x and 2.4x better performance in the IOR write and read benchmarks, respectively, and 2.5x better performance in ARGOT-IO.
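As a rough illustration of the file descriptor passing mentioned in the abstract, the sketch below shows the general operating-system mechanism: one side opens a file and hands the open descriptor to a peer over a Unix domain socket, after which the peer reads the file directly without relaying data through the sender. This is a generic example and not Gfarm/BB's implementation; the function name `pass_fd_demo`, the single-process socket pair, and the temporary file are assumptions made for the demonstration. It needs Python 3.9 or later on a Unix-like system.

```python
import os
import socket
import tempfile


def pass_fd_demo(payload: bytes) -> bytes:
    """Pass an open file descriptor over a Unix socket and read through it.

    Hypothetical helper for illustration only; both roles run in one
    process here, whereas a real system would use two processes.
    """
    # A connected pair of Unix domain sockets stands in for the
    # connection between a local storage server and a client.
    server_sock, client_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

    # Create a file that plays the role of data on node-local storage.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(payload)
        path = f.name

    try:
        # "Server" role: open the file and send the descriptor itself
        # (SCM_RIGHTS ancillary data), not the file contents.
        fd = os.open(path, os.O_RDONLY)
        socket.send_fds(server_sock, [b"fd"], [fd])
        os.close(fd)  # the receiver gets its own reference to the open file

        # "Client" role: receive the descriptor and read the file directly.
        _msg, fds, _flags, _addr = socket.recv_fds(client_sock, 16, 1)
        received_fd = fds[0]
        try:
            return os.read(received_fd, len(payload))
        finally:
            os.close(received_fd)
    finally:
        server_sock.close()
        client_sock.close()
        os.unlink(path)


if __name__ == "__main__":
    print(pass_fd_demo(b"hello burst buffer"))
```

Once the descriptor has been received, reads and writes go straight to the local file, which is the general reason this technique can reduce data copies for node-local access.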

Key words: burst buffer; node-local storage; on-demand file system; remote direct memory access


ISSN 1000-9000(Print)

         1860-4749(Online)
CN 11-2296/TP
