Journal of Computer Science and Technology, 2020, Vol. 35, Issue 1: 61-71. doi: 10.1007/s11390-020-9803-z

Special Issue: Computer Architecture and Systems


Gfarm/BB—Gfarm File System for Node-Local Burst Buffer

Osamu Tatebe¹,*, Member, ACM; Shukuko Moriwake²; Yoshihiro Oyama³

    1 Center for Computational Sciences, University of Tsukuba, Ibaraki 3058577, Japan;
    2 SURIGIKEN Co., Ltd., Tokyo 1000004, Japan;
    3 Faculty of Engineering, Information and Systems, University of Tsukuba, Ibaraki 3058573, Japan
  • Received: 2019-07-01 Revised: 2019-11-03 Online: 2020-01-05 Published: 2020-01-14
  • Contact: Osamu Tatebe
  • About author: Osamu Tatebe received his Ph.D. degree in computer science from the University of Tokyo, Tokyo, in 1997. He worked at the Electrotechnical Laboratory (ETL) and the National Institute of Advanced Industrial Science and Technology (AIST) until 2006. He is currently a professor in the Center for Computational Sciences at the University of Tsukuba, Ibaraki. His research areas are high-performance computing, data-intensive computing, and parallel and distributed system software. He is a member of ACM, IPSJ, and JSIAM.
  • Supported by:
    This work is partially supported by the JSPS KAKENHI Grant No. 17H01748, JST CREST Grant No. JPMJCR1414, New Energy and Industrial Technology Development Organization (NEDO), and Fujitsu Laboratories.

Burst buffers have become a major component in meeting the I/O performance requirements of bursty HPC traffic. This paper proposes Gfarm/BB, a file system for a burst buffer that efficiently exploits node-local storage. Although node-local storage improves storage performance, it is available only during the job allocation. Gfarm/BB must therefore deliver better access and metadata performance while being constructed on demand before job execution. To improve read and write performance, it exploits file descriptor passing and remote direct memory access (RDMA). It improves metadata performance by omitting persistency and redundancy, since it is a temporary file system. Using RDMA, write and read bandwidths are improved by 1.7x and 2.2x, respectively, compared with IP over InfiniBand (IPoIB). It achieves 14 700 operations per second in directory creation, which is 13.4x faster than the fully persistent and redundant case. Constructing Gfarm/BB takes 0.31 seconds using 2 nodes. The IOR benchmark and the ARGOT-IO application I/O benchmark show scalable performance improvement from exploiting the locality of node-local storage. Compared with BeeOND, Gfarm/BB shows 2.6x and 2.4x better performance in the IOR write and read benchmarks, respectively, and 2.5x better performance in ARGOT-IO.
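
The abstract names file descriptor passing as one of the techniques used to speed up local reads and writes. The general mechanism can be illustrated with a minimal Python sketch, assuming a Unix system and Python 3.9 or later: one process opens a file and hands the open descriptor to another over a Unix domain socket, so the receiver can read the file directly instead of streaming its contents through the socket. This is a generic illustration of descriptor passing, not Gfarm/BB's implementation; the function names `open_and_send`, `receive_and_read`, and `demo` are hypothetical.

```python
import os
import socket
import tempfile


def open_and_send(sock, path):
    """Server side (hypothetical): open the file and pass its descriptor."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # The descriptor travels as SCM_RIGHTS ancillary data;
        # a one-byte payload accompanies it.
        socket.send_fds(sock, [b"F"], [fd])
    finally:
        os.close(fd)  # the receiver gets its own duplicate descriptor


def receive_and_read(sock, nbytes):
    """Client side (hypothetical): receive the descriptor, read directly."""
    _msg, fds, _flags, _addr = socket.recv_fds(sock, 1, 1)
    fd = fds[0]
    try:
        return os.read(fd, nbytes)
    finally:
        os.close(fd)


def demo():
    # A connected pair of Unix domain sockets stands in for the
    # client/server connection on the same node.
    server, client = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"burst buffer data")
        path = f.name
    try:
        open_and_send(server, path)
        return receive_and_read(client, 64)
    finally:
        server.close()
        client.close()
        os.unlink(path)
```

Calling `demo()` returns the file contents read through the passed descriptor. Descriptor passing only works between processes on the same node, which is why it pairs naturally with node-local storage; remote accesses need a network path such as RDMA or IPoIB.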

Key words: burst buffer, node-local storage, on-demand file system, remote direct memory access


ISSN 1000-9000(Print)

CN 11-2296/TP

Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
  Copyright ©2015 JCST, All Rights Reserved