Journal of Computer Science and Technology, 2020, Vol. 35, Issue (1): 61-71. doi: 10.1007/s11390-020-9803-z

Topic: Computer Architecture and Systems


Gfarm/BB—Gfarm File System for Node-Local Burst Buffer

Osamu Tatebe1,*, Member, ACM, Shukuko Moriwake2, Yoshihiro Oyama3        

  1 Center for Computational Sciences, University of Tsukuba, Ibaraki 3058577, Japan;
  2 SURIGIKEN Co., Ltd., Tokyo 1000004, Japan;
  3 Faculty of Engineering, Information and Systems, University of Tsukuba, Ibaraki 3058573, Japan
  • Received: 2019-07-01 Revised: 2019-11-03 Online: 2020-01-05 Published: 2020-01-14
  • Contact: Osamu Tatebe, E-mail: tatebe@cs.tsukuba.ac.jp
  • About author: Osamu Tatebe received his Ph.D. degree in computer science from the University of Tokyo, Tokyo, in 1997. He worked at the Electrotechnical Laboratory (ETL) and the National Institute of Advanced Industrial Science and Technology (AIST) until 2006. He is currently a professor in the Center for Computational Sciences at the University of Tsukuba, Ibaraki. His research areas are high-performance computing, data-intensive computing, and parallel and distributed system software. He is a member of ACM, IPSJ, and JSIAM.
  • Supported by: This work is partially supported by the JSPS KAKENHI Grant No. 17H01748, JST CREST Grant No. JPMJCR1414, New Energy and Industrial Technology Development Organization (NEDO), and Fujitsu Laboratories.

Abstract: Burst buffers have become a major component in meeting the I/O performance requirements of bursty HPC traffic. This paper proposes Gfarm/BB, a file system for burst buffers that efficiently exploits node-local storage. Although node-local storage improves storage performance, it is available only during the job allocation. Gfarm/BB must therefore deliver better access and metadata performance while being constructed on demand before job execution. To improve read and write performance, it exploits file descriptor passing and remote direct memory access (RDMA). Because it is a temporary file system, it improves metadata performance by omitting persistency and redundancy. With RDMA, write and read bandwidths improve by 1.7x and 2.2x, respectively, compared with IP over InfiniBand (IPoIB). Directory creation reaches 14,700 operations per second, which is 13.4x faster than the fully persistent and redundant case. Constructing Gfarm/BB takes 0.31 seconds using 2 nodes. The IOR benchmark and the ARGOT-IO application I/O benchmark show scalable performance improvement from exploiting the locality of node-local storage. Compared with BeeOND, Gfarm/BB performs 2.6x and 2.4x better in the IOR write and read benchmarks, respectively, and 2.5x better in ARGOT-IO.

Key words: burst buffer, node-local storage, on-demand file system, remote direct memory access
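
The abstract names file descriptor passing as one of the techniques used to speed up local reads and writes. As a rough illustration of the operating-system mechanism only, and not of Gfarm/BB's actual code, the sketch below passes an open file descriptor over a Unix domain socket so that the receiver can read the file directly instead of relaying data through the sender. It assumes Python 3.9 or later on a Unix-like system; the function name `read_via_passed_fd` and the `daemon`/`app` roles are hypothetical, and both ends run in one process for brevity.

```python
import os
import socket
import tempfile


def read_via_passed_fd(payload: bytes) -> bytes:
    # Hypothetical roles: `daemon` stands in for a storage server process,
    # `app` for the application; both ends live in one process for brevity.
    daemon, app = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        with tempfile.TemporaryFile() as f:
            f.write(payload)
            f.flush()
            # Daemon side: send the open descriptor as SCM_RIGHTS ancillary data.
            socket.send_fds(daemon, [b"fd"], [f.fileno()])
            # App side: the kernel installs a new descriptor for the same open file.
            _msg, fds, _flags, _addr = socket.recv_fds(app, 16, 1)
        # The daemon's handle is closed now; the received descriptor stays valid.
        fd = fds[0]
        try:
            # pread ignores the shared file offset and reads from byte 0.
            return os.pread(fd, len(payload), 0)
        finally:
            os.close(fd)
    finally:
        daemon.close()
        app.close()
```

Once the descriptor has been received, every read or write goes straight to the local file through the kernel, which is the general reason this mechanism can avoid an extra copy through a server process for node-local data.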
