Journal of Computer Science and Technology, 2021, Vol. 36, Issue (1): 44-55. doi: 10.1007/s11390-020-0783-9

Special Issue: Computer Architecture and Systems

• Special Section on Memory-Centric System Research for High-Performance Computing •

A GPU-Accelerated In-Memory Metadata Management Scheme for Large-Scale Parallel File Systems

Zhi-Guang Chen, Member, CCF, Yu-Bo Liu, Yong-Feng Wang, and Yu-Tong Lu*, Distinguished Member, CCF        

  1. School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China
  • Received: 2020-07-05 Revised: 2020-12-30 Online: 2021-01-05 Published: 2021-01-23
  • Contact: Yu-Tong Lu
  • About author: Zhi-Guang Chen received his B.S. degree from Harbin Institute of Technology, Harbin, and his M.S. and Ph.D. degrees in computer science and technology from National University of Defense Technology, Changsha. He is an associate professor at Sun Yat-sen University, Guangzhou. His current research interests include distributed file systems, network storage, and solid-state storage systems.
  • Supported by:
    This work is supported by the National Key Research and Development Program of China under Grant No. 2018YFB0203904, the National Natural Science Foundation of China under Grant Nos. 61872392, U1811461 and 61832020, the Pearl River Science and Technology Nova Program of Guangzhou under Grant No. 201906010008, Guangdong Natural Science Foundation under Grant No. 2018B030312002, the Major Program of Guangdong Basic and Applied Research under Grant No. 2019B030302002, the Program for Guangdong Introducing Innovative and Entrepreneurial Teams under Grant No. 2016ZT06D211, and the Key-Area Research and Development Program of Guangdong Province of China under Grant No. 2019B010107001.

Driven by the increasing requirements of high-performance computing applications, supercomputers tend to contain more and more computing nodes. Applications running on such large-scale computing systems are likely to spawn millions of parallel processes, which usually generate bursts of I/O requests and pose a great challenge to the metadata management of the underlying parallel file systems. The traditional way to overcome this challenge is to deploy multiple metadata servers in a scale-out manner, which inevitably runs into serious network and consistency problems. This work instead seeks to improve metadata performance in a scale-up manner. Specifically, we propose to improve the performance of each individual metadata server by employing the GPU to handle metadata requests in parallel. Our proposal designs a novel metadata server architecture, which employs the CPU to interact with file system clients while offloading the metadata computing tasks to the GPU. To take full advantage of the parallelism available in the GPU, we redesign the in-memory data structure for the file system namespace. The new data structure fits the memory architecture of the GPU, and thus helps to exploit the large number of parallel threads within the GPU to serve bursty metadata requests concurrently. We implement a prototype based on BeeGFS and conduct extensive experiments to evaluate our proposal. The experimental results demonstrate that our GPU-based solution outperforms the CPU-based scheme by more than 50% under typical metadata operations. The advantage grows further in highly concurrent scenarios, e.g., high-performance computing systems supporting millions of parallel threads.
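The abstract does not spell out the redesigned in-memory data structure, so the following is only a hypothetical illustration of the general idea of a GPU-friendly namespace: metadata is kept in flat parallel arrays indexed by a hash of the full path, and a burst of requests is served as a batch of independent lookups. All names here (`FlatNamespace`, `lookup_batch`, the FNV-1a hash, linear probing) are assumptions made for this sketch; they are not taken from the paper or from BeeGFS. The sketch runs on the CPU in plain Python, and the per-request work in `lookup_one` stands in for what a single GPU thread might do.

```python
# Hypothetical sketch, not the paper's design: a flat, array-based namespace
# table with batched, mutually independent lookups.

EMPTY = -1  # marks an unused slot and a "not found" result


def fnv1a(path: str) -> int:
    """64-bit FNV-1a hash of a full path string."""
    h = 0xCBF29CE484222325
    for b in path.encode("utf-8"):
        h ^= b
        h = (h * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
    return h


class FlatNamespace:
    """Open-addressing hash table stored as parallel flat arrays."""

    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self.keys = [EMPTY] * capacity    # hash of the full path
        self.paths = [None] * capacity    # kept to resolve hash collisions
        self.inodes = [EMPTY] * capacity  # metadata payload (inode number only)
        self.size = 0

    def insert(self, path: str, inode: int) -> None:
        # Always leave at least one empty slot so probing terminates.
        if self.size >= self.capacity - 1:
            raise RuntimeError("table full")
        h = fnv1a(path)
        slot = h % self.capacity
        while self.keys[slot] != EMPTY and self.paths[slot] != path:
            slot = (slot + 1) % self.capacity  # linear probing
        if self.keys[slot] == EMPTY:
            self.size += 1
        self.keys[slot] = h
        self.paths[slot] = path
        self.inodes[slot] = inode

    def lookup_one(self, path: str) -> int:
        """Work for one request; reads shared arrays and writes nothing."""
        h = fnv1a(path)
        slot = h % self.capacity
        while self.keys[slot] != EMPTY:
            if self.keys[slot] == h and self.paths[slot] == path:
                return self.inodes[slot]
            slot = (slot + 1) % self.capacity
        return EMPTY

    def lookup_batch(self, paths):
        """Serve a burst of requests; each element is independent of the others."""
        return [self.lookup_one(p) for p in paths]


ns = FlatNamespace(capacity=8)
ns.insert("/home", 2)
ns.insert("/home/a.txt", 3)
results = ns.lookup_batch(["/home/a.txt", "/missing", "/home"])
print(results)  # [3, -1, 2]
```

Hashing the full path, rather than walking a directory tree one component at a time, is one way to avoid pointer chasing and to keep every lookup independent of the others. Whether the paper makes the same choice cannot be determined from the abstract alone.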

Key words: GPU-accelerated; in-memory; metadata management; parallel file system


ISSN 1000-9000(Print)

CN 11-2296/TP
