Journal of Computer Science and Technology ›› 2021, Vol. 36 ›› Issue (1): 44-55. DOI: 10.1007/s11390-020-0783-9

Special Topic: Computer Architecture and Systems


A GPU-Accelerated In-Memory Metadata Management Scheme for Large-Scale Parallel File Systems

Zhi-Guang Chen, Member, CCF, Yu-Bo Liu, Yong-Feng Wang, and Yu-Tong Lu*, Distinguished Member, CCF        

  1. School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China
  • Received: 2020-07-05  Revised: 2020-12-30  Online: 2021-01-05  Published: 2021-01-23
  • Contact: Yu-Tong Lu  E-mail: yutong.lu@nscc-gz.cn
  • About author: Zhi-Guang Chen received his B.S. degree from Harbin Institute of Technology, Harbin, and his M.S. and Ph.D. degrees in computer science and technology from National University of Defense Technology, Changsha. He is an associate professor at Sun Yat-sen University, Guangzhou. His current research interests include distributed file systems, network storage, and solid-state storage systems.
  • Supported by:
    This work is supported by the National Key Research and Development Program of China under Grant No. 2018YFB0203904, the National Natural Science Foundation of China under Grant Nos. 61872392, U1811461, and 61832020, the Pearl River Science and Technology Nova Program of Guangzhou under Grant No. 201906010008, the Guangdong Natural Science Foundation under Grant No. 2018B030312002, the Major Program of Guangdong Basic and Applied Research under Grant No. 2019B030302002, the Program for Guangdong Introducing Innovative and Entrepreneurial Teams under Grant No. 2016ZT06D211, and the Key-Area Research and Development Program of Guangdong Province of China under Grant No. 2019B010107001.

Metadata has long been a major bottleneck of file systems. To improve metadata performance, parallel file systems have gradually turned to distributed metadata management schemes. We argue that distributed metadata management still has shortcomings in consistency and reliability, and that, by contrast, the metadata performance of a single node leaves considerable room for improvement. As the I/O performance of storage devices keeps improving, the primary cause of the metadata bottleneck is shifting from I/O to computation. Against this background, we propose a GPU-based metadata acceleration scheme. Specifically, we design a new metadata server architecture composed of three parts: a CPU, a GPU, and an SSD. The CPU interacts with clients: it receives metadata requests, packs them into batches, and passes the batches to the GPU. The GPU holds all metadata in its memory; upon receiving a batch of requests from the CPU, it launches a large number of concurrent threads to carry out the metadata computation, and once the requests are processed it returns the results to the CPU, which forwards them to the clients. To guarantee persistence, metadata is saved on the SSD as a log combined with checkpoints. To raise the computational efficiency of the concurrent GPU threads, we redesign the in-memory metadata structure so that it efficiently supports the GPU's SIMT computation. We implemented the GPU-based metadata acceleration system on top of BeeGFS. Experiments show that the GPU-based scheme significantly outperforms CPU-based metadata management, and the advantage is especially pronounced under concurrent access by a large number of clients. In summary, targeting high-performance computing scenarios, this paper proposes a new metadata management scheme that uses the high concurrency of the GPU to markedly relieve the bottleneck effect of the computing component in metadata management, thereby substantially improving single-node metadata performance. Notably, this work does not conflict with distributed metadata management: the developed system can be integrated directly into a metadata cluster.
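To make the batching pipeline concrete, below is a minimal CUDA sketch of the CPU/GPU division of labor described above: the CPU packs a batch of metadata lookups, ships it to the GPU, and a kernel serves one request per thread against a GPU-resident table. This is a toy under our own assumptions, not the authors' implementation; every name (MetaRequest, lookup_batch, the open-addressed hash table) is illustrative.

    // Sketch only (not the paper's code): one GPU thread per metadata request,
    // with the lookup table resident in GPU memory.
    #include <cstdio>
    #include <cuda_runtime.h>

    constexpr int BATCH = 1024;   // requests per CPU->GPU round trip
    constexpr int TABLE = 4096;   // slots in the toy GPU-resident inode table

    struct MetaRequest { unsigned int path_hash; };  // path pre-hashed on the CPU
    struct MetaReply   { int inode; };               // -1 signals "not found"

    __device__ unsigned int d_keys[TABLE];   // 0 marks an empty slot
    __device__ int          d_inodes[TABLE];

    // One thread per request: the whole batch is served in a single SIMT pass.
    __global__ void lookup_batch(const MetaRequest *req, MetaReply *rep, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        unsigned int h = req[i].path_hash;
        rep[i].inode = -1;
        for (int p = 0; p < TABLE; ++p) {            // linear probing
            int s = (int)((h + p) % TABLE);
            if (d_keys[s] == h) { rep[i].inode = d_inodes[s]; return; }
            if (d_keys[s] == 0) return;              // empty slot reached: a miss
        }
    }

    int main() {
        // CPU side: pack a batch of (fake) hashed paths for the GPU.
        MetaRequest *req; MetaReply *rep;
        cudaMallocManaged(&req, BATCH * sizeof(MetaRequest));
        cudaMallocManaged(&rep, BATCH * sizeof(MetaReply));
        for (int i = 0; i < BATCH; ++i) req[i].path_hash = 1u + (unsigned)i;

        // Populate the GPU table so that even-numbered requests hit.
        static unsigned int keys[TABLE]; static int inodes[TABLE];
        for (int i = 0; i < BATCH; i += 2) {
            int s = (1 + i) % TABLE;
            keys[s] = 1u + (unsigned)i; inodes[s] = 100 + i;
        }
        cudaMemcpyToSymbol(d_keys, keys, sizeof(keys));
        cudaMemcpyToSymbol(d_inodes, inodes, sizeof(inodes));

        lookup_batch<<<(BATCH + 255) / 256, 256>>>(req, rep, BATCH);
        cudaDeviceSynchronize();
        printf("request 0 -> inode %d, request 1 -> inode %d\n",
               rep[0].inode, rep[1].inode);  // expect a hit (100) and a miss (-1)
        return 0;
    }

In a real server, the CPU side would be the file system's RPC loop, batches would presumably be double-buffered on CUDA streams to overlap transfer with compute, and updates would additionally be appended to the SSD-resident log described above.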


Abstract: Driven by the increasing requirements of high-performance computing applications, supercomputers keep growing to larger and larger numbers of computing nodes. Applications running on such a large-scale computing system are likely to spawn millions of parallel processes, which usually generate a burst of I/O requests, posing a great challenge to the metadata management of the underlying parallel file system. The traditional method of overcoming this challenge is to add metadata servers in a scale-out manner, which inevitably runs into serious network and consistency problems. This work instead pursues metadata performance in a scale-up manner: we improve the performance of each individual metadata server by employing a GPU to handle metadata requests in parallel. We design a novel metadata server architecture in which the CPU interacts with file system clients while the metadata-related computation is offloaded to the GPU. To take full advantage of the parallelism available in the GPU, we redesign the in-memory data structure for the file system namespace. The new data structure fits the memory architecture of the GPU and thus allows the GPU's large number of parallel threads to serve bursty metadata requests concurrently. We implement a prototype based on BeeGFS and conduct extensive experiments to evaluate our proposal. The results demonstrate that our GPU-based solution outperforms the CPU-based scheme by more than 50% under typical metadata operations, and the advantage grows further in highly concurrent scenarios, e.g., high-performance computing systems supporting millions of parallel threads.
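The redesigned namespace structure is the crux of the proposal. The abstract does not spell out its layout, but a common way to make such a structure fit the GPU's SIMT execution, sketched below with assumed field names, is to move from an array-of-structures to a structure-of-arrays: the 32 threads of a warp that read the same field of 32 neighboring entries then issue one coalesced memory transaction instead of 32 scattered ones. The fragment compiles as a CUDA translation unit but is a layout illustration, not the authors' code.

    // Illustrative layouts only; field names are our assumptions.
    #include <cstdint>

    constexpr int MAX_ENTRIES = 1 << 20;

    // CPU-style record: the fields of one entry are adjacent, so warp threads
    // reading entries i, i+1, ... touch memory strided by sizeof(EntryAoS).
    struct EntryAoS {
        uint64_t parent_inode;  // inode of the containing directory
        uint64_t name_hash;     // hash of the final path component
        uint64_t inode;
        uint32_t mode;
        uint64_t size;
    };

    // GPU-friendly namespace: one array per field. Thread i of a warp reads
    // name_hash[i] while its neighbors read adjacent words -- one coalesced
    // transaction per warp instead of 32 scattered cache lines.
    struct NamespaceSoA {
        uint64_t parent_inode[MAX_ENTRIES];
        uint64_t name_hash[MAX_ENTRIES];
        uint64_t inode[MAX_ENTRIES];
        uint32_t mode[MAX_ENTRIES];
        uint64_t size[MAX_ENTRIES];
    };

    // A lookup kernel then streams only the two arrays it actually needs.
    __global__ void find_child(const NamespaceSoA *ns, uint64_t parent,
                               uint64_t hash, int n, int *slot) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && ns->parent_inode[i] == parent && ns->name_hash[i] == hash)
            *slot = i;  // at most one entry matches, so the write race is benign
    }

The same idea extends to whatever index (hash table or tree) sits over these arrays; what matters is that the threads of a warp touch adjacent words when they execute the same instruction.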

Key words: GPU-accelerated, in-memory, metadata management, parallel file system
