Journal of Computer Science and Technology, 2020, Vol. 35, Issue (1): 72-91. doi: 10.1007/s11390-020-9797-6

Special Issue: Computer Architecture and Systems

• Special Section on Selected I/O Technologies for High-Performance Computing and Data Analytics •

GekkoFS—A Temporary Burst Buffer File System for HPC Applications

Marc-André Vef1, Nafiseh Moti1, Tim Süß1, Markus Tacke1, Tommaso Tocci2, Ramon Nou2, Alberto Miranda2, Toni Cortes2,3, André Brinkmann1, Member, ACM

  1 Zentrum für Datenverarbeitung, Johannes Gutenberg University Mainz, Mainz 55128, Germany;
  2 Barcelona Supercomputing Center, Barcelona 08034, Spain;
  3 Computer Architecture Department, Universitat Politècnica de Catalunya, Barcelona 08034, Spain
  • Received: 2019-06-30; Revised: 2019-10-03; Online: 2020-01-05; Published: 2020-01-14
  • About author: Marc-André Vef is a third-year Ph.D. candidate in André Brinkmann's research team at the Johannes Gutenberg University Mainz, Mainz. He started his Ph.D. in 2016 after receiving his B.Sc. and M.Sc. degrees in computer science from the Johannes Gutenberg University Mainz, Mainz. His master's thesis, written in cooperation with IBM Research, analyzed file create performance in the IBM Spectrum Scale parallel file system (formerly GPFS). His research interests focus on parallel and ad-hoc file systems and system analytics.
  • Supported by:
    This work has been funded by the German Research Foundation (DFG) through the Priority Programme 1648 "Software for Exascale Computing" and the ADA-FS project, and also partially supported by the Spanish Ministry of Science and Innovation under Grant No. TIN2015-65316, the Generalitat de Catalunya under Contract 2014-SGR-1051, as well as the European Union's Horizon 2020 Research and Innovation Programme, under Grant Agreement No. 671951 (NEXTGenIO).

Many scientific fields increasingly use high-performance computing (HPC) to process and analyze massive amounts of experimental data, while storage systems in today's HPC environments must cope with new access patterns. These patterns include many metadata operations, small I/O requests, and randomized file I/O, whereas general-purpose parallel file systems have been optimized for sequential shared access to large files. Burst buffer file systems create a separate file system that applications can use to store temporary data. They aggregate node-local storage available within the compute nodes or use dedicated SSD clusters, and they offer a peak bandwidth higher than that of the backend parallel file system without interfering with it. However, burst buffer file systems typically offer many features that a scientific application, running in isolation for a limited amount of time, does not require. We present GekkoFS, a temporary, highly scalable file system that has been specifically optimized for the aforementioned use cases. GekkoFS provides relaxed POSIX semantics, offering only the features that are actually required by most (not all) applications. GekkoFS is therefore able to provide scalable I/O performance and reaches millions of metadata operations per second even with a small number of nodes, significantly outperforming common parallel file systems.

Key words: distributed file system; high-performance computing (HPC); burst buffer; POSIX (portable operating system interface)

