Journal of Computer Science and Technology, 2020, 35(1): 121-144. DOI: 10.1007/s11390-020-9802-0

Special Issue: Computer Architecture and Systems

• Special Section on Selected I/O Technologies for High-Performance Computing and Data Analytics •

Mochi: Composing Data Services for High-Performance Computing Environments

Robert B. Ross1, George Amvrosiadis2, Philip Carns1, Charles D. Cranor2, Matthieu Dorier1, Kevin Harms1, Greg Ganger2, Garth Gibson3, Samuel K. Gutierrez4, Robert Latham1, Bob Robey4, Dana Robinson5, Bradley Settlemyer4, Galen Shipman4, Shane Snyder1, Jerome Soumagne5, Qing Zheng2        

  1 Argonne National Laboratory, Lemont, IL 60439, U.S.A.;
    2 Parallel Data Laboratory, Carnegie Mellon University, Pittsburgh, PA 15213, U.S.A.;
    3 Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada;
    4 Los Alamos National Laboratory, Los Alamos, NM, U.S.A.;
    5 The HDF Group, Champaign, IL, U.S.A.
  • Received: 2019-07-01; Revised: 2019-11-02; Online: 2020-01-05; Published: 2020-01-14
  • About author: Robert B. Ross is a senior computer scientist at Argonne National Laboratory, Lemont, and a senior fellow at the Northwestern-Argonne Institute for Science and Engineering at Northwestern University, Evanston. Dr. Ross's research interests are in system software and architectures for high-performance computing and data analysis systems, in particular storage systems and software for I/O and message passing. He received his Ph.D. degree in computer engineering from Clemson University in 2000 and was a recipient of the 2004 Presidential Early Career Award for Scientists and Engineers.
  • Supported by:
    This work is supported in part by the Director, Office of Advanced Scientific Computing Research, Office of Science, of the U.S. Department of Energy under Contract No. DE-AC02-06CH11357; in part by the Exascale Computing Project under Grant No. 17-SC-20-SC, a joint project of the U.S. Department of Energy's Office of Science and National Nuclear Security Administration responsible for delivering a capable exascale ecosystem, including software, applications, and hardware technology, to support the nation's exascale computing imperative; and in part by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Scientific Discovery through Advanced Computing (SciDAC) program.

Technology enhancements and the growing breadth of application workflows running on high-performance computing (HPC) platforms drive the development of new data services that provide high performance on these new platforms, offer capable and productive interfaces and abstractions for a variety of applications, and are readily adapted when new technologies are deployed. The Mochi framework enables the composition of specialized distributed data services from a collection of connectable modules and subservices. Rather than forcing all applications to use a one-size-fits-all data staging and I/O software configuration, Mochi allows each application to use a data service specialized to its needs and access patterns. This paper introduces the Mochi framework and methodology, describes the Mochi core components and microservices, and details four specialized services developed with the Mochi methodology as examples. It then evaluates the performance of a Mochi core component, a Mochi microservice, and a composed service providing an object model. The paper concludes by positioning Mochi relative to related work in the HPC space and indicating directions for future work.

Key words: storage and I/O; data-intensive computing; distributed services; high-performance computing
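
To make the composition model described in the abstract concrete, below is a minimal sketch of what a Mochi-style service provider can look like when built on Margo, the Mochi project's RPC and threading layer combining Mercury and Argobots. The sketch follows Margo's publicly documented C API, but it is an illustration rather than code from the paper: the RPC name "hello", the handler hello_world, and the choice of the shared-memory transport "na+sm" are assumptions made for this example.

    /* Minimal sketch of a Mochi-style service provider using Margo
     * (Mercury RPC + Argobots threading). Illustrative only; the RPC
     * name "hello" and its handler are hypothetical. */
    #include <stdio.h>
    #include <margo.h>

    /* RPC handler: runs in an Argobots user-level thread whenever a
     * client invokes the "hello" RPC. */
    static void hello_world(hg_handle_t h)
    {
        printf("Hello from a composed Mochi service!\n");
        margo_destroy(h);  /* release the handle; no response is sent */
    }
    DEFINE_MARGO_RPC_HANDLER(hello_world)

    int main(void)
    {
        /* Start a Margo instance in server mode over shared memory
         * ("na+sm"); 0 = no dedicated progress thread, -1 = run RPC
         * handlers on the context's own execution stream. */
        margo_instance_id mid = margo_init("na+sm", MARGO_SERVER_MODE, 0, -1);
        if (mid == MARGO_INSTANCE_NULL) return 1;

        /* Register the RPC so clients and other microservices can
         * compose with this provider; no input/output payloads here. */
        hg_id_t rpc_id = MARGO_REGISTER(mid, "hello", void, void, hello_world);
        margo_registered_disable_response(mid, rpc_id, HG_TRUE);

        /* Print this server's address so a client can reach it. */
        hg_addr_t self;
        char addr[128];
        hg_size_t addr_size = sizeof(addr);
        margo_addr_self(mid, &self);
        margo_addr_to_string(mid, addr, &addr_size, self);
        margo_addr_free(mid, self);
        printf("Server running at %s\n", addr);

        /* Block until margo_finalize() is called. */
        margo_wait_for_finalize(mid);
        return 0;
    }

In this style, a specialized data service would be assembled by starting several such providers (for example, a key-value microservice alongside a blob-storage microservice) within one Margo instance and wiring their RPCs together from client libraries.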
