Journal of Computer Science and Technology ›› 2020, Vol. 35 ›› Issue (1): 47-60. doi: 10.1007/s11390-020-9798-5
Special Section: Computer Architecture and Systems
Qi Chen1, Kang Chen1, Zuo-Ning Chen2, Fellow, CCF, Wei Xue1, Xu Ji1,3, Bin Yang3,4
In high-performance computing (HPC) systems, I/O interference and storage resource misallocation make it hard for application I/O performance to approach the peak bandwidth of the storage system. However, current supercomputers, including Sunway TaihuLight, do not handle these problems effectively. We carried out a series of studies on the Sunway TaihuLight storage system to mitigate the impact of I/O interference and resource misallocation on application I/O performance. The Sunway TaihuLight storage system adopts an I/O-forwarding architecture and therefore has a long I/O path. To comprehensively analyze and understand these problems and their interplay, we developed an end-to-end performance monitoring and diagnosis tool. The tool not only analyzes each job's end-to-end I/O flow, but also supports inter-job I/O interference analysis and detection of storage system performance bottlenecks. With its help, we found that I/O interference and resource misallocation occur at both the forwarding layer and the storage layer. At the forwarding layer, we developed an application-aware I/O forwarding resource scheduling framework. It derives applications' I/O patterns and demands from jobs' historical execution information and, combined with the current utilization of forwarding resources, allocates forwarding nodes on demand, avoiding both load imbalance across I/O forwarding nodes and conflicts among multiple jobs' I/O on the same forwarding node. At the parallel file system layer, we proposed a performance-based data placement framework. In this framework, a resource-pool abstraction isolates the I/O accesses of different applications, and a performance-based data placement algorithm keeps the performance of a job's parallel I/O processes balanced despite heterogeneous storage devices. Together, these two efforts resolve most of the I/O interference and resource misallocation problems in the Sunway TaihuLight storage system. In addition, for applications with the N-N parallel I/O pattern, we proposed a lightweight storage stack that shortens the I/O path and improves applications' metadata performance. This paper summarizes this work and the experiences and lessons learned along the way. Many HPC systems adopt storage architectures similar to that of Sunway TaihuLight, and our work, experiences, and lessons can serve as a reference for the design and optimization of storage systems with such architectures.
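The performance-based data placement policy is only summarized at a high level above. As a concrete illustration, the sketch below shows one way such a policy could look in Python; the target names, bandwidth figures, and the greedy load-balancing heuristic are assumptions for illustration only, not the paper's actual algorithm.

```python
"""Illustrative sketch of performance-based data placement (an assumption,
not the paper's implementation): a job's parallel I/O processes are
assigned to the storage targets of its resource pool so that faster
targets absorb more work and per-process I/O time stays balanced."""
import heapq
from collections import defaultdict


def place_processes(num_procs, data_per_proc_gib, target_bw_gibps):
    """Greedily assign each I/O process to the target whose projected
    completion time is currently smallest, given measured per-target
    bandwidths in GiB/s."""
    # Min-heap of (projected finish time in seconds, target name).
    heap = [(0.0, target) for target in target_bw_gibps]
    heapq.heapify(heap)
    placement = defaultdict(list)
    for proc in range(num_procs):
        finish, target = heapq.heappop(heap)
        placement[target].append(proc)
        # This process extends the chosen target's projected finish time.
        finish += data_per_proc_gib / target_bw_gibps[target]
        heapq.heappush(heap, (finish, target))
    return dict(placement)


if __name__ == "__main__":
    # Hypothetical resource pool reserved for one job, with heterogeneous
    # measured bandwidths (names and numbers are made up for illustration).
    pool = {"ost-a": 2.0, "ost-b": 2.0, "ost-c": 1.0}
    print(place_processes(num_procs=10, data_per_proc_gib=4.0,
                          target_bw_gibps=pool))
```

Because assignments are weighted by measured bandwidth rather than made round-robin, a slower device in a heterogeneous pool receives proportionally fewer processes, which is the balancing property the abstract describes.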