An Introduction to Optimizations of the Sunway Storage System for Improving Application I/O Performance

Lessons Learned from Optimizing the Sunway Storage System for Higher Application I/O Performance

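  • Abstract (Chinese version): In high-performance computing systems, I/O interference and storage resource misallocation make it hard for application I/O performance to reach the peak bandwidth of the storage system. Current high-performance computers, including Sunway TaihuLight, cannot handle these problems effectively. We carried out a series of studies on the Sunway TaihuLight storage system to mitigate the impact of I/O interference and resource misallocation on application I/O performance. The storage system adopts an I/O forwarding architecture and therefore has a relatively long I/O path. To comprehensively analyze and understand these problems and the connections between them, we developed an end-to-end performance monitoring and diagnosis tool. The tool can analyze the end-to-end I/O flow of a single job, and it can also analyze I/O interference between jobs and locate performance bottlenecks in the storage system. With its help, we found that I/O interference and resource misallocation occur at both the forwarding layer and the storage layer. At the forwarding layer, we developed an application-aware I/O forwarding resource scheduling framework. It derives each application's I/O pattern and demand from the job's execution history and, taking the current usage of forwarding resources into account, allocates them on demand. This avoids unbalanced use of I/O forwarding nodes and prevents the I/O of multiple jobs from conflicting on the same forwarding node. At the parallel file system layer, we propose a performance-based data placement framework. In this framework, a resource pool abstraction isolates the I/O accesses of different applications, and a performance-based data placement algorithm keeps the parallel I/O processes within a job balanced in performance despite differences among storage devices. These two efforts resolved most of the I/O interference and resource misallocation problems in the Sunway TaihuLight storage system. In addition, for applications with the N-N parallel I/O pattern, we proposed a lightweight storage stack that shortens the I/O path and improves the applications' storage metadata performance. This paper summarizes these efforts and the lessons learned along the way. Many high-performance computing systems adopt a storage architecture similar to that of Sunway TaihuLight, so our work and the lessons learned can serve as a reference for the design and optimization of such storage systems.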

     

    Abstract: It is hard for applications to fully utilize the peak bandwidth of the storage system in high-performance computers because of I/O interference, storage resource misallocation, and complex, long I/O paths. We performed several studies to bridge this gap in the Sunway storage system, which serves the supercomputer Sunway TaihuLight. To locate these issues and the connections between them, we developed an end-to-end performance monitoring and diagnosis tool for understanding the I/O behavior of applications and of the system. With the help of the tool, we were able to identify the root causes of these performance barriers at the I/O forwarding layer and the parallel file system (PFS) layer. An application-aware I/O forwarding allocation framework was used to address I/O interference and resource misallocation at the I/O forwarding layer. A performance-aware data placement mechanism was proposed to mitigate the impact of I/O interference and of performance variations among storage devices in the PFS. Together, these two techniques gave applications much better I/O performance. During the process, we also proposed a lightweight storage stack to shorten the I/O path of applications with the N-N I/O pattern. This paper summarizes these studies and presents the lessons learned from the process.

     
