
MPI-RCDD: A Framework for MPI Runtime Communication Deadlock Detection

Abstract: The Message Passing Interface (MPI) has become the de facto standard programming model for high-performance computing, but its rich and flexible interface semantics make programs prone to communication deadlocks, which seriously affect the usability of the system. Existing MPI communication deadlock detection tools, however, do not scale well enough to keep pace with continually growing system sizes. We therefore propose MPI-RCDD, a framework for MPI runtime communication deadlock detection built on three main mechanisms. First, MPI-RCDD implements a message logging protocol tailored to deadlock detection, which ensures that the communication messages required for deadlock analysis are not lost. Second, it uses the asynchronous message-processing thread provided by the MPI environment to propagate dependencies between processes, so that many processes can take part in deadlock detection simultaneously, alleviating the performance bottleneck of centralized analysis. Third, it introduces AODA, a deadlock analysis algorithm based on the AND⊕OR model. AODA combines timeout-based and dependency-based deadlock analysis: processes that have timed out while waiting search for a deadlock cycle or knot as their dependencies are propagated. The algorithm accurately pinpoints the processes that cause the deadlock and produces no false positives. Experiments on typical MPI communication deadlock test programs, such as the Umpire Test Suite, validate the effectiveness of MPI-RCDD, and experiments on the NPB benchmarks show satisfactory performance overhead, indicating that the framework is highly scalable.
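To make the dependency-based analysis mentioned in the abstract more concrete, the following is a minimal, centralized sketch of deadlock detection on a wait-for graph with AND and OR waits. It is not the AODA algorithm, which is distributed and timeout-driven and is not specified in the abstract. The function name `find_deadlocked`, the `wait_for` layout, and the reading of AND waits as "needs every target" (for example, a receive from one specific rank) and OR waits as "needs any one target" (for example, a wildcard receive) are assumptions made for illustration only.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical input layout: a blocked rank maps to (mode, targets).
# mode is "AND" (needs all targets) or "OR" (needs any one target).
# Ranks that do not appear in the map are assumed to be running, not blocked.
WaitFor = Dict[int, Tuple[str, List[int]]]


def find_deadlocked(num_ranks: int, wait_for: WaitFor) -> Set[int]:
    """Return the ranks that can never be released, using graph reduction."""
    # Ranks that are not blocked can always make progress.
    releasable = {r for r in range(num_ranks) if r not in wait_for}

    # Repeatedly release blocked ranks whose wait condition can be met
    # by ranks already known to be releasable, until nothing changes.
    changed = True
    while changed:
        changed = False
        for rank, (mode, targets) in wait_for.items():
            if rank in releasable:
                continue
            hits = [t in releasable for t in targets]
            satisfied = all(hits) if mode == "AND" else any(hits)
            if satisfied:
                releasable.add(rank)
                changed = True

    # Whatever could not be released is deadlocked.
    return set(range(num_ranks)) - releasable


# Ranks 0 and 1 wait on each other (a cycle); rank 2 is running.
cycle = {0: ("AND", [1]), 1: ("AND", [0])}
print(find_deadlocked(3, cycle))

# Rank 0 can be released by running rank 2, which in turn releases rank 1.
or_escape = {0: ("OR", [1, 2]), 1: ("AND", [0])}
print(find_deadlocked(3, or_escape))
```

In this sketch, a cycle among AND waits is enough to cause deadlock, whereas a set of OR waits is deadlocked only when no path leads out of the set to a running rank, which corresponds to a knot. A runtime tool would additionally have to gather these dependencies from live processes, which the sketch does not attempt.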

     
