Journal of Computer Science and Technology ›› 2023, Vol. 38 ›› Issue (1): 128-145. DOI: 10.1007/s11390-023-2907-5

Special Section: Computer Architecture and Systems



High Performance MPI over the Slingshot Interconnect

Kawthar Shafie Khorassani, Chen-Chun Chen, Bharath Ramesh, Aamir Shafi, Hari Subramoni, Member, ACM, IEEE, and Dhabaleswar K. Panda, Fellow, IEEE, Member, ACM        

  1. Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, U.S.A.
  • Received: 2022-10-16; Revised: 2022-10-29; Accepted: 2023-01-05; Online: 2023-02-28; Published: 2023-02-28
  • Contact: Kawthar Shafie Khorassani, E-mail: shafiekhorassani.1@osu.edu
  • About author: Kawthar Shafie Khorassani is a Ph.D. student in the Department of Computer Science and Engineering at The Ohio State University, Columbus. She received her Bachelor's degree in mathematics and computer science from Wayne State University in Detroit, MI. She currently works in the Network Based Computing Laboratory on the MVAPICH2-GDR project. Her research interests lie in high-performance computing (HPC) and in GPU communication and computation.
  • Supported by:
    This work is supported in part by the National Science Foundation of USA under Grant Nos. 1818253, 1854828, 1931537, and 2007991, and by XRAC under Grant No. NCR-130002. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.


Abstract: The Slingshot interconnect designed by HPE/Cray is becoming more relevant in high-performance computing with its deployment on the upcoming exascale systems. In particular, it is the interconnect empowering the first exascale and highest-ranked supercomputer in the world, Frontier. It offers various features such as adaptive routing, congestion control, and isolated workloads. The deployment of newer interconnects sparks interest related to performance, scalability, and any potential bottlenecks as they are critical elements contributing to the scalability across nodes on these systems. In this paper, we delve into the challenges the Slingshot interconnect poses with current state-of-the-art MPI (message passing interface) libraries. In particular, we look at the scalability performance when using Slingshot across nodes. We present a comprehensive evaluation using various MPI and communication libraries including Cray MPICH, OpenMPI + UCX, RCCL, and MVAPICH2 on CPUs and GPUs on the Spock system, an early access cluster deployed with Slingshot-10, AMD MI100 GPUs and AMD Epyc Rome CPUs to emulate the Frontier system. We also evaluate preliminary CPU-based support of MPI libraries on the Slingshot-11 interconnect.
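
To give a concrete sense of the kind of measurement that underlies such an evaluation, the sketch below shows a minimal inter-node MPI ping-pong micro-benchmark in C, in the spirit of the OSU Micro-Benchmarks used in this line of work. It is illustrative only: the message sizes, iteration counts, and output format are assumptions rather than the paper's actual configuration, and it uses host (CPU) buffers; a GPU-resident variant would allocate the buffers with hipMalloc and require a ROCm-aware MPI build such as those evaluated above.

```c
/* Minimal inter-node ping-pong latency sketch (host buffers only). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "Run with exactly 2 ranks, one per node.\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    const int max_bytes = 1 << 22;       /* sweep message sizes up to 4 MiB */
    const int iters = 100, warmup = 10;  /* illustrative counts */
    char *buf = malloc(max_bytes);

    for (int bytes = 1; bytes <= max_bytes; bytes *= 2) {
        MPI_Barrier(MPI_COMM_WORLD);
        double start = 0.0;
        for (int i = 0; i < iters + warmup; i++) {
            if (i == warmup) start = MPI_Wtime();  /* exclude warmup rounds */
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        if (rank == 0) {
            /* one-way latency: total time / (2 directions * iterations) */
            double usec = (MPI_Wtime() - start) * 1e6 / (2.0 * iters);
            printf("%10d bytes  %10.2f us one-way latency\n", bytes, usec);
        }
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```

To exercise the Slingshot path rather than intra-node shared memory, launch with one rank on each of two nodes; the exact launcher and flags depend on the MPI library and scheduler, for example `srun -N 2 -n 2 ./pingpong` on a Slurm-managed system (where `pingpong` is a hypothetical binary name).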

Key words: AMD GPU, interconnect technology, MPI (message passing interface), Slingshot
