Journal of Computer Science and Technology ›› 2023, Vol. 38 ›› Issue (1): 166-195. DOI: 10.1007/s11390-023-2894-6

Special Topic: Survey; Computer Architecture and Systems; Artificial Intelligence and Pattern Recognition


xCCL: A Survey of Industry-Led Collective Communication Libraries for Deep Learning

Adam Weingram, Yuke Li (李雨珂), Student Member, ACM, Hao Qi (戚昊), Darren Ng, Liuyao Dai (代柳瑶), and Xiaoyi Lu (鲁小亿), Member, ACM, IEEE        

  1. Department of Computer Science and Engineering, University of California, Merced, Merced 95343, U.S.A.
  • Received: 2022-10-08; Revised: 2022-11-09; Accepted: 2023-01-03; Online: 2023-02-28; Published: 2023-02-28
  • Contact: Adam Weingram, E-mail: aweingram@ucmerced.edu
  • About the author: Adam Weingram is a Ph.D. student in the Parallel and Distributed Systems Laboratory (PADSYS Lab) of the Department of Computer Science and Engineering at the University of California, Merced (UCM). Previously, he received his B.S. degree in computer science from UCM. His research interests include systems for machine learning and applications of computer science in remote sensing.
  • Supported by:
    This work was supported in part by the National Science Foundation of USA under Grant No. CCF #2132049, a Google Research Award, and a Meta Faculty Research Award. This work used the Expanse cluster at SDSC (San Diego Supercomputer Center) through allocation CIS210053 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by the National Science Foundation of USA under Grant Nos. 2138259, 2138286, 2138307, 2137603, and 2138296.

1. Background
Deep learning techniques are now widely used in both industry and academia. The growing use of large-scale deep learning models and the large volumes of data they must exchange keep pushing the performance requirements of distributed training. Collective communication makes the data transfer among the many devices involved in distributed training simple and efficient, and it has become an indispensable part of distributed machine learning. Collectives are implemented by communication libraries: besides the Message Passing Interface (MPI) libraries long used in high-performance computing, industry has optimized collective communication libraries (CCLs) specifically for the characteristics of deep learning workloads and their underlying hardware, with each vendor providing its own implementation.
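To make the role of collectives in data-parallel training concrete, the sketch below averages gradients with one All-Reduce call per parameter tensor. This is an illustrative example of ours, not code from the paper or any surveyed library; it assumes PyTorch's torch.distributed with the NCCL backend and a launcher such as torchrun that provides the rendezvous environment variables.

# Minimal data-parallel gradient-averaging sketch (assumes launch via torchrun).
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    # Sum each gradient across all ranks, then divide by the world size.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # collective call
            param.grad /= world_size

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")            # NCCL moves the GPU buffers
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    model = torch.nn.Linear(1024, 1024).cuda()
    loss = model(torch.randn(32, 1024, device="cuda")).sum()
    loss.backward()
    average_gradients(model)   # every rank now holds identical, averaged gradients
    dist.destroy_process_group()

The call site stays the same regardless of whether NCCL internally runs a ring or a tree All-Reduce; the algorithm choice is transparent to the training code.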
2. Objective
Moving from the MPI libraries widely used in academia to the various collective communication libraries offered by industry raises the following questions: 1) why does each company re-implement its own communication library instead of adopting the classic MPI libraries from academia? 2) how do these different libraries compare in performance? 3) how are these libraries designed, do their designs share commonalities, and why are certain design choices the same?
To answer these questions, this paper surveys the communication libraries provided by different companies that are widely used in deep learning, which we collectively call xCCL, including NCCL, oneCCL, RCCL, MSCCL, ACCL, and Gloo. We discuss and study their designs and use cases, and we design experiments to compare the performance of several of them, from which we draw our observations and conclusions and provide a reference for the design of future communication libraries.
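As a concrete illustration of the interface overlap behind question 3, the sketch below contrasts a classic MPI All-Reduce with the equivalent call routed through an xCCL backend. This is our own minimal example, assuming mpi4py and PyTorch with a working NCCL backend; it is not taken from any of the surveyed libraries' documentation.

# Two routes to the same All-Reduce contract (illustrative only).
# Assumes dist.init_process_group(backend="nccl") has been called by the launcher script.
import torch
import torch.distributed as dist
from mpi4py import MPI

def mpi_sum(value: float) -> float:
    # Classic MPI_Allreduce semantics via mpi4py.
    return MPI.COMM_WORLD.allreduce(value, op=MPI.SUM)

def xccl_sum(value: float) -> torch.Tensor:
    # Same contract via torch.distributed, which dispatches to an xCCL backend
    # (NCCL, Gloo, or oneCCL) selected in dist.init_process_group(...).
    t = torch.tensor([value], device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    return t

The shared contract, in which every rank contributes a buffer and every rank receives the reduced result, is what lets frameworks swap MPI for an xCCL, or one xCCL for another, without changing the training code.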
3. Methods
The paper is organized as follows. Section 2 introduces the communication patterns used in collective communication; Sections 3 and 4 describe common hardware network topologies and communication algorithms, respectively; Section 5 explains where and how collectives are used in deep learning training, together with several industry case studies; Section 6 surveys the different communication libraries provided by industry and their respective features; Section 7 is the experimental section, in which we benchmark a subset of the libraries and compare their performance; Section 8 discusses the observations and open questions arising from our survey and experiments; Section 9 reviews related work; and Section 10 concludes the paper.
4. Results
(1) Why are xCCLs more attractive than traditional MPI? First, deep learning is usually deployed on GPUs, and industry prefers to optimize communication libraries for this specific hardware. Second, the hardware-specific libraries from industry are more lightweight, and being designed around deep learning makes it easier for developers to integrate them into their own code. Finally, traditional MPI's performance optimizations for specific hardware such as GPUs are weaker than those of the xCCLs.
(2) Which library delivers the best performance? Our current experiments show that NCCL delivers good performance, but developers keep improving and optimizing their respective libraries. For example, MSCCL lets users redesign the communication algorithms around the communication characteristics of their applications so as to obtain the best performance for a particular program. (A minimal timing sketch in the spirit of these benchmarks appears after finding (4) below.)
(3) Design similarities and differences among the libraries: from the perspective of communication patterns, because data communication in deep learning resembles the communication models offered by traditional MPI, and because many xCCLs are built on top of NVIDIA's NCCL, the xCCLs retain the collective communication models defined by MPI, such as All-Reduce and Broadcast. Different xCCLs nevertheless have different characteristics; for example, they follow different open-source licensing policies and are optimized for different GPU models.
(4) Will the performance of today's commonly used networks become a bottleneck for the xCCLs? Although network bandwidth can affect library performance, careful design lets a library use the limited bandwidth more efficiently. At the same time, vendors keep designing and optimizing their own dedicated network hardware to reach higher bandwidth and faster data transfer.
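To complement finding (2), the following is a rough timing sketch in the spirit of the NCCL Tests all_reduce_perf benchmark, assuming PyTorch with the NCCL backend and a multi-GPU launch via torchrun. The 2(n-1)/n bus-bandwidth factor mirrors the nccl-tests reporting methodology for All-Reduce; the real benchmarks (NCCL Tests and PARAM) remain the authoritative way to compare libraries.

# Rough All-Reduce bandwidth probe (illustrative; not a replacement for nccl-tests).
import time
import torch
import torch.distributed as dist

def allreduce_busbw(nbytes: int, iters: int = 20) -> float:
    # Time All-Reduce on a float32 buffer and return bus bandwidth in GB/s.
    world = dist.get_world_size()
    buf = torch.empty(nbytes // 4, dtype=torch.float32, device="cuda")
    for _ in range(5):                           # warm-up iterations
        dist.all_reduce(buf)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(buf)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters
    algbw = nbytes / elapsed / 1e9               # algorithm bandwidth
    return algbw * 2 * (world - 1) / world       # ring All-Reduce bus-bandwidth factor

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    for size in (1 << 20, 1 << 24, 1 << 28):     # 1 MiB, 16 MiB, 256 MiB
        bw = allreduce_busbw(size)
        if dist.get_rank() == 0:
            print(f"{size} bytes: {bw:.2f} GB/s bus bandwidth")
    dist.destroy_process_group()

Saved as, e.g., bench.py (a hypothetical file name), this would be launched with torchrun --nproc_per_node=<gpus> bench.py on a single multi-GPU node.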
5. Conclusions
This paper presents a broad survey of the collective communication libraries (xCCL) commonly used in deep learning. Starting from the fundamentals of inter-node communication and network topologies, we discuss the data-transfer algorithms used by collectives, and we explore the similarities and differences among the libraries by comparing the designs offered by different companies and analyzing real-world case studies. We evaluate xCCL performance by benchmarking two industry libraries, NCCL and MSCCL, and analyze the results. We also discuss why the xCCLs have drawn growing attention in industry despite the existence of classic MPI libraries, and further explain how these libraries exploit hardware accelerators and high-speed interconnects to support large-scale deep learning model training. Based on our survey, we consider NCCL the most mature collective communication library to date. We hope that future work can study NCCL in greater depth so that its optimizations can be effectively applied to other communication libraries.

Keywords: collective communication, deep learning, distributed training, GPUDirect, remote direct memory access (RDMA)

Abstract: Machine learning techniques have become ubiquitous both in industry and academic applications. Increasing model sizes and training data volumes necessitate fast and efficient distributed training approaches. Collective communications greatly simplify inter- and intra-node data transfer and are an essential part of the distributed training process as information such as gradients must be shared between processing nodes. In this paper, we survey the current state-of-the-art collective communication libraries (namely xCCL, including NCCL, oneCCL, RCCL, MSCCL, ACCL, and Gloo), with a focus on the industry-led ones for deep learning workloads. We investigate the design features of these xCCLs, discuss their use cases in the industry deep learning workloads, compare their performance with industry-made benchmarks (i.e., NCCL Tests and PARAM), and discuss key take-aways and interesting observations. We believe our survey sheds light on potential research directions of future designs for xCCLs.

Key words: collective, deep learning, distributed training, GPUDirect, RDMA (remote direct memory access)
