Journal of Computer Science and Technology ›› 2023, Vol. 38 ›› Issue (1): 166-195. DOI: 10.1007/s11390-023-2894-6

Special Issue: Surveys; Computer Architecture and Systems; Artificial Intelligence and Pattern Recognition

• Special Issue in Honor of Professor Kai Hwang’s 80th Birthday •

xCCL: A Survey of Industry-Led Collective Communication Libraries for Deep Learning

Adam Weingram, Yuke Li (李雨珂), Student Member, ACM, Hao Qi (戚昊), Darren Ng, Liuyao Dai (代柳瑶), and Xiaoyi Lu (鲁小亿), Member, ACM, IEEE        

  1. Department of Computer Science and Engineering, University of California, Merced, Merced, CA 95343, U.S.A.
  • Received: 2022-10-08; Revised: 2022-11-09; Accepted: 2023-01-03; Online: 2023-02-28; Published: 2023-02-28
  • Contact: Adam Weingram, E-mail: aweingram@ucmerced.edu
  • About author: Adam Weingram is a Ph.D. student in the Parallel and Distributed Systems Laboratory (PADSYS Lab) of the Department of Computer Science and Engineering at the University of California, Merced (UCM). Previously, he received his B.S. degree in computer science from UCM. His research interests include systems for machine learning and applications of computer science in remote sensing.
  • Supported by:
    This work was supported in part by the National Science Foundation of USA under Grant No. CCF-2132049, a Google Research Award, and a Meta Faculty Research Award. This work used the Expanse cluster at SDSC (San Diego Supercomputer Center) through allocation CIS210053 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by the National Science Foundation of USA under Grant Nos. 2138259, 2138286, 2138307, 2137603, and 2138296.

Machine learning techniques have become ubiquitous in both industry and academic applications. Increasing model sizes and training data volumes necessitate fast and efficient distributed training approaches. Collective communications greatly simplify inter- and intra-node data transfer and are an essential part of the distributed training process, as information such as gradients must be shared between processing nodes. In this paper, we survey the current state-of-the-art collective communication libraries (namely xCCL, including NCCL, oneCCL, RCCL, MSCCL, ACCL, and Gloo), with a focus on the industry-led ones for deep learning workloads. We investigate the design features of these xCCLs, discuss their use cases in industry deep learning workloads, compare their performance using industry-made benchmarks (i.e., NCCL Tests and PARAM), and discuss key takeaways and interesting observations. We believe our survey sheds light on potential research directions for future xCCL designs.

Key words: collective; deep learning; distributed training; GPUDirect; RDMA (remote direct memory access)
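To make concrete the abstract's point that collectives carry gradient traffic during distributed training, the minimal sketch below uses PyTorch's torch.distributed package, which dispatches to an xCCL backend (NCCL on GPUs, Gloo on CPUs). This is an illustration of ours, not code from the paper; the script name allreduce_demo.py is hypothetical.

```python
# Minimal all-reduce sketch: each rank contributes a "gradient" tensor and
# receives the element-wise sum, as in data-parallel gradient aggregation.
# Launch with: torchrun --nproc_per_node=4 allreduce_demo.py
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR/PORT, and LOCAL_RANK.
    use_cuda = torch.cuda.is_available()
    dist.init_process_group(backend="nccl" if use_cuda else "gloo")

    rank = dist.get_rank()
    if use_cuda:
        device = torch.device(f"cuda:{os.environ.get('LOCAL_RANK', '0')}")
        torch.cuda.set_device(device)
    else:
        device = torch.device("cpu")

    # Stand-in for a local gradient: every element equals this rank's id.
    grad = torch.full((4,), float(rank), device=device)

    # In-place sum across all ranks; afterwards every rank holds the same
    # tensor, mirroring how all workers end up with identical gradients.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {grad.tolist()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

With four ranks, every process prints [6.0, 6.0, 6.0, 6.0] (0+1+2+3); whether NCCL or Gloo performs the reduction changes the performance, not the semantics.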

<table class="reference-tab" style="background-color:#FFFFFF;width:914.104px;color:#333333;font-family:Calibri, Arial, 微软雅黑, "font-size:16px;"> <tbody> <tr class="document-box" id="b1"> <td valign="top" class="td1"> [1] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Hwang K, Xu Z W. Scalable Parallel Computing: Technology, Architecture, Programming. McGraw-Hill, 1998. </div> </td> </tr> <tr class="document-box" id="b2"> <td valign="top" class="td1"> [2] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Brown T B, Mann B, Ryder N et al. Language models are few-shot learners. In <i>Proc</i>. <i>the 34th Int. Conf. Neural Information Processing Systems</i>, Dec. 2020, pp.1877-1901. </div> </td> </tr> <tr class="document-box" id="b3"> <td valign="top" class="td1"> [3] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Naumov M, Mudigere D, Shi H J M et al. Deep learning recommendation model for personalization and recommendation systems. arXiv: 1906.00091, 2019. <a href="https://arxiv.org/abs/1906.00091">https://arxiv.org/abs/1906.00091</a>, Jan. 2023. </div> </td> </tr> <tr class="document-box" id="b4"> <td valign="top" class="td1"> [4] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Bayatpour M, Chakraborty S, Subramoni H, Lu X Y, Panda D K. Scalable reduction collectives with data partitioning-based multi-leader design. In <i>Proc</i>. <i>the 2017 Int. Conf. High Performance Computing</i>, <i>Networking</i>, <i>Storage and Analysis</i> (<i>SC</i>), Nov. 2017. DOI: <a href="https://doi.org/10.1145/3126908.3126954">10.1145/3126908.3126954</a>. </div> </td> </tr> <tr class="document-box" id="b5"> <td valign="top" class="td1"> [5] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Chu C H, Lu X Y, Awan A A, Subramoni H, Hashmi J, Elton B, Panda D K. Efficient and scalable multi-source streaming broadcast on GPU clusters for deep learning. In <i>Proc</i>. <i>the 46th Int. Conf. Parallel Processing</i> (<i>ICPP</i>), Aug. 2017, pp.161-170. DOI: <a href="https://doi.org/10.1109/ICPP.2017.25">10.1109/ICPP.2017.25</a>. </div> </td> </tr> <tr class="document-box" id="b6"> <td valign="top" class="td1"> [6] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Panda D K, Lu X Y, Shankar D. High-Performance Big Data Computing. The MIT Press, 2022. </div> </td> </tr> <tr class="document-box" id="b7"> <td valign="top" class="td1"> [7] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Lu X Y, Islam N S, Wasi-Ur-Rahman et al. High-performance design of Hadoop RPC with RDMA over InfiniBand. In <i>Proc</i>. <i>the 42nd ICPP</i>, Oct. 2013, pp.641-650. DOI: <a href="https://doi.org/10.1109/ICPP.2013.78">10.1109/ICPP.2013.78</a>. </div> </td> </tr> <tr class="document-box" id="b8"> <td valign="top" class="td1"> [8] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Wasi-Ur-Rahman, Lu X Y, Islam N S, Panda D K. HOMR: A hybrid approach to exploit maximum overlapping in MapReduce over high performance interconnects. In <i>Proc</i>. <i>the 28th ACM Int. Conf. Supercomputing</i> (<i>ICS</i>), Jun. 2014, pp.33-42. DOI: <a href="https://doi.org/10.1145/2597652.2597684">10.1145/2597652.2597684</a>. 
</div> </td> </tr> <tr class="document-box" id="b9"> <td valign="top" class="td1"> [9] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Islam N S, Lu X Y, Wasi-Ur-Rahman, Panda D K. SOR-HDFS: A SEDA-based approach to maximize overlapping in RDMA-enhanced HDFS. In <i>Proc</i>. <i>the 23rd Int. Symp. High–Performance Parallel and Distributed Computing</i>, Jun. 2014, pp.261-264. DOI: <a href="https://doi.org/10.1145/2600212.2600715">10.1145/2600212.2600715</a>. </div> </td> </tr> <tr class="document-box" id="b10"> <td valign="top" class="td1"> [10] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Lu X Y, Shankar D, Gugnani S, Panda D K. High-performance design of Apache Spark with RDMA and its benefits on various workloads. In <i>Proc</i>. <i>the 2016 IEEE Int. Conf. Big Data</i>, Dec. 2016, pp.253-262. DOI: <a href="https://doi.org/10.1109/BigData.2016.7840611">10.1109/BigData.2016.7840611</a>. </div> </td> </tr> <tr class="document-box" id="b11"> <td valign="top" class="td1"> [11] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Kalia A, Kaminsky M, Andersen D G. Using RDMA efficiently for key-value services. In <i>Proc</i>. <i>the 2014 ACM Conference on SIGCOMM</i>, Aug. 2014, pp.295-306. DOI: <a href="https://doi.org/10.1145/2619239.2626299">10.1145/2619239.2626299</a>. </div> </td> </tr> <tr class="document-box" id="b12"> <td valign="top" class="td1"> [12] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Shankar D, Lu X Y, Panda D K. SCOR-KV: SIMD-aware client-centric and optimistic RDMA-based key-value store for emerging CPU architectures. In <i>Proc</i>. <i>the 2019 SC</i>, Dec. 2019, pp.257-266. DOI: <a href="https://doi.org/10.1109/HiPC.2019.00040">10.1109/HiPC.2019.00040</a>. </div> </td> </tr> <tr class="document-box" id="b13"> <td valign="top" class="td1"> [13] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Dragojević A, Narayanan D, Hodson O, Castro M. FaRM: Fast remote memory. In <i>Proc</i>. <i>the 11th USENIX Symposium on Networked Systems Design and Implementation</i>, Apr. 2014, pp.401-414. </div> </td> </tr> <tr class="document-box" id="b14"> <td valign="top" class="td1"> [14] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Shankar D, Lu X Y, Islam N, Wasi-Ur-Rahman, Panda D K. High-performance hybrid key-value store on modern clusters with RDMA interconnects and SSDs: Non-blocking extensions, designs, and benefits. In <i>Proc</i>. <i>the 2016 IEEE International Parallel and Distributed Processing Symposium</i> (<i>IPDPS</i>), May 2016, pp.393-402. DOI: <a href="https://doi.org/10.1109/IPDPS.2016.112">10.1109/IPDPS.2016.112</a>. </div> </td> </tr> <tr class="document-box" id="b15"> <td valign="top" class="td1"> [15] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Gugnani S, Lu X Y, Panda D K. Swift-X: Accelerating OpenStack swift with RDMA for building an efficient HPC cloud. In <i>Proc</i>. <i>the 17th IEEE/ACM International Symposium on Cluster</i>, <i>Cloud and Grid Computing</i>, May 2017, pp.238-247. DOI: <a href="https://doi.org/10.1109/CCGRID.2017.103">10.1109/CCGRID.2017.103</a>. </div> </td> </tr> <tr class="document-box" id="b16"> <td valign="top" class="td1"> [16] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Gugnani S, Lu X Y, Panda D K. 
Designing virtualization-aware and automatic topology detection schemes for accelerating Hadoop on SR-IOV-enabled clouds. In <i>Proc</i>. <i>the 2016 IEEE Int. Conf. Cloud Computing Technology and Science</i>, Dec. 2016, pp.152-159. DOI: <a href="https://doi.org/10.1109/CloudCom.2016.0037">10.1109/CloudCom.2016.0037</a>. </div> </td> </tr> <tr class="document-box" id="b17"> <td valign="top" class="td1"> [17] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Zhang J, Lu X Y, Panda D K. Designing locality and NUMA aware MPI runtime for nested virtualization based HPC cloud with SR-IOV enabled InfiniBand. In <i>Proc</i>. <i>the 13th ACM SIGPLAN/SIGOPS Int. Conf. Virtual ution Environments</i>, Apr. 2017, pp.187-200. DOI: <a href="https://doi.org/10.1145/3050748.3050765">10.1145/3050748.3050765</a>. </div> </td> </tr> <tr class="document-box" id="b18"> <td valign="top" class="td1"> [18] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Chu C H, Lu X Y, Awan A A <i>et al</i>. Exploiting hardware multicast and GPUDirect RDMA for efficient broadcast. <i>IEEE Trans. Parallel and Distributed Systems</i>, 2019, 30(3): 575–588. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1109/TPDS.2018.2867222" target="_blank">10.1109/TPDS.2018.2867222</a>. </div> </td> </tr> <tr class="document-box" id="b19"> <td valign="top" class="td1"> [19] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Zhang J, Lu X Y, Chu C H, Panda D K. C-GDR: High-performance container-aware GPUDirect MPI communication schemes on RDMA networks. In <i>Proc</i>. <i>the 2019 IPDPS</i>, May 2019, pp.242-251. DOI: <a href="https://doi.org/10.1109/IPDPS.2019.00034">10.1109/IPDPS.2019.00034</a>. </div> </td> </tr> <tr class="document-box" id="b20"> <td valign="top" class="td1"> [20] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Li Y K, Qi H, Lu G, Jin F, Guo Y F, Lu X Y. Understanding hot interconnects with an extensive benchmark survey. <i>BenchCouncil Trans. Benchmarks, Standards and Evaluations</i>, 2022, 2(3): 100074. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1016/J.TBENCH.2022.100074" target="_blank">10.1016/J.TBENCH.2022.100074</a>. </div> </td> </tr> <tr class="document-box" id="b21"> <td valign="top" class="td1"> [21] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Pacheco P. An Introduction to Parallel Programming. Elsevier, 2011. DOI: <a href="https://doi.org/10.1016/C2009-0-18471-4">10.1016/C2009-0-18471-4</a>. </div> </td> </tr> <tr class="document-box" id="b22"> <td valign="top" class="td1"> [22] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Gong Y F, He B S, Zhong J L. Network performance aware MPI collective communication operations in the cloud. <i>IEEE Trans. Parallel and Distributed Systems</i>, 2015, 26(11): 3079–3089. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1109/TPDS.2013.96" target="_blank">10.1109/TPDS.2013.96</a>. </div> </td> </tr> <tr class="document-box" id="b23"> <td valign="top" class="td1"> [23] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Brown K A, Domke J, Matsuoka S. Hardware-centric analysis of network performance for MPI applications. In <i>Proc</i>. <i>the 21st IEEE Int. Conf. Parallel and Distributed Systems</i> (<i>ICPADS</i>), Dec. 2015, pp.692-699. 
DOI: <a href="https://doi.org/10.1109/ICPADS.2015.92">10.1109/ICPADS.2015.92</a>. </div> </td> </tr> <tr class="document-box" id="b24"> <td valign="top" class="td1"> [24] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Katseff H P. Incomplete hypercubes. <i>IEEE Trans. Computers</i>, 1988, 37(5): 604–608. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1109/12.4611" target="_blank">10.1109/12.4611</a>. </div> </td> </tr> <tr class="document-box" id="b25"> <td valign="top" class="td1"> [25] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Kalb J L, Lee D S. Network topology analysis. Technical Report SAND2008-0069. Sandia National Laboratories, Albuquerque, New Mexico, 2008. <a href="https://digital.library.unt.edu/ark:/67531/metadc845229/m2/1/high_res_d/1028919.pdf">https://digital.library.unt.edu/ark:/67531/metadc845229/m2/1/high_res_d/1028919.pdf</a>, Jan. 2023. </div> </td> </tr> <tr class="document-box" id="b26"> <td valign="top" class="td1"> [26] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Kim J, Kim H. Router microarchitecture and scalability of ring topology in on-chip networks. In <i>Proc</i>. <i>the 2nd Int. Workshop on Network on Chip Architectures</i>, Dec. 2009, pp.5-10. DOI: <a href="https://doi.org/10.1145/1645213.1645217">10.1145/1645213.1645217</a>. </div> </td> </tr> <tr class="document-box" id="b27"> <td valign="top" class="td1"> [27] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Bouknight W J, Denenberg S A, McIntyre D E, Randall J M, Sameh A H, Slotnick D L. The Illiac IV system. <i>Proceedings of the IEEE</i>, 1972, 60(4): 369–388. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1109/PROC.1972.8647" target="_blank">10.1109/PROC.1972.8647</a>. </div> </td> </tr> <tr class="document-box" id="b28"> <td valign="top" class="td1"> [28] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Cheng S H, Zhong W, Isaacs K E, Mueller K. Visualizing the topology and data traffic of multi-dimensional torus interconnect networks. <i>IEEE Access</i>, 2018, 6: 57191–57204. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1109/ACCESS.2018.2872344" target="_blank">10.1109/ACCESS.2018.2872344</a>. </div> </td> </tr> <tr class="document-box" id="b29"> <td valign="top" class="td1"> [29] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Romanov A Y, Amerikanov A A, Lezhnev E V. Analysis of approaches for synthesis of networks-on-chip by using circulant topologies. <i>Journal of Physics: Conference Series</i>, 2018, 1050(1): 012071. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1088/1742-6596/1050/1/012071" target="_blank">10.1088/1742-6596/1050/1/012071</a>. </div> </td> </tr> <tr class="document-box" id="b30"> <td valign="top" class="td1"> [30] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Ravankar A A, Sedukhin S G. Mesh-of-Tori: A novel interconnection network for frontal plane cellular processors. In <i>Proc</i>. <i>the 1st Int. Conf. Networking and Computing</i>, Nov. 2010, pp.281-284. DOI: <a href="https://doi.org/10.1109/IC-NC.2010.30">10.1109/IC-NC.2010.30</a>. </div> </td> </tr> <tr class="document-box" id="b31"> <td valign="top" class="td1"> [31] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Pham P H, Mau P, Kim C. 
A 64-PE folded-torus intra-chip communication fabric for guaranteed throughput in network-on-chip based applications. In <i>Proc</i>. <i>the 2009 IEEE Custom Integrated Circuits Conference</i>, Sept. 2009, pp.645-648. DOI: <a href="https://doi.org/10.1109/CICC.2009.5280748">10.1109/CICC.2009.5280748</a>. </div> </td> </tr> <tr class="document-box" id="b32"> <td valign="top" class="td1"> [32] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Al-Fares M, Loukissas A, Vahdat A. A scalable, commodity data center network architecture. <i>ACM SIGCOMM Computer Communication Review</i>, 2008, 38(4): 63–74. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1145/1402946.1402967" target="_blank">10.1145/1402946.1402967</a>. </div> </td> </tr> <tr class="document-box" id="b33"> <td valign="top" class="td1"> [33] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Leiserson C E, Abuhamdeh Z S, Douglas D C et al. The network architecture of the connection machine CM-5 (extended abstract). In <i>Proc</i>. <i>the 4th Annual ACM Symposium on Parallel Algorithms and Architectures</i>, Jun. 1992, pp.272-285. DOI: <a href="https://doi.org/10.1145/140901.141883">10.1145/140901.141883</a>. </div> </td> </tr> <tr class="document-box" id="b34"> <td valign="top" class="td1"> [34] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Valerio M, Moser L E, Melliar-Smith P M. Recursively scalable fat-trees as interconnection networks. In <i>Proc</i>. <i>the 13th IEEE Annual International Phoenix Conference on Computers and Communications</i>, Apr. 1994. DOI: <a href="https://doi.org/10.1109/PCCC.1994.504091">10.1109/PCCC.1994.504091</a>. </div> </td> </tr> <tr class="document-box" id="b35"> <td valign="top" class="td1"> [35] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Nienaber W. Effective routing on fat-tree topologies [Ph. D. Thesis]. Florida State University, Tallahassee, 2014. </div> </td> </tr> <tr class="document-box" id="b36"> <td valign="top" class="td1"> [36] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Prisacari B, Rodriguez G, Minkenberg C, Hoefler T. Bandwidth-optimal all-to-all exchanges in fat tree networks. In <i>Proc</i>. <i>the 27th ICS</i>, Jun. 2013, pp.139-148. DOI: <a href="https://doi.org/10.1145/2464996.2465434">10.1145/2464996.2465434</a>. </div> </td> </tr> <tr class="document-box" id="b37"> <td valign="top" class="td1"> [37] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Li Y, Pan D. OpenFlow based load balancing for fat-tree networks with multipath support. In <i>Proc</i>. <i>the 12th IEEE International Conference on Communications</i>, Jun. 2013. </div> </td> </tr> <tr class="document-box" id="b38"> <td valign="top" class="td1"> [38] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Kim J, Dally W J, Scott S, Abts D. Technology-driven, highly-scalable dragonfly topology. In <i>Proc</i>. <i>the 2008 International Symposium on Computer Architecture</i>, Jun. 2008, pp.77-88. DOI: <a href="https://doi.org/10.1109/ISCA.2008.19">10.1109/ISCA.2008.19</a>. </div> </td> </tr> <tr class="document-box" id="b39"> <td valign="top" class="td1"> [39] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Teh M Y, Wilke J J, Bergman K, Rumley S. Design space exploration of the dragonfly topology. 
In <i>Lecture Notes in Computer Science 10524</i>, Kunkel J, Yokota R, Taufer M et al. (eds.), Springer. pp.57-74. DOI: <a href="https://doi.org/10.1007/978-3-319-67630-2_5">10.1007/978-3-319-67630-2_5</a>. </div> </td> </tr> <tr class="document-box" id="b40"> <td valign="top" class="td1"> [40] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Prisacari B, Rodriguez G, Garcia M, Vallejo E, Beivide R, Minkenberg C. Performance implications of remote-only load balancing under adversarial traffic in dragonflies. In <i>Proc</i>. <i>the 8th International Workshop on Interconnection Network Architecture</i>: <i>On-Chip</i>, <i>Multi-Chip</i>, Jan. 2014. DOI: <a href="https://doi.org/10.1145/2556857.2556860">10.1145/2556857.2556860</a>. </div> </td> </tr> <tr class="document-box" id="b41"> <td valign="top" class="td1"> [41] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Shpiner A, Haramaty Z, Eliad S, Zdornov V, Gafni B, Zahavi E. Dragonfly+: Low cost topology for scaling datacenters. In <i>Proc</i>. <i>the 3rd IEEE International Workshop on High-Performance Interconnection Networks in the Exascale and Big-Data Era</i> (<i>HiPINEB</i>), Feb. 2017. DOI: <a href="https://doi.org/10.1109/HiPINEB.2017.11">10.1109/HiPINEB.2017.11</a>. </div> </td> </tr> <tr class="document-box" id="b42"> <td valign="top" class="td1"> [42] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Bruck J, Ho C T, Kipnis S, Weathersby D. Efficient algorithms for all-to-all communications in multi-port message-passing systems. In <i>Proc</i>. <i>the 6th Annual ACM Symposium on Parallel Algorithms and Architectures</i>, Aug. 1994, pp.298-309. DOI: <a href="https://doi.org/10.1145/181014.181756">10.1145/181014.181756</a>. </div> </td> </tr> <tr class="document-box" id="b43"> <td valign="top" class="td1"> [43] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Thakur R, Rabenseifner R, Gropp W. Optimization of collective communication operations in MPICH. <i>The International Journal of High Performance Computing Applications</i>, 2005, 19(1): 49–66. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1177/1094342005051521" target="_blank">10.1177/1094342005051521</a>. </div> </td> </tr> <tr class="document-box" id="b44"> <td valign="top" class="td1"> [44] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Pjesivac-Grbovic J. Towards automatic and adaptive optimizations of MPI collective operations [Ph.D. Thesis]. University of Tennessee, Knoxville, 2007. </div> </td> </tr> <tr class="document-box" id="b45"> <td valign="top" class="td1"> [45] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Huse L P. Collective communication on dedicated clusters of workstations. In <i>Proc</i>. <i>the 6th European PVM/MPI Users’ Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface</i>, Sept. 1999, pp.469-476. DOI: <a href="https://doi.org/10.1007/3-540-48158-3_58">10.1007/3-540-48158-3_58</a>. </div> </td> </tr> <tr class="document-box" id="b46"> <td valign="top" class="td1"> [46] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Barnett M, Shuler L, van De Geijn R, Gupta S, Payne D G, Watts J. Interprocessor collective communication library (InterCom). In <i>Proc</i>. <i>the IEEE Scalable High Performance Computing Conference</i>, May 1994, pp.357-364. 
DOI: <a href="https://doi.org/10.1109/SHPCC.1994.296665">10.1109/SHPCC.1994.296665</a>. </div> </td> </tr> <tr class="document-box" id="b47"> <td valign="top" class="td1"> [47] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Shroff M, Van De Geijn R A. CollMark: MPI collective communication benchmark. In <i>Proc</i>. <i>the 2000 ICS</i>, June 29–July 2. </div> </td> </tr> <tr class="document-box" id="b48"> <td valign="top" class="td1"> [48] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Rabenseifner R. Optimization of collective reduction operations. In <i>Proc</i>. <i>the 4th Int. Conf. Computational Science</i>, Jun. 2004. DOI: <a href="https://doi.org/10.1007/978-3-540-24685-5_1">10.1007/978-3-540-24685-5_1</a>. </div> </td> </tr> <tr class="document-box" id="b49"> <td valign="top" class="td1"> [49] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Dong J B, Wang S C, Feng F <i>et al</i>. ACCL: Architecting highly scalable distributed training systems with highly efficient collective communication library. <i>IEEE Micro</i>, 2021, 41(5): 85–92. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1109/MM.2021.3091475" target="_blank">10.1109/MM.2021.3091475</a>. </div> </td> </tr> <tr class="document-box" id="b50"> <td valign="top" class="td1"> [50] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Hockney R W. The communication challenge for MPP: Intel paragon and Meiko CS-2. <i>Parallel Computing</i>, 1994, 20(3): 389–398. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1016/S0167-8191(06)80021-9" target="_blank">10.1016/S0167-8191(06)80021-9</a>. </div> </td> </tr> <tr class="document-box" id="b51"> <td valign="top" class="td1"> [51] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Benson G D, Chu C W, Huang Q, Caglar S G. A comparison of MPICH allgather algorithms on switched networks. In <i>Proc</i>. <i>the 10th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface</i>, Oct. 2003, pp.335-343. DOI: <a href="https://doi.org/10.1007/978-3-540-39924-7_47">10.1007/978-3-540-39924-7_47</a>. </div> </td> </tr> <tr class="document-box" id="b52"> <td valign="top" class="td1"> [52] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Almási G, Heidelberger P, Archer C J et al. Optimization of MPI collective communication on BlueGene/L systems. In <i>Proc</i>. <i>the 19th ICS</i>, Jun. 2005, pp.253-262. DOI: <a href="https://doi.org/10.1145/1088149.1088183">10.1145/1088149.1088183</a>. </div> </td> </tr> <tr class="document-box" id="b53"> <td valign="top" class="td1"> [53] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Sergeev A, Del Balso M. Horovod: Fast and easy distributed deep learning in TensorFlow. arXiv: 1802.05799, 2018. <a href="https://arxiv.org/abs/1802.05799">https://arxiv.org/abs/1802.05799</a>, Jan. 2023. </div> </td> </tr> <tr class="document-box" id="b54"> <td valign="top" class="td1"> [54] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Goyal P, Dollár P, Girshick R et al. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv: 1706.02677, 2017. <a href="https://arxiv.org/abs/1706.02677">https://arxiv.org/abs/1706.02677</a>, Jan.2023. 
</div> </td> </tr> <tr class="document-box" id="b55"> <td valign="top" class="td1"> [55] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Gupta U, Wu C, Wang X et al. The architectural implications of Facebook’s DNN-based personalized recommendation. In <i>Proc</i>. <i>the 2020 IEEE International Symposium on High Performance Computer Architecture</i> (<i>HPCA</i>), Feb. 2020, pp.488-501. DOI: <a href="https://doi.org/10.1109/HPCA47549.2020.00047">10.1109/HPCA47549.2020.00047</a>. </div> </td> </tr> <tr class="document-box" id="b56"> <td valign="top" class="td1"> [56] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Mudigere D, Hao Y, Huang J et al. Software-hardware co-design for fast and scalable training of deep learning recommendation models. In <i>Proc</i>. <i>the 49th Annual International Symposium on Computer Architecture</i>, Jun. 2022, pp.993-1011. DOI: <a href="https://doi.org/10.1145/3470496.3533727">10.1145/3470496.3533727</a>. </div> </td> </tr> <tr class="document-box" id="b57"> <td valign="top" class="td1"> [57] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Paszke A, Gross S, Massa F et al. Pytorch: An imperative style, high-performance deep learning library. In <i>Proc</i>. <i>the 33rd International Conference on Neural Information Processing Systems</i>, Dec. 2019. </div> </td> </tr> <tr class="document-box" id="b58"> <td valign="top" class="td1"> [58] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Khudia D, Huang J Y, Basu P, Deng S, Liu H, Park J, Smelyanskiy M. FBGEMM: Enabling high-performance low-precision deep learning inference. arXiv: 2101.05615, 2021. <a href="https://arxiv.org/abs/2101.05615">https://arxiv.org/abs/2101.05615</a>, Jan. 2023. </div> </td> </tr> <tr class="document-box" id="b59"> <td valign="top" class="td1"> [59] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> He K M, Zhang X Y, Ren S Q, Sun J. Deep residual learning for image recognition. In <i>Proc</i>. <i>the 2016 IEEE Conference on Computer Vision and Pattern Recognition</i> (<i>CVPR</i>), Jun. 2016, pp.770-778. DOI: <a href="https://doi.org/10.1109/CVPR.2016.90">10.1109/CVPR.2016.90</a>. </div> </td> </tr> <tr class="document-box" id="b60"> <td valign="top" class="td1"> [60] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Deng J, Dong W, Socher R, Li L J, Li K, Li F F. ImageNet: A large-scale hierarchical image database. In <i>Proc</i>. <i>the 2009 CVPR</i>, Jun. 2009, pp.248-255. DOI: <a href="https://doi.org/10.1109/CVPR.2009.5206848">10.1109/CVPR.2009.5206848</a>. </div> </td> </tr> <tr class="document-box" id="b61"> <td valign="top" class="td1"> [61] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Dean J, Corrado G S, Monga R, Chen K, Devin M, Le Q V, Mao M Z, Ranzato M A, Senior A, Tucker P, Yang K, Ng A Y. Large scale distributed deep networks. In <i>Proc</i>. <i>the 25th Int. Conf. Neural Information Processing Systems</i>, Dec. 2012, pp.1223-1231. </div> </td> </tr> <tr class="document-box" id="b62"> <td valign="top" class="td1"> [62] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Abadi M, Barham P, Chen J et al. TensorFlow: A system for large-scale machine learning. In <i>Proc</i>. <i>the 12th USENIX Conference on Operating Systems Design and Implementation</i>, Nov. 2016, pp.265-283. 
</div> </td> </tr> <tr class="document-box" id="b63"> <td valign="top" class="td1"> [63] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Awan A A, Bédorf J, Chu C H et al. Scalable distributed DNN training using TensorFlow and CUDA-aware MPI: Characterization, designs, and performance evaluation. In <i>Proc</i>. <i>the 19th IEEE/ACM Int. Symp. Cluster</i>, <i>Cloud and Grid Computing</i> (<i>CCGRID</i>), May 2019, pp.498-507. DOI: <a href="https://doi.org/10.1109/CCGRID.2019.00064">10.1109/CCGRID.2019.00064</a>. </div> </td> </tr> <tr class="document-box" id="b64"> <td valign="top" class="td1"> [64] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Biswas R, Lu X Y, Panda D K. Designing a micro-benchmark suite to evaluate gRPC for TensorFlow: Early experiences. In <i>Proc</i>. <i>the 9th Workshop on Big Data Benchmarks</i>, <i>Performance Optimization</i>, <i>and Emerging Hardware</i>, Mar. 2018. </div> </td> </tr> <tr class="document-box" id="b65"> <td valign="top" class="td1"> [65] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Biswas R, Lu X Y, Panda D K. Accelerating TensorFlow with adaptive RDMA-based gRPC. In <i>Proc</i>. <i>the 25th IEEE Int. Conf. High Performance Computing</i> (<i>HiPC</i>), Dec. 2018, pp.2-11. DOI: <a href="https://doi.org/10.1109/HiPC.2018.00010">10.1109/HiPC.2018.00010</a>. </div> </td> </tr> <tr class="document-box" id="b66"> <td valign="top" class="td1"> [66] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Jain A, Awan A A, Subramoni H, Panda D K. Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for high-performance deep learning on Frontera. In <i>Proc</i>. <i>the 3rd IEEE/ACM Workshop on Deep Learning on Supercomputers</i> (<i>DLS</i>), Nov. 2019, pp.76-83. DOI: <a href="https://doi.org/10.1109/DLS49591.2019.00015">10.1109/DLS49591.2019.00015</a>. </div> </td> </tr> <tr class="document-box" id="b67"> <td valign="top" class="td1"> [67] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Zhang Z, Zheng S, Wang Y S <i>et al</i>. MiCS: Near-linear scaling for training gigantic model on public cloud. <i>Proceedings of the VLDB Endowment</i>, 2022, 16(1): 37–50. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.14778/3561261.3561265" target="_blank">10.14778/3561261.3561265</a>. </div> </td> </tr> <tr class="document-box" id="b68"> <td valign="top" class="td1"> [68] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Rajbhandari S, Rasley J, Ruwase O, He Y X. ZeRO: Memory optimizations toward training trillion parameter models. In <i>Proc</i>. <i>the 2020 SC</i>, Nov. 2020. </div> </td> </tr> <tr class="document-box" id="b69"> <td valign="top" class="td1"> [69] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Jia Y, Shelhamer E, Donahue J et al. Caffe: Convolutional architecture for fast feature embedding. In <i>Proc</i>. <i>the 22nd ACM International Conference on Multimedia</i>, Nov. 2014, pp.675-678. DOI: <a href="https://doi.org/10.1145/2647868.2654889">10.1145/2647868.2654889</a>. </div> </td> </tr> <tr class="document-box" id="b70"> <td valign="top" class="td1"> [70] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Seide F, Agarwal A. CNTK: Microsoft’s open-source deep-learning toolkit. In <i>Proc</i>. <i>the 22nd ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining</i>, Aug. 
2016, p.2135. DOI: <a href="https://doi.org/10.1145/2939672.2945397">10.1145/2939672.2945397</a>. </div> </td> </tr> <tr class="document-box" id="b71"> <td valign="top" class="td1"> [71] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Chen T Q, Li M, Li Y T, Lin M, Wang N Y, Wang M J, Xiao T J, Xu B, Zhang C Y, Zhang Z. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv: 1512.01274, 2015. <a href="https://arxiv.org/abs/1512.01274">https://arxiv.org/abs/1512.01274</a>, Jan. 2023. </div> </td> </tr> <tr class="document-box" id="b72"> <td valign="top" class="td1"> [72] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Lin L X, Qiu S H, Yu Z Q, You L, Long X, Sun X Y, Xu J, Wang Z. AIACC-training: Optimizing distributed deep learning training through multi-streamed and concurrent gradient communications. In <i>Proc</i>. <i>the 42nd IEEE Int. Conf. Distributed Computing Systems</i>, Jul. 2022, pp.853-863. DOI: <a href="https://doi.org/10.1109/ICDCS54860.2022.00087">10.1109/ICDCS54860.2022.00087</a>. </div> </td> </tr> <tr class="document-box" id="b73"> <td valign="top" class="td1"> [73] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Cowan M, Maleki S, Musuvathi M et al. MSCCL: Microsoft collective communication library. arXiv: 2201.11840, 2022. <a href="https://arxiv.org/abs/2201.11840v1">https://arxiv.org/abs/2201.11840v1</a>, Jan. 2023. </div> </td> </tr> <tr class="document-box" id="b74"> <td valign="top" class="td1"> [74] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Shah A, Chidambaram V, Cowan M et al. TACCL: Guiding collective algorithm synthesis using communication sketches. In <i>Proc</i>. <i>the 2023 USENIX Symposium on Networked Systems Design and Implementation</i>, April 2023. </div> </td> </tr> <tr class="document-box" id="b75"> <td valign="top" class="td1"> [75] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Cai Z X, Liu Z Y, Maleki S, Musuvathi M, Mytkowicz T, Nelson J, Saarikivi O. Synthesizing optimal collective algorithms. In <i>Proc</i>. <i>the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming</i>, Feb. 2021, pp.62-75. DOI: <a href="https://doi.org/10.1145/3437801.3441620">10.1145/3437801.3441620</a>. </div> </td> </tr> <tr class="document-box" id="b76"> <td valign="top" class="td1"> [76] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Panda D K, Tomko K, Schulz K, Majumdar A. The MVAPICH project: Evolution and sustainability of an open source production quality MPI library for HPC. In <i>Proc</i>. <i>the Workshop on Sustainable Software for Science</i>: <i>Practice and Experiences</i>, Nov. 2013. </div> </td> </tr> <tr class="document-box" id="b77"> <td valign="top" class="td1"> [77] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Wang G H, Venkataraman S, Phanishayee A et al. Blink: Fast and generic collectives for distributed ML. In <i>Proc</i>. <i>the 2020 Machine Learning and Systems 2</i>, Mar. 2020, pp.172-186. </div> </td> </tr> <tr class="document-box" id="b78"> <td valign="top" class="td1"> [78] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Zhang Z, Chang C K, Lin H B et al. Is network the bottleneck of distributed training? In <i>Proc</i>. <i>the 2020 Workshop on Network Meets AI & ML</i>, Aug. 2020, pp.8-13. 
DOI: <a href="https://doi.org/10.1145/3405671.3405810">10.1145/3405671.3405810</a>. </div> </td> </tr> <tr class="document-box" id="b79"> <td valign="top" class="td1"> [79] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Wickramasinghe U, Lumsdaine A. A survey of methods for collective communication optimization and tuning. arXiv: 1611.06334, 2016. <a href="https://arxiv.org/abs/1611.06334">https://arxiv.org/abs/1611.06334</a>, Jan. 2023. </div> </td> </tr> <tr class="document-box" id="b80"> <td valign="top" class="td1"> [80] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Chan E N, Heimlich M, Purkayastha A, van de Geijn R. Collective communication: Theory, practice, and experience. <i>Concurrency and Computation: Practice and Experience</i>, 2007, 19(13): 1749–1783. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1002/cpe.1206" target="_blank">10.1002/cpe.1206</a>. </div> </td> </tr> <tr class="document-box" id="b81"> <td valign="top" class="td1"> [81] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Pješivac-Grbović J, Angskun T, Bosilca G, Fagg G E, Gabriel E, Dongarra J J. Performance analysis of MPI collective operations. <i>Cluster Computing</i>, 2007, 10(2): 127–143. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1007/s10586-007-0012-0" target="_blank">10.1007/s10586-007-0012-0</a>. </div> </td> </tr> <tr class="document-box" id="b82"> <td valign="top" class="td1"> [82] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Vadhiyar S S, Fagg G E, Dongarra J. Automatically tuned collective communications. In <i>Proc</i>. <i>the 2000 ACM/IEEE Conference on Supercomputing</i>, Nov. 2000. DOI: <a href="https://doi.org/10.1109/SC.2000.10024">10.1109/SC.2000.10024</a>. </div> </td> </tr> <tr class="document-box" id="b83"> <td valign="top" class="td1"> [83] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Verbraeken J, Wolting M, Katzy J <i>et al</i>. A survey on distributed machine learning. <i>ACM Computing Surveys</i>, 2020, 53(2): 30. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1145/3377454" target="_blank">10.1145/3377454</a>. </div> </td> </tr> <tr class="document-box" id="b84"> <td valign="top" class="td1"> [84] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Wang M, Fu W J, He X N, Hao S J, Wu X D. A survey on large-scale machine learning. <i>IEEE Trans. Knowledge and Data Engineering</i>, 2022, 34(6): 2574–2594. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1109/TKDE.2020.3015777" target="_blank">10.1109/TKDE.2020.3015777</a>. </div> </td> </tr> <tr class="document-box" id="b85"> <td valign="top" class="td1"> [85] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Ben-Nun T, Hoefler T. Demystifying parallel and distributed deep learning: An in-depth concurrency analysis. <i>ACM Computing Surveys</i>, 2019, 52(4): Article No. 65. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1145/3320060" target="_blank">10.1145/3320060</a>. </div> </td> </tr> <tr class="document-box" id="b86"> <td valign="top" class="td1"> [86] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Mayer R, Jacobsen H A. Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools. <i>ACM Computing Surveys</i>, 2020, 53(1): Article No. 3. 
DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1145/3363554" target="_blank">10.1145/3363554</a>. </div> </td> </tr> <tr class="document-box" id="b87"> <td valign="top" class="td1"> [87] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Ouyang S, Dong D Z, Xu Y M, Xiao L Q. Communication optimization strategies for distributed deep neural network training: A survey. <i>Journal of Parallel and Distributed Computing</i>, 2021, 149: 52–65. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1016/j.jpdc.2020.11.005" target="_blank">10.1016/j.jpdc.2020.11.005</a>. </div> </td> </tr> <tr class="document-box" id="b88"> <td valign="top" class="td1"> [88] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Lee S, Purushwalkam S, Cogswell M, Crandall D, Batra D. Why M heads are better than one: Training a diverse ensemble of deep networks. arXiv: 1511.06314, 2015. <a href="https://arxiv.org/abs/1511.06314">https://arxiv.org/abs/1511.06314</a>, Jan. 2023. </div> </td> </tr> <tr class="document-box" id="b89"> <td valign="top" class="td1"> [89] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. In <i>Proc</i>. <i>the 25th International Conference on Neural Information Processing Systems</i>, Dec. 2012, pp.1097-1105. </div> </td> </tr> <tr class="document-box" id="b90"> <td valign="top" class="td1"> [90] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Szegedy C, Liu W, Jia Y Q, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In <i>Proc</i>. <i>the 2015 </i><i>CVPR</i>, Jun. 2015. DOI: <a href="https://doi.org/10.1109/CVPR.2015.7298594">10.1109/CVPR.2015.7298594</a>. </div> </td> </tr> <tr class="document-box" id="b91"> <td valign="top" class="td1"> [91] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> He K M, Zhang X Y, Ren S Q, Sun J. Deep residual learning for image recognition. In <i>Proc</i>. <i>the 2016 </i><i>CVPR</i>, Jun. 2016, pp.770-778. DOI: <a href="https://doi.org/10.1109/CVPR.2016.90">10.1109/CVPR.2016.90</a>. </div> </td> </tr> <tr class="document-box" id="b92"> <td valign="top" class="td1"> [92] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Shi S H, Wang Q, Chu X W. Performance modeling and evaluation of distributed deep learning frameworks on GPUs. In <i>Proc. the </i><i>DASC/PiCom/DataCom/CyberSciTech</i>, Aug. 2018, pp.949-957. DOI: <a href="https://doi.org/10.1109/DASC/PiCom/DataCom/CyberSciTec.2018.000-4">10.1109/DASC/PiCom/DataCom/CyberSciTec.2018.000-4</a>. </div> </td> </tr> <tr class="document-box" id="b93"> <td valign="top" class="td1"> [93] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Hoefler T, Moor D. Energy, memory, and runtime tradeoffs for implementing collective communication operations. <i>Supercomputing Frontiers and Innovations</i>, 2014, 1(2): 58–75. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.14529/jsfi140204" target="_blank">10.14529/jsfi140204</a>. </div> </td> </tr> </tbody> </table>
