<table class="reference-tab" style="background-color:#FFFFFF;width:914.104px;color:#333333;font-family:Calibri, Arial, 微软雅黑, "font-size:16px;">
<tbody>
<tr class="document-box" id="b1">
<td valign="top" class="td1">
[1]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Hwang K, Xu Z W. Scalable Parallel Computing: Technology, Architecture, Programming. McGraw-Hill, 1998.
</div>
</td>
</tr>
<tr class="document-box" id="b2">
<td valign="top" class="td1">
[2]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Brown T B, Mann B, Ryder N et al. Language models are few-shot learners. In <i>Proc</i>. <i>the 34th Int. Conf. Neural Information Processing Systems</i>, Dec. 2020, pp.1877-1901.
</div>
</td>
</tr>
<tr class="document-box" id="b3">
<td valign="top" class="td1">
[3]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Naumov M, Mudigere D, Shi H J M et al. Deep learning recommendation model for personalization and recommendation systems. arXiv: 1906.00091, 2019. <a href="https://arxiv.org/abs/1906.00091">https://arxiv.org/abs/1906.00091</a>, Jan. 2023.
</div>
</td>
</tr>
<tr class="document-box" id="b4">
<td valign="top" class="td1">
[4]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Bayatpour M, Chakraborty S, Subramoni H, Lu X Y, Panda D K. Scalable reduction collectives with data partitioning-based multi-leader design. In <i>Proc</i>. <i>the 2017 Int. Conf. High Performance Computing</i>, <i>Networking</i>, <i>Storage and Analysis</i> (<i>SC</i>), Nov. 2017. DOI: <a href="https://doi.org/10.1145/3126908.3126954">10.1145/3126908.3126954</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b5">
<td valign="top" class="td1">
[5]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Chu C H, Lu X Y, Awan A A, Subramoni H, Hashmi J, Elton B, Panda D K. Efficient and scalable multi-source streaming broadcast on GPU clusters for deep learning. In <i>Proc</i>. <i>the 46th Int. Conf. Parallel Processing</i> (<i>ICPP</i>), Aug. 2017, pp.161-170. DOI: <a href="https://doi.org/10.1109/ICPP.2017.25">10.1109/ICPP.2017.25</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b6">
<td valign="top" class="td1">
[6]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Panda D K, Lu X Y, Shankar D. High-Performance Big Data Computing. The MIT Press, 2022.
</div>
</td>
</tr>
<tr class="document-box" id="b7">
<td valign="top" class="td1">
[7]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Lu X Y, Islam N S, Wasi-Ur-Rahman et al. High-performance design of Hadoop RPC with RDMA over InfiniBand. In <i>Proc</i>. <i>the 42nd ICPP</i>, Oct. 2013, pp.641-650. DOI: <a href="https://doi.org/10.1109/ICPP.2013.78">10.1109/ICPP.2013.78</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b8">
<td valign="top" class="td1">
[8]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Wasi-Ur-Rahman, Lu X Y, Islam N S, Panda D K. HOMR: A hybrid approach to exploit maximum overlapping in MapReduce over high performance interconnects. In <i>Proc</i>. <i>the 28th ACM Int. Conf. Supercomputing</i> (<i>ICS</i>), Jun. 2014, pp.33-42. DOI: <a href="https://doi.org/10.1145/2597652.2597684">10.1145/2597652.2597684</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b9">
<td valign="top" class="td1">
[9]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Islam N S, Lu X Y, Wasi-Ur-Rahman, Panda D K. SOR-HDFS: A SEDA-based approach to maximize overlapping in RDMA-enhanced HDFS. In <i>Proc</i>. <i>the 23rd Int. Symp. High-Performance Parallel and Distributed Computing</i>, Jun. 2014, pp.261-264. DOI: <a href="https://doi.org/10.1145/2600212.2600715">10.1145/2600212.2600715</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b10">
<td valign="top" class="td1">
[10]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Lu X Y, Shankar D, Gugnani S, Panda D K. High-performance design of Apache Spark with RDMA and its benefits on various workloads. In <i>Proc</i>. <i>the 2016 IEEE Int. Conf. Big Data</i>, Dec. 2016, pp.253-262. DOI: <a href="https://doi.org/10.1109/BigData.2016.7840611">10.1109/BigData.2016.7840611</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b11">
<td valign="top" class="td1">
[11]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Kalia A, Kaminsky M, Andersen D G. Using RDMA efficiently for key-value services. In <i>Proc</i>. <i>the 2014 ACM Conference on SIGCOMM</i>, Aug. 2014, pp.295-306. DOI: <a href="https://doi.org/10.1145/2619239.2626299">10.1145/2619239.2626299</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b12">
<td valign="top" class="td1">
[12]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Shankar D, Lu X Y, Panda D K. SCOR-KV: SIMD-aware client-centric and optimistic RDMA-based key-value store for emerging CPU architectures. In <i>Proc</i>. <i>the 26th IEEE Int. Conf. High Performance Computing</i> (<i>HiPC</i>), Dec. 2019, pp.257-266. DOI: <a href="https://doi.org/10.1109/HiPC.2019.00040">10.1109/HiPC.2019.00040</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b13">
<td valign="top" class="td1">
[13]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Dragojević A, Narayanan D, Hodson O, Castro M. FaRM: Fast remote memory. In <i>Proc</i>. <i>the 11th USENIX Symposium on Networked Systems Design and Implementation</i>, Apr. 2014, pp.401-414.
</div>
</td>
</tr>
<tr class="document-box" id="b14">
<td valign="top" class="td1">
[14]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Shankar D, Lu X Y, Islam N S, Wasi-Ur-Rahman, Panda D K. High-performance hybrid key-value store on modern clusters with RDMA interconnects and SSDs: Non-blocking extensions, designs, and benefits. In <i>Proc</i>. <i>the 2016 IEEE International Parallel and Distributed Processing Symposium</i> (<i>IPDPS</i>), May 2016, pp.393-402. DOI: <a href="https://doi.org/10.1109/IPDPS.2016.112">10.1109/IPDPS.2016.112</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b15">
<td valign="top" class="td1">
[15]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Gugnani S, Lu X Y, Panda D K. Swift-X: Accelerating OpenStack swift with RDMA for building an efficient HPC cloud. In <i>Proc</i>. <i>the 17th IEEE/ACM International Symposium on Cluster</i>, <i>Cloud and Grid Computing</i>, May 2017, pp.238-247. DOI: <a href="https://doi.org/10.1109/CCGRID.2017.103">10.1109/CCGRID.2017.103</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b16">
<td valign="top" class="td1">
[16]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Gugnani S, Lu X Y, Panda D K. Designing virtualization-aware and automatic topology detection schemes for accelerating Hadoop on SR-IOV-enabled clouds. In <i>Proc</i>. <i>the 2016 IEEE Int. Conf. Cloud Computing Technology and Science</i>, Dec. 2016, pp.152-159. DOI: <a href="https://doi.org/10.1109/CloudCom.2016.0037">10.1109/CloudCom.2016.0037</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b17">
<td valign="top" class="td1">
[17]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Zhang J, Lu X Y, Panda D K. Designing locality and NUMA aware MPI runtime for nested virtualization based HPC cloud with SR-IOV enabled InfiniBand. In <i>Proc</i>. <i>the 13th ACM SIGPLAN/SIGOPS Int. Conf. Virtual Execution Environments</i>, Apr. 2017, pp.187-200. DOI: <a href="https://doi.org/10.1145/3050748.3050765">10.1145/3050748.3050765</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b18">
<td valign="top" class="td1">
[18]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Chu C H, Lu X Y, Awan A A <i>et al</i>. Exploiting hardware multicast and GPUDirect RDMA for efficient broadcast. <i>IEEE Trans. Parallel and Distributed Systems</i>, 2019, 30(3): 575–588. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1109/TPDS.2018.2867222" target="_blank">10.1109/TPDS.2018.2867222</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b19">
<td valign="top" class="td1">
[19]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Zhang J, Lu X Y, Chu C H, Panda D K. C-GDR: High-performance container-aware GPUDirect MPI communication schemes on RDMA networks. In <i>Proc</i>. <i>the 2019 IPDPS</i>, May 2019, pp.242-251. DOI: <a href="https://doi.org/10.1109/IPDPS.2019.00034">10.1109/IPDPS.2019.00034</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b20">
<td valign="top" class="td1">
[20]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Li Y K, Qi H, Lu G, Jin F, Guo Y F, Lu X Y. Understanding hot interconnects with an extensive benchmark survey. <i>BenchCouncil Trans. Benchmarks, Standards and Evaluations</i>, 2022, 2(3): 100074. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1016/J.TBENCH.2022.100074" target="_blank">10.1016/J.TBENCH.2022.100074</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b21">
<td valign="top" class="td1">
[21]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Pacheco P. An Introduction to Parallel Programming. Elsevier, 2011. DOI: <a href="https://doi.org/10.1016/C2009-0-18471-4">10.1016/C2009-0-18471-4</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b22">
<td valign="top" class="td1">
[22]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Gong Y F, He B S, Zhong J L. Network performance aware MPI collective communication operations in the cloud. <i>IEEE Trans. Parallel and Distributed Systems</i>, 2015, 26(11): 3079–3089. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1109/TPDS.2013.96" target="_blank">10.1109/TPDS.2013.96</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b23">
<td valign="top" class="td1">
[23]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Brown K A, Domke J, Matsuoka S. Hardware-centric analysis of network performance for MPI applications. In <i>Proc</i>. <i>the 21st IEEE Int. Conf. Parallel and Distributed Systems</i> (<i>ICPADS</i>), Dec. 2015, pp.692-699. DOI: <a href="https://doi.org/10.1109/ICPADS.2015.92">10.1109/ICPADS.2015.92</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b24">
<td valign="top" class="td1">
[24]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Katseff H P. Incomplete hypercubes. <i>IEEE Trans. Computers</i>, 1988, 37(5): 604–608. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1109/12.4611" target="_blank">10.1109/12.4611</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b25">
<td valign="top" class="td1">
[25]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Kalb J L, Lee D S. Network topology analysis. Technical Report SAND2008-0069. Sandia National Laboratories, Albuquerque, New Mexico, 2008. <a href="https://digital.library.unt.edu/ark:/67531/metadc845229/m2/1/high_res_d/1028919.pdf">https://digital.library.unt.edu/ark:/67531/metadc845229/m2/1/high_res_d/1028919.pdf</a>, Jan. 2023.
</div>
</td>
</tr>
<tr class="document-box" id="b26">
<td valign="top" class="td1">
[26]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Kim J, Kim H. Router microarchitecture and scalability of ring topology in on-chip networks. In <i>Proc</i>. <i>the 2nd Int. Workshop on Network on Chip Architectures</i>, Dec. 2009, pp.5-10. DOI: <a href="https://doi.org/10.1145/1645213.1645217">10.1145/1645213.1645217</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b27">
<td valign="top" class="td1">
[27]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Bouknight W J, Denenberg S A, McIntyre D E, Randall J M, Sameh A H, Slotnick D L. The Illiac IV system. <i>Proceedings of the IEEE</i>, 1972, 60(4): 369–388. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1109/PROC.1972.8647" target="_blank">10.1109/PROC.1972.8647</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b28">
<td valign="top" class="td1">
[28]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Cheng S H, Zhong W, Isaacs K E, Mueller K. Visualizing the topology and data traffic of multi-dimensional torus interconnect networks. <i>IEEE Access</i>, 2018, 6: 57191–57204. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1109/ACCESS.2018.2872344" target="_blank">10.1109/ACCESS.2018.2872344</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b29">
<td valign="top" class="td1">
[29]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Romanov A Y, Amerikanov A A, Lezhnev E V. Analysis of approaches for synthesis of networks-on-chip by using circulant topologies. <i>Journal of Physics: Conference Series</i>, 2018, 1050(1): 012071. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1088/1742-6596/1050/1/012071" target="_blank">10.1088/1742-6596/1050/1/012071</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b30">
<td valign="top" class="td1">
[30]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Ravankar A A, Sedukhin S G. Mesh-of-Tori: A novel interconnection network for frontal plane cellular processors. In <i>Proc</i>. <i>the 1st Int. Conf. Networking and Computing</i>, Nov. 2010, pp.281-284. DOI: <a href="https://doi.org/10.1109/IC-NC.2010.30">10.1109/IC-NC.2010.30</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b31">
<td valign="top" class="td1">
[31]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Pham P H, Mau P, Kim C. A 64-PE folded-torus intra-chip communication fabric for guaranteed throughput in network-on-chip based applications. In <i>Proc</i>. <i>the 2009 IEEE Custom Integrated Circuits Conference</i>, Sept. 2009, pp.645-648. DOI: <a href="https://doi.org/10.1109/CICC.2009.5280748">10.1109/CICC.2009.5280748</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b32">
<td valign="top" class="td1">
[32]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Al-Fares M, Loukissas A, Vahdat A. A scalable, commodity data center network architecture. <i>ACM SIGCOMM Computer Communication Review</i>, 2008, 38(4): 63–74. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1145/1402946.1402967" target="_blank">10.1145/1402946.1402967</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b33">
<td valign="top" class="td1">
[33]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Leiserson C E, Abuhamdeh Z S, Douglas D C et al. The network architecture of the Connection Machine CM-5 (extended abstract). In <i>Proc</i>. <i>the 4th Annual ACM Symposium on Parallel Algorithms and Architectures</i>, Jun. 1992, pp.272-285. DOI: <a href="https://doi.org/10.1145/140901.141883">10.1145/140901.141883</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b34">
<td valign="top" class="td1">
[34]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Valerio M, Moser L E, Melliar-Smith P M. Recursively scalable fat-trees as interconnection networks. In <i>Proc</i>. <i>the 13th IEEE Annual International Phoenix Conference on Computers and Communications</i>, Apr. 1994. DOI: <a href="https://doi.org/10.1109/PCCC.1994.504091">10.1109/PCCC.1994.504091</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b35">
<td valign="top" class="td1">
[35]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Nienaber W. Effective routing on fat-tree topologies [Ph.D. Thesis]. Florida State University, Tallahassee, 2014.
</div>
</td>
</tr>
<tr class="document-box" id="b36">
<td valign="top" class="td1">
[36]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Prisacari B, Rodriguez G, Minkenberg C, Hoefler T. Bandwidth-optimal all-to-all exchanges in fat tree networks. In <i>Proc</i>. <i>the 27th ICS</i>, Jun. 2013, pp.139-148. DOI: <a href="https://doi.org/10.1145/2464996.2465434">10.1145/2464996.2465434</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b37">
<td valign="top" class="td1">
[37]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Li Y, Pan D. OpenFlow based load balancing for fat-tree networks with multipath support. In <i>Proc</i>. <i>the 12th IEEE International Conference on Communications</i>, Jun. 2013.
</div>
</td>
</tr>
<tr class="document-box" id="b38">
<td valign="top" class="td1">
[38]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Kim J, Dally W J, Scott S, Abts D. Technology-driven, highly-scalable dragonfly topology. In <i>Proc</i>. <i>the 2008 International Symposium on Computer Architecture</i>, Jun. 2008, pp.77-88. DOI: <a href="https://doi.org/10.1109/ISCA.2008.19">10.1109/ISCA.2008.19</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b39">
<td valign="top" class="td1">
[39]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Teh M Y, Wilke J J, Bergman K, Rumley S. Design space exploration of the dragonfly topology. In <i>Lecture Notes in Computer Science 10524</i>, Kunkel J, Yokota R, Taufer M et al. (eds.), Springer, pp.57-74. DOI: <a href="https://doi.org/10.1007/978-3-319-67630-2_5">10.1007/978-3-319-67630-2_5</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b40">
<td valign="top" class="td1">
[40]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Prisacari B, Rodriguez G, Garcia M, Vallejo E, Beivide R, Minkenberg C. Performance implications of remote-only load balancing under adversarial traffic in dragonflies. In <i>Proc</i>. <i>the 8th International Workshop on Interconnection Network Architecture</i>: <i>On-Chip</i>, <i>Multi-Chip</i>, Jan. 2014. DOI: <a href="https://doi.org/10.1145/2556857.2556860">10.1145/2556857.2556860</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b41">
<td valign="top" class="td1">
[41]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Shpiner A, Haramaty Z, Eliad S, Zdornov V, Gafni B, Zahavi E. Dragonfly+: Low cost topology for scaling datacenters. In <i>Proc</i>. <i>the 3rd IEEE International Workshop on High-Performance Interconnection Networks in the Exascale and Big-Data Era</i> (<i>HiPINEB</i>), Feb. 2017. DOI: <a href="https://doi.org/10.1109/HiPINEB.2017.11">10.1109/HiPINEB.2017.11</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b42">
<td valign="top" class="td1">
[42]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Bruck J, Ho C T, Kipnis S, Weathersby D. Efficient algorithms for all-to-all communications in multi-port message-passing systems. In <i>Proc</i>. <i>the 6th Annual ACM Symposium on Parallel Algorithms and Architectures</i>, Aug. 1994, pp.298-309. DOI: <a href="https://doi.org/10.1145/181014.181756">10.1145/181014.181756</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b43">
<td valign="top" class="td1">
[43]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Thakur R, Rabenseifner R, Gropp W. Optimization of collective communication operations in MPICH. <i>The International Journal of High Performance Computing Applications</i>, 2005, 19(1): 49–66. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1177/1094342005051521" target="_blank">10.1177/1094342005051521</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b44">
<td valign="top" class="td1">
[44]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Pjesivac-Grbovic J. Towards automatic and adaptive optimizations of MPI collective operations [Ph.D. Thesis]. University of Tennessee, Knoxville, 2007.
</div>
</td>
</tr>
<tr class="document-box" id="b45">
<td valign="top" class="td1">
[45]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Huse L P. Collective communication on dedicated clusters of workstations. In <i>Proc</i>. <i>the 6th European PVM/MPI Users’ Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface</i>, Sept. 1999, pp.469-476. DOI: <a href="https://doi.org/10.1007/3-540-48158-3_58">10.1007/3-540-48158-3_58</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b46">
<td valign="top" class="td1">
[46]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Barnett M, Shuler L, van de Geijn R, Gupta S, Payne D G, Watts J. Interprocessor collective communication library (InterCom). In <i>Proc</i>. <i>the IEEE Scalable High Performance Computing Conference</i>, May 1994, pp.357-364. DOI: <a href="https://doi.org/10.1109/SHPCC.1994.296665">10.1109/SHPCC.1994.296665</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b47">
<td valign="top" class="td1">
[47]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Shroff M, van de Geijn R A. CollMark: MPI collective communication benchmark. In <i>Proc</i>. <i>the 2000 ICS</i>, June 29–July 2, 2000.
</div>
</td>
</tr>
<tr class="document-box" id="b48">
<td valign="top" class="td1">
[48]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Rabenseifner R. Optimization of collective reduction operations. In <i>Proc</i>. <i>the 4th Int. Conf. Computational Science</i>, Jun. 2004. DOI: <a href="https://doi.org/10.1007/978-3-540-24685-5_1">10.1007/978-3-540-24685-5_1</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b49">
<td valign="top" class="td1">
[49]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Dong J B, Wang S C, Feng F <i>et al</i>. ACCL: Architecting highly scalable distributed training systems with highly efficient collective communication library. <i>IEEE Micro</i>, 2021, 41(5): 85–92. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1109/MM.2021.3091475" target="_blank">10.1109/MM.2021.3091475</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b50">
<td valign="top" class="td1">
[50]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Hockney R W. The communication challenge for MPP: Intel Paragon and Meiko CS-2. <i>Parallel Computing</i>, 1994, 20(3): 389–398. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1016/S0167-8191(06)80021-9" target="_blank">10.1016/S0167-8191(06)80021-9</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b51">
<td valign="top" class="td1">
[51]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Benson G D, Chu C W, Huang Q, Caglar S G. A comparison of MPICH allgather algorithms on switched networks. In <i>Proc</i>. <i>the 10th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface</i>, Oct. 2003, pp.335-343. DOI: <a href="https://doi.org/10.1007/978-3-540-39924-7_47">10.1007/978-3-540-39924-7_47</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b52">
<td valign="top" class="td1">
[52]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Almási G, Heidelberger P, Archer C J et al. Optimization of MPI collective communication on BlueGene/L systems. In <i>Proc</i>. <i>the 19th ICS</i>, Jun. 2005, pp.253-262. DOI: <a href="https://doi.org/10.1145/1088149.1088183">10.1145/1088149.1088183</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b53">
<td valign="top" class="td1">
[53]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Sergeev A, Del Balso M. Horovod: Fast and easy distributed deep learning in TensorFlow. arXiv: 1802.05799, 2018. <a href="https://arxiv.org/abs/1802.05799">https://arxiv.org/abs/1802.05799</a>, Jan. 2023.
</div>
</td>
</tr>
<tr class="document-box" id="b54">
<td valign="top" class="td1">
[54]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Goyal P, Dollár P, Girshick R et al. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv: 1706.02677, 2017. <a href="https://arxiv.org/abs/1706.02677">https://arxiv.org/abs/1706.02677</a>, Jan. 2023.
</div>
</td>
</tr>
<tr class="document-box" id="b55">
<td valign="top" class="td1">
[55]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Gupta U, Wu C, Wang X et al. The architectural implications of Facebook’s DNN-based personalized recommendation. In <i>Proc</i>. <i>the 2020 IEEE International Symposium on High Performance Computer Architecture</i> (<i>HPCA</i>), Feb. 2020, pp.488-501. DOI: <a href="https://doi.org/10.1109/HPCA47549.2020.00047">10.1109/HPCA47549.2020.00047</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b56">
<td valign="top" class="td1">
[56]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Mudigere D, Hao Y, Huang J et al. Software-hardware co-design for fast and scalable training of deep learning recommendation models. In <i>Proc</i>. <i>the 49th Annual International Symposium on Computer Architecture</i>, Jun. 2022, pp.993-1011. DOI: <a href="https://doi.org/10.1145/3470496.3533727">10.1145/3470496.3533727</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b57">
<td valign="top" class="td1">
[57]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Paszke A, Gross S, Massa F et al. PyTorch: An imperative style, high-performance deep learning library. In <i>Proc</i>. <i>the 33rd International Conference on Neural Information Processing Systems</i>, Dec. 2019.
</div>
</td>
</tr>
<tr class="document-box" id="b58">
<td valign="top" class="td1">
[58]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Khudia D, Huang J Y, Basu P, Deng S, Liu H, Park J, Smelyanskiy M. FBGEMM: Enabling high-performance low-precision deep learning inference. arXiv: 2101.05615, 2021. <a href="https://arxiv.org/abs/2101.05615">https://arxiv.org/abs/2101.05615</a>, Jan. 2023.
</div>
</td>
</tr>
<tr class="document-box" id="b59">
<td valign="top" class="td1">
[59]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
He K M, Zhang X Y, Ren S Q, Sun J. Deep residual learning for image recognition. In <i>Proc</i>. <i>the 2016 IEEE Conference on Computer Vision and Pattern Recognition</i> (<i>CVPR</i>), Jun. 2016, pp.770-778. DOI: <a href="https://doi.org/10.1109/CVPR.2016.90">10.1109/CVPR.2016.90</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b60">
<td valign="top" class="td1">
[60]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Deng J, Dong W, Socher R, Li L J, Li K, Li F F. ImageNet: A large-scale hierarchical image database. In <i>Proc</i>. <i>the 2009 CVPR</i>, Jun. 2009, pp.248-255. DOI: <a href="https://doi.org/10.1109/CVPR.2009.5206848">10.1109/CVPR.2009.5206848</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b61">
<td valign="top" class="td1">
[61]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Dean J, Corrado G S, Monga R, Chen K, Devin M, Le Q V, Mao M Z, Ranzato M A, Senior A, Tucker P, Yang K, Ng A Y. Large scale distributed deep networks. In <i>Proc</i>. <i>the 25th Int. Conf. Neural Information Processing Systems</i>, Dec. 2012, pp.1223-1231.
</div>
</td>
</tr>
<tr class="document-box" id="b62">
<td valign="top" class="td1">
[62]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Abadi M, Barham P, Chen J et al. TensorFlow: A system for large-scale machine learning. In <i>Proc</i>. <i>the 12th USENIX Conference on Operating Systems Design and Implementation</i>, Nov. 2016, pp.265-283.
</div>
</td>
</tr>
<tr class="document-box" id="b63">
<td valign="top" class="td1">
[63]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Awan A A, Bédorf J, Chu C H et al. Scalable distributed DNN training using TensorFlow and CUDA-aware MPI: Characterization, designs, and performance evaluation. In <i>Proc</i>. <i>the 19th IEEE/ACM Int. Symp. Cluster</i>, <i>Cloud and Grid Computing</i> (<i>CCGRID</i>), May 2019, pp.498-507. DOI: <a href="https://doi.org/10.1109/CCGRID.2019.00064">10.1109/CCGRID.2019.00064</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b64">
<td valign="top" class="td1">
[64]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Biswas R, Lu X Y, Panda D K. Designing a micro-benchmark suite to evaluate gRPC for TensorFlow: Early experiences. In <i>Proc</i>. <i>the 9th Workshop on Big Data Benchmarks</i>, <i>Performance Optimization</i>, <i>and Emerging Hardware</i>, Mar. 2018.
</div>
</td>
</tr>
<tr class="document-box" id="b65">
<td valign="top" class="td1">
[65]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Biswas R, Lu X Y, Panda D K. Accelerating TensorFlow with adaptive RDMA-based gRPC. In <i>Proc</i>. <i>the 25th IEEE Int. Conf. High Performance Computing</i> (<i>HiPC</i>), Dec. 2018, pp.2-11. DOI: <a href="https://doi.org/10.1109/HiPC.2018.00010">10.1109/HiPC.2018.00010</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b66">
<td valign="top" class="td1">
[66]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Jain A, Awan A A, Subramoni H, Panda D K. Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for high-performance deep learning on Frontera. In <i>Proc</i>. <i>the 3rd IEEE/ACM Workshop on Deep Learning on Supercomputers</i> (<i>DLS</i>), Nov. 2019, pp.76-83. DOI: <a href="https://doi.org/10.1109/DLS49591.2019.00015">10.1109/DLS49591.2019.00015</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b67">
<td valign="top" class="td1">
[67]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Zhang Z, Zheng S, Wang Y S <i>et al</i>. MiCS: Near-linear scaling for training gigantic model on public cloud. <i>Proceedings of the VLDB Endowment</i>, 2022, 16(1): 37–50. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.14778/3561261.3561265" target="_blank">10.14778/3561261.3561265</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b68">
<td valign="top" class="td1">
[68]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Rajbhandari S, Rasley J, Ruwase O, He Y X. ZeRO: Memory optimizations toward training trillion parameter models. In <i>Proc</i>. <i>the 2020 SC</i>, Nov. 2020.
</div>
</td>
</tr>
<tr class="document-box" id="b69">
<td valign="top" class="td1">
[69]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Jia Y, Shelhamer E, Donahue J et al. Caffe: Convolutional architecture for fast feature embedding. In <i>Proc</i>. <i>the 22nd ACM International Conference on Multimedia</i>, Nov. 2014, pp.675-678. DOI: <a href="https://doi.org/10.1145/2647868.2654889">10.1145/2647868.2654889</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b70">
<td valign="top" class="td1">
[70]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Seide F, Agarwal A. CNTK: Microsoft’s open-source deep-learning toolkit. In <i>Proc</i>. <i>the 22nd ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining</i>, Aug. 2016, p.2135. DOI: <a href="https://doi.org/10.1145/2939672.2945397">10.1145/2939672.2945397</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b71">
<td valign="top" class="td1">
[71]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Chen T Q, Li M, Li Y T, Lin M, Wang N Y, Wang M J, Xiao T J, Xu B, Zhang C Y, Zhang Z. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv: 1512.01274, 2015. <a href="https://arxiv.org/abs/1512.01274">https://arxiv.org/abs/1512.01274</a>, Jan. 2023.
</div>
</td>
</tr>
<tr class="document-box" id="b72">
<td valign="top" class="td1">
[72]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Lin L X, Qiu S H, Yu Z Q, You L, Long X, Sun X Y, Xu J, Wang Z. AIACC-training: Optimizing distributed deep learning training through multi-streamed and concurrent gradient communications. In <i>Proc</i>. <i>the 42nd IEEE Int. Conf. Distributed Computing Systems</i>, Jul. 2022, pp.853-863. DOI: <a href="https://doi.org/10.1109/ICDCS54860.2022.00087">10.1109/ICDCS54860.2022.00087</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b73">
<td valign="top" class="td1">
[73]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Cowan M, Maleki S, Musuvathi M et al. MSCCL: Microsoft collective communication library. arXiv: 2201.11840, 2022. <a href="https://arxiv.org/abs/2201.11840v1">https://arxiv.org/abs/2201.11840v1</a>, Jan. 2023.
</div>
</td>
</tr>
<tr class="document-box" id="b74">
<td valign="top" class="td1">
[74]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Shah A, Chidambaram V, Cowan M et al. TACCL: Guiding collective algorithm synthesis using communication sketches. In <i>Proc</i>. <i>the 2023 USENIX Symposium on Networked Systems Design and Implementation</i>, Apr. 2023.
</div>
</td>
</tr>
<tr class="document-box" id="b75">
<td valign="top" class="td1">
[75]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Cai Z X, Liu Z Y, Maleki S, Musuvathi M, Mytkowicz T, Nelson J, Saarikivi O. Synthesizing optimal collective algorithms. In <i>Proc</i>. <i>the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming</i>, Feb. 2021, pp.62-75. DOI: <a href="https://doi.org/10.1145/3437801.3441620">10.1145/3437801.3441620</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b76">
<td valign="top" class="td1">
[76]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Panda D K, Tomko K, Schulz K, Majumdar A. The MVAPICH project: Evolution and sustainability of an open source production quality MPI library for HPC. In <i>Proc</i>. <i>the Workshop on Sustainable Software for Science</i>: <i>Practice and Experiences</i>, Nov. 2013.
</div>
</td>
</tr>
<tr class="document-box" id="b77">
<td valign="top" class="td1">
[77]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Wang G H, Venkataraman S, Phanishayee A et al. Blink: Fast and generic collectives for distributed ML. In <i>Proc</i>. <i>the 2020 Conference on Machine Learning and Systems</i> (<i>MLSys</i>), Mar. 2020, pp.172-186.
</div>
</td>
</tr>
<tr class="document-box" id="b78">
<td valign="top" class="td1">
[78]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Zhang Z, Chang C K, Lin H B et al. Is network the bottleneck of distributed training? In <i>Proc</i>. <i>the 2020 Workshop on Network Meets AI & ML</i>, Aug. 2020, pp.8-13. DOI: <a href="https://doi.org/10.1145/3405671.3405810">10.1145/3405671.3405810</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b79">
<td valign="top" class="td1">
[79]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Wickramasinghe U, Lumsdaine A. A survey of methods for collective communication optimization and tuning. arXiv: 1611.06334, 2016. <a href="https://arxiv.org/abs/1611.06334">https://arxiv.org/abs/1611.06334</a>, Jan. 2023.
</div>
</td>
</tr>
<tr class="document-box" id="b80">
<td valign="top" class="td1">
[80]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Chan E N, Heimlich M, Purkayastha A, van de Geijn R. Collective communication: Theory, practice, and experience. <i>Concurrency and Computation: Practice and Experience</i>, 2007, 19(13): 1749–1783. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1002/cpe.1206" target="_blank">10.1002/cpe.1206</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b81">
<td valign="top" class="td1">
[81]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Pješivac-Grbović J, Angskun T, Bosilca G, Fagg G E, Gabriel E, Dongarra J J. Performance analysis of MPI collective operations. <i>Cluster Computing</i>, 2007, 10(2): 127–143. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1007/s10586-007-0012-0" target="_blank">10.1007/s10586-007-0012-0</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b82">
<td valign="top" class="td1">
[82]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Vadhiyar S S, Fagg G E, Dongarra J. Automatically tuned collective communications. In <i>Proc</i>. <i>the 2000 ACM/IEEE Conference on Supercomputing</i>, Nov. 2000. DOI: <a href="https://doi.org/10.1109/SC.2000.10024">10.1109/SC.2000.10024</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b83">
<td valign="top" class="td1">
[83]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Verbraeken J, Wolting M, Katzy J <i>et al</i>. A survey on distributed machine learning. <i>ACM Computing Surveys</i>, 2020, 53(2): Article No. 30. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1145/3377454" target="_blank">10.1145/3377454</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b84">
<td valign="top" class="td1">
[84]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Wang M, Fu W J, He X N, Hao S J, Wu X D. A survey on large-scale machine learning. <i>IEEE Trans. Knowledge and Data Engineering</i>, 2022, 34(6): 2574–2594. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1109/TKDE.2020.3015777" target="_blank">10.1109/TKDE.2020.3015777</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b85">
<td valign="top" class="td1">
[85]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Ben-Nun T, Hoefler T. Demystifying parallel and distributed deep learning: An in-depth concurrency analysis. <i>ACM Computing Surveys</i>, 2019, 52(4): Article No. 65. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1145/3320060" target="_blank">10.1145/3320060</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b86">
<td valign="top" class="td1">
[86]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Mayer R, Jacobsen H A. Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools. <i>ACM Computing Surveys</i>, 2020, 53(1): Article No. 3. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1145/3363554" target="_blank">10.1145/3363554</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b87">
<td valign="top" class="td1">
[87]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Ouyang S, Dong D Z, Xu Y M, Xiao L Q. Communication optimization strategies for distributed deep neural network training: A survey. <i>Journal of Parallel and Distributed Computing</i>, 2021, 149: 52–65. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.1016/j.jpdc.2020.11.005" target="_blank">10.1016/j.jpdc.2020.11.005</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b88">
<td valign="top" class="td1">
[88]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Lee S, Purushwalkam S, Cogswell M, Crandall D, Batra D. Why M heads are better than one: Training a diverse ensemble of deep networks. arXiv: 1511.06314, 2015. <a href="https://arxiv.org/abs/1511.06314">https://arxiv.org/abs/1511.06314</a>, Jan. 2023.
</div>
</td>
</tr>
<tr class="document-box" id="b89">
<td valign="top" class="td1">
[89]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. In <i>Proc</i>. <i>the 25th International Conference on Neural Information Processing Systems</i>, Dec. 2012, pp.1097-1105.
</div>
</td>
</tr>
<tr class="document-box" id="b90">
<td valign="top" class="td1">
[90]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Szegedy C, Liu W, Jia Y Q, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In <i>Proc</i>. <i>the 2015 CVPR</i>, Jun. 2015. DOI: <a href="https://doi.org/10.1109/CVPR.2015.7298594">10.1109/CVPR.2015.7298594</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b91">
<td valign="top" class="td1">
[91]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
He K M, Zhang X Y, Ren S Q, Sun J. Deep residual learning for image recognition. In <i>Proc</i>. <i>the 2016 CVPR</i>, Jun. 2016, pp.770-778. DOI: <a href="https://doi.org/10.1109/CVPR.2016.90">10.1109/CVPR.2016.90</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b92">
<td valign="top" class="td1">
[92]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Shi S H, Wang Q, Chu X W. Performance modeling and evaluation of distributed deep learning frameworks on GPUs. In <i>Proc</i>. <i>the 2018 IEEE DASC/PiCom/DataCom/CyberSciTech</i>, Aug. 2018, pp.949-957. DOI: <a href="https://doi.org/10.1109/DASC/PiCom/DataCom/CyberSciTec.2018.000-4">10.1109/DASC/PiCom/DataCom/CyberSciTec.2018.000-4</a>.
</div>
</td>
</tr>
<tr class="document-box" id="b93">
<td valign="top" class="td1">
[93]
</td>
<td class="td2">
<div class="reference-en" style="margin:0px;padding:0px;">
Hoefler T, Moor D. Energy, memory, and runtime tradeoffs for implementing collective communication operations. <i>Supercomputing Frontiers and Innovations</i>, 2014, 1(2): 58–75. DOI: <a class="mainColor ref-doi" href="http://dx.doi.org/10.14529/jsfi140204" target="_blank">10.14529/jsfi140204</a>.
</div>
</td>
</tr>
</tbody>
</table>