Citation: Ji MT, Jin YB, Qian ZZ et al. Orchestrating in-network aggregation for distributed machine learning via in-band network telemetry. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY, 40(1): 196–214, Jan. 2025. DOI: 10.1007/s11390-024-3342-y
Distributed machine learning systems train models through iterative updates exchanged between parallel workers and a parameter server. To expedite these transmissions, programmable switches can aggregate updates in-network while forwarding packets, which reduces traffic over bottleneck links. However, existing in-network aggregation schemes do not select the most suitable switches for different worker distributions, and they fail to capture the dynamic network status. Using the status derived from in-band network telemetry, we select the best switches by solving an optimization problem that we formulate with the objective of minimizing transmission latency. Although the problem is a non-linear integer program, a series of careful transformations yields a substitute problem with totally unimodular constraints and a separable convex objective, which is then solved to obtain the integral optimum. We implement our in-network aggregation protocol and reconstruct the in-band network telemetry protocol on real devices, i.e., Barefoot Wedge100BF switches and Dell servers. We then evaluate our proposed AGG algorithm. The results indicate that the completion time of the related coflows decreases by 40% on average compared with other strategies, an improvement of at least 30% over the state of the art.
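The traffic reduction that in-network aggregation offers can be illustrated with a toy model. The following is a minimal sketch, not the AGG algorithm or the protocol described above: the function `link_loads`, the tree topology, and the node names (`ps`, `s0`, `w1`, ...) are all hypothetical, and the model assumes each worker sends exactly one update per iteration along a fixed path to the parameter server.

```python
from collections import defaultdict


def link_loads(parent, workers, root, aggregating=frozenset()):
    """Count update packets crossing each (node, next_hop) link per iteration.

    parent:      maps every non-root node to its next hop toward the root.
    workers:     nodes that each emit one update per iteration.
    root:        the parameter server.
    aggregating: switches that merge all incoming updates into one packet.
    """
    children = defaultdict(list)
    for node, next_hop in parent.items():
        children[next_hop].append(node)

    loads = {}

    def outgoing(node):
        # Updates leaving `node` toward its parent: its own plus its subtree's.
        count = 1 if node in workers else 0
        for child in children[node]:
            count += outgoing(child)
        if node in aggregating and count > 0:
            count = 1  # the switch sums the updates and forwards a single one
        loads[(node, parent[node])] = count
        return count

    for child in children[root]:
        outgoing(child)
    return loads


# Hypothetical two-level tree: four workers, two edge switches, one core switch.
parent = {
    "s0": "ps",
    "s1": "s0", "s2": "s0",
    "w1": "s1", "w2": "s1",
    "w3": "s2", "w4": "s2",
}
workers = {"w1", "w2", "w3", "w4"}

no_agg = link_loads(parent, workers, "ps")
edge_agg = link_loads(parent, workers, "ps", aggregating={"s1", "s2"})
core_agg = link_loads(parent, workers, "ps", aggregating={"s0"})

print(no_agg[("s0", "ps")], edge_agg[("s0", "ps")], core_agg[("s0", "ps")])
```

In this toy tree, aggregating at the edge switches relieves both the core links and the link into the parameter server, whereas aggregating only at the core switch relieves the last link alone. Which switches are worth using therefore depends on where the workers sit and which links are congested, which is the selection question the abstract raises.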