Citation: Ming-Tao Ji, Yi-Bo Jin, Zhu-Zhong Qian, Tuo Cao, Bao-Liu Ye. Orchestrating In-network Aggregation for Distributed Machine Learning via In-band Network Telemetry[J]. Journal of Computer Science and Technology. DOI: 10.1007/s11390-024-3342-y

Orchestrating In-network Aggregation for Distributed Machine Learning via In-band Network Telemetry

  • Distributed machine learning systems train models through iterative updates exchanged between parallel workers and the parameter server. To expedite these transmissions, in-network aggregation combines updates during packet forwarding at programmable switches, reducing the traffic over bottleneck links. However, existing in-network aggregation schemes do not select the most suitable switches for varying worker distributions and fail to capture the dynamic network status. Based on the status derived from in-band network telemetry, we select the best aggregation switches by formulating an optimization problem whose objective is to minimize the transmission latency. Although the problem is a non-linear integer program, through careful transformations we obtain a substitute with totally unimodular constraints and a separable convex objective, which is solved to yield the integral optimum. We implement our in-network aggregation protocol and reconstruct the in-band network telemetry protocol on real devices, i.e., Barefoot Wedge100BF switches and Dell servers. The evaluation results indicate that the completion time of the related coflows decreases by 40% on average compared with other strategies, an improvement of at least 30% over the state of the art.
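The abstract describes the formulation only at a high level. Purely as an illustrative sketch of what a telemetry-driven switch-selection program of this kind could look like, with notation ($x_{w,s}$, $d_{w,s}$, $C_s$) assumed here for exposition rather than taken from the paper:

% Hypothetical switch-selection sketch; not the paper's actual formulation.
\[
\begin{aligned}
\min_{x}\ \ & \max_{w \in \mathcal{W}}\ \sum_{s \in \mathcal{S}} d_{w,s}\, x_{w,s} && \text{(worst-case transmission latency)}\\
\text{s.t.}\ \ & \sum_{s \in \mathcal{S}} x_{w,s} = 1, && \forall\, w \in \mathcal{W} \quad \text{(each worker aggregates at exactly one switch)}\\
& \sum_{w \in \mathcal{W}} x_{w,s} \le C_s, && \forall\, s \in \mathcal{S} \quad \text{(per-switch aggregation capacity)}\\
& x_{w,s} \in \{0,1\}, && \forall\, w \in \mathcal{W},\, s \in \mathcal{S},
\end{aligned}
\]

where $\mathcal{W}$ denotes the workers, $\mathcal{S}$ the candidate programmable switches, $d_{w,s}$ the worker-to-switch latency derived from in-band network telemetry, and $C_s$ the aggregation capacity of switch $s$. A min-max objective of this form makes the program a non-linear integer program; per the abstract, the paper's approach is to transform such a problem into a substitute with totally unimodular constraints and a separable convex objective, so that solving it yields an integral optimum.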
