Optimizing Distributed Machine Learning via In-Network Computation

吉明涛; 金熠波; 钱柱中; 曹拓; 叶保留

doi:10.1007/s11390-024-3342-y

Summary:

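Background: Distributed machine learning is now a common solution for large-scale training in data centers. It has gradually become a basic data center component and is applied in many fields, such as computer vision and natural language processing. Typically, in distributed training, multiple workers each train on local samples, compute gradients, and send them to the parameter server; the parameter server aggregates the gradients and distributes the updated model back to the workers. This process usually iterates many times to raise model accuracy and meet the varying inference requirements of applications. With the emergence of various hardware accelerators, local training time on the workers can be reduced significantly, whereas the communication overhead caused by model updates is usually large and hard to reduce, and it has gradually become the bottleneck of distributed training.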

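Objective: This study uses the network information provided by in-band network telemetry to select suitable aggregation locations for the data flows of distributed training, thereby reducing the communication overhead of distributed machine learning training. By aggregating data at suitably chosen network switches, the approach reduces the amount of data that must be transmitted, which alleviates network congestion and improves data processing efficiency. In addition, the technique improves bandwidth utilization and accelerates model training, enabling more efficient learning and development. This is of great significance for promoting the wide deployment of artificial intelligence in practical applications.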
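The traffic reduction described above can be illustrated with a minimal sketch. It assumes a toy topology in which each switch has one uplink toward the parameter server and every worker sends one gradient vector per iteration; the names `aggregate` and `uplink_vectors` are hypothetical and are not taken from the paper's protocol.

```python
from typing import Dict, List, Set


def aggregate(gradients: List[List[float]]) -> List[float]:
    """Element-wise sum of equally sized gradient vectors.

    This mimics what an aggregating switch would compute before
    forwarding a single combined update (illustrative assumption).
    """
    return [sum(column) for column in zip(*gradients)]


def uplink_vectors(workers_per_switch: Dict[str, int], aggregating: Set[str]) -> int:
    """Count gradient vectors that cross the switch-to-server uplinks.

    A switch chosen for aggregation forwards one vector regardless of
    how many workers it serves; any other switch forwards all of them.
    """
    total = 0
    for switch, n_workers in workers_per_switch.items():
        if switch in aggregating and n_workers > 0:
            total += 1
        else:
            total += n_workers
    return total


# Hypothetical placement: 4 workers under s1, 2 workers under s2.
placement = {"s1": 4, "s2": 2}
print(uplink_vectors(placement, set()))         # no aggregation
print(uplink_vectors(placement, {"s1"}))        # aggregate only at s1
print(uplink_vectors(placement, {"s1", "s2"}))  # aggregate at both
```

In this toy setting, aggregating at the switch with the most workers removes the most uplink traffic. The sketch ignores link bandwidth and queueing, which is the kind of dynamic network status that telemetry would supply.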

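Methods: This study formulates the selection of aggregation locations as an optimization problem whose objective is the completion time of the data flows, subject to constraints on the number of aggregations and on resource capacity. Specifically, the formulated problem is non-linear, and its decision variables are integers. Inspecting its form, we observe that the constraints are totally unimodular and that the objective can be equivalently transformed into a separable convex function. By known results in convex optimization, a problem satisfying these two conditions can be equivalently transformed into a linear program over the reals; that is, the real-valued optimal solution of the transformed problem is guaranteed to be integral. This allows us to solve the problem directly with an off-the-shelf solver and deploy the result.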
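To make the unimodularity condition concrete, the following sketch brute-force checks whether a small constraint matrix is totally unimodular, meaning every square submatrix has determinant -1, 0, or 1. The matrix `ASSIGNMENT` is a hypothetical toy (two flows, two switches, one row per flow and one per switch) and is not the paper's actual formulation; the check is exponential and only practical for tiny matrices.

```python
from itertools import combinations
from typing import List

Matrix = List[List[int]]


def det(m: Matrix) -> int:
    """Integer determinant by Laplace expansion along the first row."""
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0
    for j in range(n):
        if m[0][j] == 0:
            continue
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total


def is_totally_unimodular(a: Matrix) -> bool:
    """Return True if every square submatrix has determinant in {-1, 0, 1}."""
    rows, cols = len(a), len(a[0])
    for k in range(1, min(rows, cols) + 1):
        for r in combinations(range(rows), k):
            for c in combinations(range(cols), k):
                sub = [[a[i][j] for j in c] for i in r]
                if det(sub) not in (-1, 0, 1):
                    return False
    return True


# Variables ordered as x[f1,s1], x[f1,s2], x[f2,s1], x[f2,s2] (hypothetical).
ASSIGNMENT = [
    [1, 1, 0, 0],  # flow f1 is aggregated at exactly one switch
    [0, 0, 1, 1],  # flow f2 is aggregated at exactly one switch
    [1, 0, 1, 0],  # capacity of switch s1
    [0, 1, 0, 1],  # capacity of switch s2
]

# Incidence matrix of an odd cycle, a standard non-example.
ODD_CYCLE = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
]

print(is_totally_unimodular(ASSIGNMENT))  # True
print(is_totally_unimodular(ODD_CYCLE))   # False
```

When the constraint matrix passes this check and the right-hand sides are integers, every vertex of the feasible polyhedron is integral, which is why a linear programming solver can return an integer solution without branch and bound.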

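Results: We tested the performance of our proposed protocol, AGG, on real P4 switches. In addition, using the Mininet and bmv2 software environment, we evaluated our algorithm in large-scale networks and compared it with existing algorithms. The experimental results show that, compared with existing in-network aggregation mechanisms, AGG effectively reduces the communication time of distributed training and improves training performance by 40%.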

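Conclusions: As data center networks and applications continue to evolve, P4 programmable switches are drawing growing attention, and their flexible programmability gives network builders an opportunity to restructure network architectures and optimize network performance. In future work, we will continue to optimize applications based on P4 programmable switches, including machine learning inference applications, data analytics applications, and heartbeat mechanisms. For machine learning inference in particular, we have already deployed a decision tree model in the switch to recognize handwritten input characters. In the future, research in this direction will have a far-reaching impact on fields such as edge intelligence and cluster intelligence.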

Abstract: Distributed machine learning systems train models via iterative updates between parallel workers and the parameter server. To expedite these transmissions, in-network aggregation of updates during packet forwarding at programmable switches decreases the network traffic over bottleneck links. However, existing in-network aggregation schemes do not adequately select the most suitable switches for various worker distributions and fail to capture the dynamic network status. Based on the status derived from in-band network telemetry, we aim to select the best switches by solving an optimization problem that we formulate with the objective of minimizing transmission latency. Although the problem is a non-linear integer program, by adopting delicate transformations we obtain a substitute with totally unimodular constraints and a separable convex objective, which is then solved to obtain the integral optimum. We implement our in-network aggregation protocol and reconstruct the in-band network telemetry protocol on real devices, i.e., Barefoot Wedge100BF switches and Dell servers. We then evaluate the performance of our proposed AGG algorithm; the results indicate that the completion time of related coflows decreases by 40% on average compared with other strategies, and that performance improves by at least 30% compared with the state of the art.

Orchestrating In-Network Aggregation for Distributed Machine Learning via In-Band Network Telemetry