
VastPipe: A High-Throughput Inference System via Adaptive Space-Division Multiplexing for Diverse Accelerators

  • Abstract:
    Background With the growing demand for batched deep learning inference, deploying multiple deep neural network (DNN) model services concurrently on the same DNN accelerator allows spatial multiplexing of the accelerator to improve resource utilization. However, this intensifies the complexity of scheduling within the cluster. Fine-grained joint optimization of model co-location combinations and cluster resource allocation gives the scheduler a vast configuration space. The challenge can be formulated as a combinatorial optimization problem whose complexity grows exponentially with the cluster size and the number of incoming requests, making it difficult to find the optimal placement in such an enormous search space.
    Objective To solve the combinatorial optimization problem arising in spatially multiplexed accelerator clusters using reinforcement learning: by modeling task characteristics and cluster state, a reinforcement learning agent finds an optimal scheduling configuration that jointly optimizes model placement and resource allocation.
    Methods This paper proposes VastPipe, a high-throughput inference system that maps large batches of heterogeneous requests onto clusters backed by fine-grained resource management. Reinforcement learning is used to learn the optimal mapping from large-scale requests to cluster resources, including the model co-location plan on each accelerator and the resource allocation of each model service. In addition, VastPipe is optimized for and evaluated on mainstream NVIDIA and AMD GPUs for compatibility.
    Results Experimental results show that on a large-scale cluster comprising 250 nodes and 1000 neural processing units (NPUs), VastPipe achieves average performance improvements of 2.2x, 1.3x, and 1.2x over the baseline systems. Furthermore, VastPipe's compatibility with mainstream GPUs is optimized and evaluated: it achieves average throughput improvements of 2.7x on the NVIDIA A100 GPU and 1.9x on the AMD MI100 GPU.
    Conclusion This work aims to improve the inference throughput of accelerators in DNN computing clusters, particularly in scenarios involving large-scale and heterogeneous DNN inference requests. VastPipe performs fine-grained resource allocation and adaptive scheduling-configuration planning based on reinforcement learning. Building on this, we will further study batched serving systems for popular large language models on today's advanced DNN accelerators.

     

    Abstract: The escalating demand for batched deep learning inference requires concurrent deployment of multiple deep neural network (DNN) models on a shared accelerator, thereby enabling spatial multiplexing to enhance resource utilization. Spatial multiplexing for co-locating multiple model services on the same accelerator increases the complexity of scheduling within a cluster. The meticulous collaborative optimization of model co-location combinations and resource allocation in a cluster creates an extensive configuration space for scheduling. In this paper, we present VastPipe, a high-throughput inference system that schedules batch-oriented and heterogeneous requests on spatial multiplexing-enabled computing clusters. VastPipe determines optimal scheduling configurations by jointly optimizing model co-location and resource allocation using reinforcement learning to solve this combinatorial optimization problem. The experimental results demonstrate that on a large-scale cluster comprising 250 machine nodes with 1000 neural processing units (NPUs), VastPipe achieves average performance improvements of 2.2x, 1.3x, and 1.2x over the respective baseline systems. Furthermore, VastPipe is optimized and evaluated on mainstream GPUs. The results demonstrate that VastPipe achieves average throughput improvements of 2.7x on the NVIDIA A100 GPU and 1.9x on the AMD MI100 GPU.
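To make the combinatorial-explosion claim concrete, the following is a minimal sketch (not the paper's actual formulation) that counts scheduling configurations under a deliberately simplified model: each model service independently picks one accelerator and one of a fixed set of discrete resource shares on it. Even this understated model, which ignores co-location interference and capacity constraints, grows exponentially in the number of model services. The function name and all numbers are illustrative assumptions, not values from the paper.

```python
def config_space_size(num_models: int, num_accelerators: int, partitions: int) -> int:
    """Count scheduling configurations in a simplified model:
    each of `num_models` model services independently chooses
    (a) one of `num_accelerators` accelerators and
    (b) one of `partitions` discrete resource shares on it.
    The space is therefore (num_accelerators * partitions) ** num_models.
    """
    return (num_accelerators * partitions) ** num_models


if __name__ == "__main__":
    # Illustrative toy cluster: 8 model services, 4 accelerators,
    # 7 resource-share levels each -> 28^8 ≈ 3.8e11 configurations.
    print(config_space_size(8, 4, 7))
```

Under this toy model a cluster of the scale evaluated in the paper (1000 NPUs) is far beyond exhaustive enumeration, which is why the authors turn to a learned policy rather than brute-force search.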

     

