VastPipe: A High-Throughput Inference System using Adaptive Space-Division Multiplexing DNN Accelerators
Abstract
The escalating demand for batched deep learning inference calls for deploying multiple deep neural network (DNN) models concurrently on a shared accelerator, using spatial multiplexing to raise resource utilization.
Co-locating multiple model services on the same accelerator, however, complicates cluster-level scheduling: the choice of model co-location combinations must be optimized jointly with per-model resource allocation, creating an extensive configuration space.
In this paper, we present VastPipe, a high-throughput inference system that schedules batched, heterogeneous requests on clusters of spatial-multiplexing-enabled accelerators. VastPipe determines optimal scheduling configurations by jointly optimizing model co-location and resource allocation, using reinforcement learning to solve the resulting combinatorial optimization problem.
Experimental results on a large-scale cluster of 250 machine nodes with 1,000 neural processing units (NPUs) show that VastPipe achieves average performance improvements of $2.2\times$, $1.3\times$, and $1.2\times$ over the three baseline systems, respectively.
Furthermore, VastPipe is optimized for and evaluated on mainstream GPUs, achieving average throughput improvements of $2.7\times$ on the NVIDIA A100 GPU and $1.9\times$ on the AMD MI100 GPU.
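To make the combinatorial structure of this joint optimization concrete, the following is a minimal, self-contained Python sketch. It is hypothetical, not VastPipe's actual implementation: it enumerates model co-location pairs and space-division resource splits, and uses a simple epsilon-greedy value update as a stand-in for the paper's reinforcement-learning agent to find the highest-throughput configuration. All model names and profiled throughput numbers are invented for illustration.

import itertools
import random

# Hypothetical profiled throughput (requests/s) of each model when granted a
# given fraction of one accelerator's compute resources.
PROFILE = {
    ("resnet50", 0.25): 300, ("resnet50", 0.5): 520, ("resnet50", 0.75): 640,
    ("bert",     0.25): 120, ("bert",     0.5): 230, ("bert",     0.75): 300,
    ("gpt2",     0.25):  60, ("gpt2",     0.5): 110, ("gpt2",     0.75): 150,
}

MODELS = ["resnet50", "bert", "gpt2"]
SPLITS = [(0.25, 0.75), (0.5, 0.5), (0.75, 0.25)]  # space-division partitions

def reward(pair, split):
    # Aggregate throughput of two co-located models under a resource split.
    (a, b), (ra, rb) = pair, split
    return PROFILE[(a, ra)] + PROFILE[(b, rb)]

def search(episodes=200, eps=0.2):
    # Epsilon-greedy exploration of the (co-location pair, resource split)
    # configuration space, with an incremental value estimate per config.
    configs = [(p, s) for p in itertools.combinations(MODELS, 2) for s in SPLITS]
    q = {c: 0.0 for c in configs}
    for _ in range(episodes):
        c = random.choice(configs) if random.random() < eps else max(q, key=q.get)
        q[c] += 0.1 * (reward(*c) - q[c])  # move estimate toward observed reward
    return max(q, key=q.get)

if __name__ == "__main__":
    best = search()
    print("best co-location:", best, "throughput:", reward(*best))

Even in this toy setting with 3 models and 3 splits there are 9 configurations per device; at cluster scale, with many models, devices, and finer partitions, the space grows combinatorially, which is what motivates a learned search policy rather than exhaustive enumeration.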