TCLI: A Triple-Cache Layer-Wise Inference System for Reducing Redundant Data Loading and Computation in Graph Neural Networks
-
Abstract
With the rapid development of computing power infrastructure, achieving efficient inference of graph neural networks (GNNs) on resource-constrained nodes has become a key challenge that limits fine-grained scheduling and service-oriented utilization of computing resources. Existing node-wise inference methods are simple but inefficient: when dealing with multi-hop neighbors, overlapping neighborhoods lead to redundant computation, which significantly increases computational overhead and becomes a bottleneck in the inference process. In addition, the limited data transfer efficiency between host memory and GPU memory results in slow inference and underutilized resources. In this paper, we propose TCLI, a triple-cache layer-wise inference system tailored for GNN tasks that minimizes redundant loading during sampling and redundant computation during evaluation, thereby accelerating GNN inference. TCLI divides idle GPU memory into three regions: an adjacency matrix cache, a node feature cache, and a node embedding cache, which accelerate sampling, feature loading, and computation, respectively. We also merge multiple mini-batches into a super-batch to maximize GPU hardware utilization, and we customize a pipeline for our system to hide data loading time during inference. Extensive experimental results show that TCLI achieves an average end-to-end inference speedup of 10.40x over DGL and 10.24x over the state-of-the-art layer-wise inference system RAIN, with accuracy loss kept below 0.5%.
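To make the triple-cache idea concrete, the following is a minimal sketch, assuming PyTorch and a single CUDA device; the class name TripleCache, the per-region budgets, and the fill-until-full caching policy are illustrative assumptions and do not reflect TCLI's actual implementation.

```python
import torch

class TripleCache:
    """Hypothetical split of idle GPU memory into three cache regions:
    adjacency structure, raw node features, and intermediate embeddings."""

    def __init__(self, adj_budget, feat_budget, emb_budget, device="cuda"):
        # Per-region capacities (number of cached entries) carved out of the
        # idle GPU memory budget; the sizing policy here is an assumption.
        self.device = device
        self.adj_cache = {}    # node id -> cached neighbor list (adjacency rows)
        self.feat_cache = {}   # node id -> cached raw feature tensor on GPU
        self.emb_cache = {}    # (layer, node id) -> cached intermediate embedding
        self.budgets = (adj_budget, feat_budget, emb_budget)

    def get_features(self, node_ids, cpu_features):
        """Fetch features for node_ids, serving hits from GPU memory and
        loading misses from host memory (caching them while space remains)."""
        out = []
        for nid in node_ids:
            if nid in self.feat_cache:            # hit: no host-to-GPU copy needed
                out.append(self.feat_cache[nid])
            else:                                 # miss: copy from host, cache if room
                t = cpu_features[nid].to(self.device)
                if len(self.feat_cache) < self.budgets[1]:
                    self.feat_cache[nid] = t
                out.append(t)
        return torch.stack(out)
```

In this sketch, cache hits avoid the host-to-GPU transfers that the abstract identifies as a bottleneck; analogous lookups on the adjacency and embedding caches would skip redundant sampling and redundant layer computation, respectively.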
-