Ming Dun, Hua-Wei Cao, Shu-Han Song, Yuan Zhang, Xiao-Chun Ye. Auto-SpMM: Towards Efficient SpMM on Heterogeneous Server via Automatic Resource Mapping and Pipeline Generation. Journal of Computer Science and Technology. DOI: 10.1007/s11390-026-5632-z

Auto-SpMM: Towards Efficient SpMM on Heterogeneous Server via Automatic Resource Mapping and Pipeline Generation

  • Sparse matrix-matrix multiplication (SpMM) is a core computing routine in many fields, including artificial intelligence and scientific computing, so accelerating SpMM has become a focus for HPC researchers. Many recent works optimize the SpMM computation kernel on GPU. However, once data transfer time is counted, the overall SpMM performance on GPU does not achieve a consistent speedup over CPU implementations. SpMM on GPU can use pipelining to overlap computation with communication, but whether an arbitrary sparse matrix benefits from pipelining remains undetermined, and manually tuning the optimal data partition length for the pipeline is complicated because sparse matrices vary in size and in the distribution of their nonzeros. To address these challenges, we propose Auto-SpMM, an auto-tuning scheme based on machine learning models that can efficiently optimize SpMM for an arbitrary matrix on a CPU-GPU heterogeneous server. Auto-SpMM first extracts concise features that capture the memory access and computation patterns of a sparse matrix, and then automatically selects the optimal computation resource and algorithm, along with the best pipeline pattern, for that SpMM. Extensive evaluations with 2,644 sparse matrices demonstrate that Auto-SpMM achieves average speedups of 9.71x over MKL on CPU and 13.20x over cuSPARSE on GPU.
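
The two ingredients named in the abstract, matrix features and a data partition length for the pipeline, can be illustrated with a minimal Python sketch. It is not the Auto-SpMM implementation: the abstract does not list the actual features or the partitioning rule, so the row-length statistics in `csr_features`, the fixed-size row chunks in `row_partitions`, and the reference kernel `spmm_csr` are all hypothetical stand-ins, and the CSR storage format is an assumption.

```python
from statistics import mean, pstdev


def csr_features(indptr, n_cols):
    """Summarize a CSR matrix with a few row-length statistics.

    Illustrative features only; the paper's real feature set is not given here.
    """
    n_rows = len(indptr) - 1
    row_nnz = [indptr[i + 1] - indptr[i] for i in range(n_rows)]
    nnz = indptr[-1]
    return {
        "n_rows": n_rows,
        "n_cols": n_cols,
        "nnz": nnz,
        "density": nnz / (n_rows * n_cols),
        "mean_row_nnz": mean(row_nnz),
        "std_row_nnz": pstdev(row_nnz),
        "max_row_nnz": max(row_nnz),
    }


def row_partitions(n_rows, chunk_rows):
    """Split rows into contiguous [start, stop) chunks of at most chunk_rows rows.

    In a pipelined run, each chunk would be transferred and computed as one unit;
    chunk_rows plays the role of the partition length that has to be tuned.
    """
    return [(start, min(start + chunk_rows, n_rows))
            for start in range(0, n_rows, chunk_rows)]


def spmm_csr(indptr, indices, data, dense, start, stop):
    """Reference SpMM for rows [start, stop) of a CSR matrix times a dense matrix."""
    n_out_cols = len(dense[0])
    out = []
    for i in range(start, stop):
        row = [0] * n_out_cols
        for k in range(indptr[i], indptr[i + 1]):
            a, j = data[k], indices[k]
            for c in range(n_out_cols):
                row[c] += a * dense[j][c]
        out.append(row)
    return out


# A 4x3 sparse matrix [[1,0,2],[0,0,3],[4,5,0],[0,0,0]] in CSR form.
indptr = [0, 2, 3, 5, 5]
indices = [0, 2, 2, 0, 1]
data = [1, 2, 3, 4, 5]
dense = [[1, 2], [3, 4], [5, 6]]

features = csr_features(indptr, n_cols=3)
chunks = row_partitions(features["n_rows"], chunk_rows=3)

# Computing chunk by chunk gives the same result as one full pass.
chunked = []
for start, stop in chunks:
    chunked.extend(spmm_csr(indptr, indices, data, dense, start, stop))
full = spmm_csr(indptr, indices, data, dense, 0, features["n_rows"])
```

In a real tuner, a feature dictionary like this would be the input to a trained model that chooses CPU or GPU, the algorithm, and whether and how to pipeline. Here the chunk size is fixed by hand purely to show that row-wise partitioning preserves the result.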