SwFormer: Enabling Faster Foundation Models on New Sunway Supercomputer via Holistic Kernel Tiling and Scheduling

吴若晗; 朱先语; 陈俊仕; 安虹

doi:10.1007/s11390-025-4761-0

Summary

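Background: Deep learning continues to transform many fields and has driven the rapid development of large Transformer-based foundation models such as GPT-3. Training and inference for these models place extremely high demands on compute and memory performance. The new-generation Sunway supercomputer consists of a large number of SW26010pro processors and offers two execution modes: all-shared mode and single-CG (core group) mode. Because all-shared mode provides a larger memory capacity, it can run larger foundation models. However, existing optimization work for the SW26010pro mainly targets AI operator performance in single-CG mode, and improving operator performance in all-shared mode still faces many challenges.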
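Objective: This work addresses insufficient operator performance in all-shared mode, particularly in the six-CG configuration. Using operator tiling and operator scheduling, it aims to improve all-shared-mode operator performance and model training efficiency, while reducing the engineering effort of porting single-CG operators to multi-CG operators.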
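Methods: We propose SwFormer, a two-stage optimization framework that combines intra-op tiling with inter-op scheduling. Intra-op tiling splits each operator into fine-grained tiles and uses offline profiling to determine the best tiling strategy. For GEMM, a tile-fusion strategy merges small tiles into larger ones to improve performance. For SWattention, we improve on the existing design and implement SWattentionV2, which uses semaphores to coordinate multiple core groups (CGs) so that the computation of a single attention head can be spread across several CGs. Inter-op scheduling uses a graph traversal based on breadth-first search (BFS) to analyze dependencies between operators and reorder their computation; combined with offline profiling, it also adjusts operator tile sizes to achieve more holistic performance optimization.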
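Results: In training GPT-3 6.7B and 13B models, intra-op tiling achieves an end-to-end training speedup of up to 1.27x over existing all-shared-mode operator libraries such as SWDNNv2 and SWattention. Inter-op scheduling builds on this and further improves performance to up to 1.32x.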
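Conclusions: SwFormer simplifies the design of multi-CG operators, provides a solution for running foundation models efficiently on the new-generation Sunway supercomputer, and offers a reference for the future development of the AI computing ecosystem on SW26010pro processors.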
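
The BFS-based inter-op scheduling described above can be illustrated with a minimal sketch. It is not the SwFormer implementation: the function name `bfs_levels`, the kernel names, and the dependency graph are all hypothetical, and the sketch only shows the general idea of grouping tiled kernels into dependency levels with a breadth-first traversal, so that kernels in the same level are free to be reordered or co-scheduled.

```python
from collections import defaultdict, deque


def bfs_levels(deps):
    """Group kernels into levels by breadth-first traversal of a dependency graph.

    deps maps every kernel name to the list of kernels it depends on
    (hypothetical input format; each kernel must appear as a key).
    Kernels inside one level do not depend on each other.
    """
    indegree = {kernel: len(parents) for kernel, parents in deps.items()}
    users = defaultdict(list)  # parent -> kernels that consume its output
    for kernel, parents in deps.items():
        for parent in parents:
            users[parent].append(kernel)

    frontier = deque(sorted(k for k, d in indegree.items() if d == 0))
    levels = []
    while frontier:
        level = list(frontier)
        frontier.clear()
        levels.append(level)
        ready = []
        for kernel in level:
            for user in users[kernel]:
                indegree[user] -= 1
                if indegree[user] == 0:  # all inputs of `user` are now produced
                    ready.append(user)
        frontier.extend(sorted(ready))

    if sum(len(level) for level in levels) != len(deps):
        raise ValueError("dependency graph has a cycle or a missing kernel")
    return levels


# Hypothetical tiled kernels of one attention block.
example = {
    "q_proj_tile0": [],
    "q_proj_tile1": [],
    "k_proj_tile0": [],
    "attn_tile0": ["q_proj_tile0", "k_proj_tile0"],
    "attn_tile1": ["q_proj_tile1", "k_proj_tile0"],
    "out_proj": ["attn_tile0", "attn_tile1"],
}
for depth, level in enumerate(bfs_levels(example)):
    print(depth, level)
```

Each printed level is a set of kernels whose inputs are already available; a real scheduler would then choose an order inside each level, for example guided by profiled kernel times.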

Abstract: Deep learning's continuous evolution has driven the creation of increasingly large foundation models, such as GPT-3, which require optimized performance on large-scale computing platforms. The new Sunway Supercomputer, equipped with numerous SW26010pro processors, supports AI workloads in both all-shared and single-CG (core group) modes. However, existing optimizations primarily target AI operators such as General Matrix Multiplication (GEMM) in the single-CG mode, leaving open the challenge of scaling performance across all six CGs in the all-shared mode. This paper introduces SwFormer, a framework designed to accelerate foundation models via intra-op tiling and inter-op scheduling. The intra-op tiling method breaks operators down into fine-grained tiled kernels and uses an offline profiling-based approach to determine the optimal tiling strategy. The inter-op scheduling method uses heuristic graph traversal algorithms to automatically reorder the computation of these tiled kernels, thereby maximizing hardware utilization. Compared with operator libraries for the all-shared mode such as SWDNNv2 and SWattention, SwFormer's intra-op tiling method accelerates end-to-end training of GPT-3 6.7B and 13B models by up to 1.27x. Evaluated with GPT-style models, the inter-op scheduling method further outperforms the intra-op tiling method by up to 1.32x.
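
The offline profiling-based choice of tiling strategy can be sketched in a few lines. This is an illustration under stated assumptions, not the method used in the paper: the function `pick_tile`, the cost model (tile count times per-tile time, ignoring parallelism across CGs), and the profile numbers are all hypothetical placeholders rather than measurements.

```python
import math


def pick_tile(m, n, profile):
    """Return the tile shape with the lowest estimated time for an m x n output.

    profile maps (tile_m, tile_n) -> profiled time of one tiled kernel,
    in any consistent unit. The estimate is simply tile count * per-tile time,
    a deliberately simple, hypothetical cost model.
    """
    best_shape, best_cost = None, math.inf
    for (tile_m, tile_n), tile_time in sorted(profile.items()):
        tile_count = math.ceil(m / tile_m) * math.ceil(n / tile_n)
        cost = tile_count * tile_time
        if cost < best_cost:
            best_shape, best_cost = (tile_m, tile_n), cost
    return best_shape, best_cost


# Placeholder profile table (made-up values, not measured on SW26010pro).
profile = {
    (128, 128): 1.0,
    (256, 256): 3.0,
    (512, 512): 14.0,
}
shape, cost = pick_tile(1024, 1024, profile)
print(shape, cost)
```

With these placeholder values the mid-sized tile wins: the smallest tile needs too many kernel launches and the largest tile is disproportionately slow per launch, which is the kind of trade-off an offline profile is meant to expose.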
