SwFormer: Enabling Faster Foundation Models on New Sunway Supercomputer via Holistic Kernel Tiling and Scheduling
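Abstract:

Background: Deep learning continues to transform many fields and has driven the rapid development of large-scale Transformer-based foundation models such as GPT-3. Training and inference for these models, however, place extremely high demands on compute and memory performance. The new-generation Sunway supercomputer consists of a large number of SW26010pro processors and offers two execution modes: all-shared mode and single-core-group (single-CG) mode. Because all-shared mode provides larger memory capacity, it can run larger foundation models. Existing optimization work for the SW26010pro, however, mainly targets AI operator performance in single-CG mode, and improving operator performance in all-shared mode still faces many challenges.

Objective: This study addresses insufficient operator performance in all-shared mode, particularly in the six-CG configuration. Using operator tiling and operator scheduling, it aims to improve operator performance in all-shared mode, improve model training efficiency, and reduce the engineering complexity of porting single-CG operators to multi-CG operators.

Methods: We propose SwFormer, a two-stage optimization framework based on intra-op tiling and inter-op scheduling. The intra-op tiling method partitions operators into fine-grained tiles and uses offline profiling to determine the optimal tiling strategy. For the GEMM operator, a tile-fusion strategy merges small tiles into larger ones to improve performance. For SWattention, we improve on it and implement SWattentionV2, which uses semaphores for cooperation among multiple core groups (CGs), thereby extending the computation of a single attention head across multiple CGs. The inter-op scheduling method uses a graph traversal based on breadth-first search (BFS) to analyze the dependencies between operators and reorder their computation, and combines this with offline profiling to adjust operator tile sizes, achieving more holistic performance optimization.

Results: In training GPT-3 6.7B and 13B models, the intra-op tiling method improves end-to-end training performance to up to 1.27x that of existing all-shared-mode operator libraries (such as SWDNNv2 and SWattention). The inter-op scheduling method further raises performance on top of this to up to 1.32x.

Conclusion: SwFormer simplifies the design of multi-CG operators, provides a solution for running foundation models efficiently on the new-generation Sunway supercomputer, and offers a reference for the future development of the AI computing ecosystem on the SW26010pro processor.

Abstract: The continuous evolution of deep learning has driven the creation of increasingly large foundation models, such as GPT-3, which require optimized performance on large-scale computing platforms. The new Sunway Supercomputer, equipped with numerous SW26010pro processors, supports AI workloads in both all-shared and single-CG (core group) modes. However, existing optimizations primarily target AI operators such as General Matrix Multiplication (GEMM) in the single-CG mode, leaving open the challenge of scaling performance across all six CGs in the all-shared mode. This paper introduces SwFormer, a framework designed to accelerate foundation models via intra-op tiling and inter-op scheduling. The intra-op tiling method breaks operators down into fine-grained tiled kernels and uses offline profiling to determine the optimal tiling strategy. The inter-op scheduling method uses heuristic graph traversal algorithms to automatically reorder the computation of these tiled kernels, thereby maximizing hardware utilization. Compared with operator libraries for the all-shared mode such as SWDNNv2 and SWattention, SwFormer's intra-op tiling method accelerates end-to-end training of GPT-3 6.7B and 13B models by up to 1.27x. Evaluated with GPT-style models, the inter-op scheduling method further outperforms the intra-op tiling method by up to 1.32x.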
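The inter-op scheduling idea of traversing the dependency graph of tiled kernels breadth-first can be illustrated with a minimal Python sketch. This is not SwFormer's implementation: the function `bfs_levels`, the graph `EXAMPLE_DEPS`, and all kernel names in it are hypothetical, and the sketch only groups kernels into dependency levels, without the profiling-based tile-size adjustment the abstract describes.

```python
from collections import deque


def bfs_levels(deps):
    """Group kernels into dependency levels with a breadth-first traversal.

    `deps` maps each kernel name to the list of kernels it depends on.
    Every dependency must itself be a key of `deps`.
    Kernels that land in the same level do not depend on one another,
    so a scheduler is free to reorder or co-schedule them.
    """
    # Count unmet dependencies and build the reverse (consumer) edges.
    indegree = {kernel: len(parents) for kernel, parents in deps.items()}
    consumers = {kernel: [] for kernel in deps}
    for kernel, parents in deps.items():
        for parent in parents:
            consumers[parent].append(kernel)

    # Start from kernels with no dependencies (sorted for determinism).
    frontier = deque(sorted(k for k, d in indegree.items() if d == 0))
    levels = []
    while frontier:
        level = list(frontier)
        frontier.clear()
        levels.append(level)
        ready = []
        for kernel in level:
            for consumer in consumers[kernel]:
                indegree[consumer] -= 1
                if indegree[consumer] == 0:
                    ready.append(consumer)
        frontier.extend(sorted(ready))

    if sum(len(level) for level in levels) != len(deps):
        raise ValueError("dependency graph contains a cycle")
    return levels


# Hypothetical tiled kernels: two GEMM tiles feed two attention heads,
# whose outputs are both consumed by a projection GEMM.
EXAMPLE_DEPS = {
    "qkv_gemm_t0": [],
    "qkv_gemm_t1": [],
    "attn_h0": ["qkv_gemm_t0"],
    "attn_h1": ["qkv_gemm_t1"],
    "proj_gemm": ["attn_h0", "attn_h1"],
}
```

Calling `bfs_levels(EXAMPLE_DEPS)` yields three levels: the two GEMM tiles, then the two attention heads, then the projection. Within a level the kernels are independent, which is the freedom a scheduler of this kind could exploit when reordering work across core groups.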