
AI Computing Systems for Large Language Models Training

  • Abstract:
    Background Large language models (LLMs) have become a focal area of artificial intelligence research, and their rapid development is inseparable from the AI computing systems that support their computation. General-purpose processors can no longer keep pace with the demands of the fast-moving AI field, and high-performance AI chips have become the core of modern AI computing systems. The unprecedented scale of large models exceeds the compute capacity of any single AI chip, so the workload must be distributed across multiple accelerators and trained collaboratively on distributed systems.
    Objective This paper presents a detailed overview of the AI computing systems used for training large language models. It focuses on the execution and scheduling of large-model algorithms, emphasizing the key roles of distributed computing techniques, memory-management optimization, and improvements in computational efficiency. By connecting the top-level LLM algorithms, the middle-level programming software, and the underlying AI chips and accelerators, it aims to give readers a comprehensive understanding of the AI software and hardware stack.
    Methods Relevant papers were retrieved from recent industry publications and academic databases, then classified and organized into the major categories of algorithms, hardware, and software. We first classify mainstream models, describe their structural characteristics, discuss the training and inference modes, and evaluate their operational intensity; we then discuss computing nodes and computing clusters in turn; next, we survey the programming frameworks used for training and their techniques for distributed computing, memory management, and computational efficiency; finally, we outline future work on algorithms, hardware, and software.
    Results This paper surveys more than ten programming frameworks for large-model training, and introduces and compares current optimization techniques for distributed computing, memory management, and computational efficiency. Because the attention mechanism has low operational intensity, its efficiency is limited; raising hardware utilization therefore requires both algorithmic improvements and hardware-software co-design. Efficient training at large scale demands globally aware scheduling and dynamic adjustment mechanisms to optimize resource utilization and reduce bottlenecks.
    Conclusions This paper systematically examines AI computing systems for LLM training, covering algorithms, hardware platforms, software methods, and optimization strategies. It should deepen readers' understanding of the infrastructure of AI computing systems, especially for large-model training, and we hope it will encourage further research on algorithms, hardware, and software.
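The low operational intensity of attention noted in the Results can be illustrated with a back-of-envelope calculation. The sketch below (assumed shapes and fp16 storage; the figures are illustrative, not from the paper) compares the FLOPs-per-byte ratio of the attention score computation QK^T against a square feed-forward GEMM of comparable size:

```python
# Back-of-envelope arithmetic-intensity comparison (illustrative assumptions:
# fp16 storage = 2 bytes/element, each operand read once, output written once,
# no cache reuse modeled).

def gemm_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n]."""
    flops = 2 * m * n * k  # each multiply-accumulate counts as 2 FLOPs
    traffic = bytes_per_elem * (m * k + k * n + m * n)
    return flops / traffic

# Attention scores Q @ K^T for one head: (S x d) times (d x S),
# with a long sequence S and a small head dimension d.
S, d = 4096, 128
attn = gemm_intensity(S, S, d)

# A square feed-forward GEMM of comparable output size.
ffn = gemm_intensity(4096, 4096, 4096)

print(f"attention QK^T intensity: {attn:.1f} FLOPs/byte")  # ≈ 120
print(f"square GEMM intensity:    {ffn:.1f} FLOPs/byte")  # ≈ 1365
```

Because the head dimension d is small relative to the sequence length S, the intensity of QK^T is roughly S·d/(2d+S) FLOPs/byte, about an order of magnitude below the n/3 of a square GEMM; such kernels are memory-bound, which is why the survey points to algorithmic changes and hardware-software co-design to raise utilization.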

     

    Abstract: In this paper, we present a comprehensive overview of artificial intelligence (AI) computing systems for large language models (LLMs) training. The rapid advancement of LLMs in recent years, coupled with the widespread adoption of algorithms and applications such as BERT, ChatGPT, and DeepSeek, has sparked significant interest in this field. We classify LLMs into encoder-only, encoder-decoder, and decoder-only models, and briefly analyze their training and inference processes to emphasize their substantial need for computational resources. These operations depend heavily on AI-specific accelerators such as GPUs (graphics processing units), TPUs (tensor processing units), and MLUs (machine learning units). However, as the gap widens between the increasing complexity of LLMs and the current capabilities of accelerators, it becomes essential to adopt heterogeneous computing systems optimized for distributed environments to manage the growing computational and memory requirements of LLMs. We delve into the execution and scheduling of LLM algorithms, underlining the critical roles of distributed computing strategies, memory-management enhancements, and computational-efficiency improvements. This paper clarifies the complex relationship between algorithm design, hardware infrastructure, and software optimization, provides an in-depth understanding of both the software and hardware infrastructure supporting LLM training, and offers insights into the challenges and potential avenues for future development and deployment.
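The distributed training the abstract refers to can be seen in its simplest form, data parallelism: each worker computes gradients on its own shard of the batch, and an all-reduce averages them so every model replica takes the same optimizer step. The following is a minimal pure-Python simulation of that idea (illustrative only; real systems perform the all-reduce with collectives such as NCCL or MPI across accelerators):

```python
# Minimal data-parallelism sketch: for a linear model with squared loss,
# averaging the per-shard gradients of equal-size shards reproduces the
# full-batch gradient, so replicas stay synchronized after each step.

def grad(w, xs, ys):
    """Gradient of mean squared error 0.5*(w*x - y)^2 over one data shard."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # targets follow y = 2x
w = 0.5                    # current model parameter

# Each "worker" holds half the batch and computes a local gradient.
g_worker0 = grad(w, xs[:2], ys[:2])
g_worker1 = grad(w, xs[2:], ys[2:])

# "All-reduce": average the local gradients across workers.
g_allreduce = (g_worker0 + g_worker1) / 2

# Equivalent single-device full-batch gradient.
g_full = grad(w, xs, ys)
assert abs(g_allreduce - g_full) < 1e-12
```

This equivalence is what lets data parallelism scale the batch across accelerators without changing the optimization trajectory; the communication cost of the all-reduce is one reason the survey emphasizes scheduling and distributed-computing optimizations.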

     

