Citation: Zhang ZX, Wen YB, Lyu HQ et al. AI computing systems for large language models training. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY, 40(1): 6–41, Jan. 2025. DOI: 10.1007/s11390-024-4178-1
In this paper, we present a comprehensive overview of artificial intelligence (AI) computing systems for training large language models (LLMs). The rapid advancement of LLMs in recent years, together with the widespread adoption of models and applications such as BERT, ChatGPT, and DeepSeek, has sparked significant interest in this field. We classify LLMs into encoder-only, encoder-decoder, and decoder-only models, and briefly analyze their training and inference processes to highlight their substantial demand for computational resources. These workloads depend heavily on AI-specific accelerators such as GPUs (graphics processing units), TPUs (tensor processing units), and MLUs (machine learning units). However, as the gap widens between the growing complexity of LLMs and the capabilities of current accelerators, it becomes essential to adopt heterogeneous computing systems optimized for distributed environments to meet the increasing computational and memory requirements of LLMs. We delve into the execution and scheduling of LLM algorithms, underlining the critical roles of distributed computing strategies, memory management enhancements, and improvements in computational efficiency. This paper clarifies the complex relationship among algorithm design, hardware infrastructure, and software optimization, and provides an in-depth understanding of both the software and hardware infrastructure supporting LLM training, offering insights into the challenges and potential avenues for future development and deployment.
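To make the memory pressure described above concrete, the sketch below gives a rough back-of-envelope estimate (an illustration, not a result from the paper) of the device memory needed just to hold model states under mixed-precision Adam training, using the commonly cited accounting of about 16 bytes per parameter (fp16 weights and gradients plus fp32 master weights and Adam moments); the 80 GiB per-accelerator capacity and the example model sizes are illustrative assumptions.

```python
# Back-of-envelope estimate of LLM training memory (model states only),
# assuming mixed-precision Adam: fp16 weights (2 B) + fp16 gradients (2 B)
# + fp32 master weights, momentum, and variance (4 B each) = 16 B/param.
# Activation memory is ignored here; it adds a further batch-dependent term.

def model_state_bytes(num_params: float, bytes_per_param: int = 16) -> float:
    """Bytes needed to hold weights, gradients, and optimizer states."""
    return num_params * bytes_per_param

def min_accelerators(num_params: float, device_mem_gib: float = 80.0) -> int:
    """Minimum number of devices needed just to hold the model states."""
    total = int(model_state_bytes(num_params))
    per_device = int(device_mem_gib * 2**30)
    return -(-total // per_device)  # ceiling division

if __name__ == "__main__":
    for name, params in [("1.3B", 1.3e9), ("13B", 13e9), ("175B", 175e9)]:
        gib = model_state_bytes(params) / 2**30
        print(f"{name:>5} params: ~{gib:,.0f} GiB of model states, "
              f">= {min_accelerators(params)} x 80 GiB accelerators")
```

Even before activations are counted, a GPT-3-scale model under these assumptions needs terabytes of state and dozens of accelerators, which is why the distributed parallelism and memory-management techniques surveyed in the paper are indispensable for LLM training.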