Citation: Zhen-Xing Zhang, Yuan-Bo Wen, Han-Qi Lv, Chang Liu, Rui Zhang, Xia-Qing Li, Chao Wang, Zi-Dong Du, Qi Guo, Ling Li, Xue-Hai Zhou, Yun-Ji Chen. AI computing systems for LLMs training: a review[J]. Journal of Computer Science and Technology. DOI: 10.1007/s11390-024-4178-1

AI computing systems for LLMs training: a review

In this paper, we present a comprehensive overview of Artificial Intelligence (AI) computing systems for Large Language Models (LLMs) training. The rapid advancement of LLMs in recent years, coupled with the widespread adoption of algorithms and applications such as BERT, the GPT series, and ChatGPT, has sparked significant interest in this field. We classify LLMs into encoder-only, encoder-decoder, and decoder-only frameworks, briefly analyzing their training and inference processes to emphasize their substantial need for computational resources. These operations depend heavily on AI-specific accelerators such as GPUs, TPUs, and MLUs. However, the gap between the increasing complexity of LLMs and the current capability of accelerators is widening, making distributed-friendly heterogeneous computing systems essential for managing the growing computational and memory requirements of LLMs. We delve into the execution and scheduling of LLM algorithms, underlining the critical role of distributed computing strategies, memory management enhancements, and improvements in computational efficiency. This overview clarifies the complex relationship between algorithm design, hardware infrastructure, and software optimization, and provides an in-depth understanding of both the software and hardware infrastructure supporting LLMs, offering insights into the challenges and potential avenues for future development and deployment. We hope this review helps more readers understand how AI computing systems actually operate in LLM training.
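As a minimal, illustrative sketch (not taken from the paper), the snippet below shows the kind of data-parallel training loop on which the distributed computing strategies surveyed in the review build. The model, data, and hyperparameters are hypothetical placeholders, and PyTorch's DistributedDataParallel is used only as one common example of such a system.

```python
# Hypothetical data-parallel training sketch; launch with, e.g.:
#   torchrun --nproc_per_node=4 train_dp.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # One process per GPU; torchrun supplies the rendezvous environment variables.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    device = torch.device("cuda", rank % torch.cuda.device_count())

    # Placeholder "LLM": a single linear layer standing in for a transformer stack.
    model = torch.nn.Linear(1024, 1024).to(device)
    model = DDP(model, device_ids=[device.index])  # gradients are all-reduced across ranks
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):  # toy loop on random data in place of a real corpus
        x = torch.randn(8, 1024, device=device)
        loss = model(x).pow(2).mean()
        loss.backward()          # DDP overlaps gradient all-reduce with the backward pass
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Real LLM training systems extend this basic data-parallel pattern with the tensor, pipeline, and memory-partitioning techniques discussed in the paper.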
