Citation: Zhang ZX, Wen YB, Lyu HQ et al. AI computing systems for large language models training. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY, 40(1): 6–41, Jan. 2025. DOI: 10.1007/s11390-024-4178-1
In this paper, we present a comprehensive overview of artificial intelligence (AI) computing systems for training large language models (LLMs). The rapid advancement of LLMs in recent years, together with the widespread adoption of models and applications such as BERT, ChatGPT, and DeepSeek, has sparked significant interest in this field. We classify LLMs into encoder-only, encoder-decoder, and decoder-only models, and briefly analyze their training and inference processes to highlight their substantial demand for computational resources. These workloads depend heavily on AI-specific accelerators such as GPUs (graphics processing units), TPUs (tensor processing units), and MLUs (machine learning units). However, as the gap widens between the growing complexity of LLMs and the capabilities of current accelerators, it becomes essential to adopt heterogeneous computing systems optimized for distributed environments to meet the increasing computational and memory requirements of LLMs. We delve into the execution and scheduling of LLM algorithms, underlining the critical roles of distributed computing strategies, memory management enhancements, and improvements in computational efficiency. This paper clarifies the complex relationship among algorithm design, hardware infrastructure, and software optimization, and provides an in-depth understanding of both the software and hardware infrastructure supporting LLM training, offering insights into the challenges and potential avenues for future development and deployment.
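To make the memory pressure described above concrete, the sketch below gives a rough back-of-envelope estimate (an illustration, not a result from the paper) of the device memory needed just to hold model states under mixed-precision Adam training, using the commonly cited accounting of about 16 bytes per parameter (fp16 weights and gradients plus fp32 master weights and Adam moments); the 80 GiB per-accelerator capacity and the example model sizes are illustrative assumptions.

```python
# Back-of-envelope estimate of LLM training memory (model states only),
# assuming mixed-precision Adam: fp16 weights (2 B) + fp16 gradients (2 B)
# + fp32 master weights, momentum, and variance (4 B each) = 16 B/param.
# Activation memory is ignored here; it adds a further batch-dependent term.

def model_state_bytes(num_params: float, bytes_per_param: int = 16) -> float:
    """Bytes needed to hold weights, gradients, and optimizer states."""
    return num_params * bytes_per_param

def min_accelerators(num_params: float, device_mem_gib: float = 80.0) -> int:
    """Minimum number of devices needed just to hold the model states."""
    total = int(model_state_bytes(num_params))
    per_device = int(device_mem_gib * 2**30)
    return -(-total // per_device)  # ceiling division

if __name__ == "__main__":
    for name, params in [("1.3B", 1.3e9), ("13B", 13e9), ("175B", 175e9)]:
        gib = model_state_bytes(params) / 2**30
        print(f"{name:>5} params: ~{gib:,.0f} GiB of model states, "
              f">= {min_accelerators(params)} x 80 GiB accelerators")
```

Even before activations are counted, a GPT-3-scale model under these assumptions needs terabytes of state and dozens of accelerators, which is why the distributed parallelism and memory-management techniques surveyed in the paper are indispensable for LLM training.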