A Survey of Quantization in LLM: Unlocking Potential Hardware Efficiency

陈逸东; 郑楷頵; 郭振华; 张齐颢; 张永华; 翟季冬

doi:10.1007/s11390-026-5979-1

Abstract

Large language models (LLMs) have achieved remarkable progress in natural language processing, but their enormous parameter counts incur substantial computational and storage overhead, limiting their deployment and application in resource-constrained environments. Model quantization, an efficient model compression technique, significantly reduces the memory footprint and computational requirements of LLMs by lowering the numerical precision of model parameters and/or activations, while striving to keep performance loss minimal. This survey comprehensively reviews recent advances in LLM quantization, covering techniques from the pretraining phase to the inference phase. We examine quantization during pretraining, including the use of low-precision formats such as FP8 and FP4; the two paradigms of the quantized fine-tuning phase, post-training quantization (PTQ) and quantization-aware training (QAT), together with specific methods (such as QLoRA, GPTQ, AWQ, and Loft-Q); and quantization methods for inference, including weight-only quantization, weight-activation quantization, mixed-precision quantization for handling outliers, and KV cache quantization. Through in-depth analysis of the principles, advantages, and challenges of these methods and of their application scenarios in LLMs, this survey seeks to give researchers and engineers a systematic understanding of LLM quantization techniques and to help identify future research directions. It also discusses the key problem of how to generate high-performance low-precision compute kernels for different chip architectures. Although quantization has achieved major successes and has advanced the deployment of large models in resource-constrained environments, it still faces many challenges, such as accuracy loss under extremely low-bit quantization, the generality of outlier handling, and hardware support and software-stack optimization. Future research will focus on developing finer-grained, adaptive quantization strategies, strengthening hardware-software co-design, advancing differentiable and automated quantization, and exploring effective combinations of quantization with other model compression techniques. As these challenges are progressively resolved, quantization will play an increasingly critical role in the broad adoption and application of LLMs, making the capabilities of artificial intelligence more accessible.
