Liang H, Wong ZH, Liu RT et al. Data preparation for large language models. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY, 2026. DOI: 10.1007/s11390-026-5948-8

Data Preparation for Large Language Models

Large language models (LLMs) have demonstrated remarkable generalization capabilities across diverse domains, largely attributed to the availability of massive amounts of high-quality training data. Recently, the development paradigm of LLMs has been shifting from a model-centric to a data-centric perspective. In this paper, we provide a comprehensive survey of data preparation algorithms and workflows for LLMs, categorized into three stages: pre-training, continual pre-training, and post-training. We further summarize widely used datasets along with their associated data preparation methods, offering a practical reference for researchers who may lack extensive experience in the field of data preparation. Finally, we outline potential directions for future work, highlighting open challenges and opportunities in advancing data preparation for LLMs.
