Chen LG, Xiao Z, Xu YJ et al. CodeRankEval: Benchmarking and analyzing LLM performance for code ranking. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY, 40(5): 1220−1233, Sept. 2025. DOI: 10.1007/s11390-025-5514-9

CodeRankEval: Benchmarking and Analyzing LLM Performance for Code Ranking

Large language models (LLMs) are increasingly applied across diverse software engineering tasks. Consequently, their ability to rank code quality effectively is crucial for applications such as selecting optimal solutions and aiding code review. However, evaluation of this essential code ranking capability is hampered by a lack of benchmarks covering diverse paradigms and robustness testing. To address this, we introduce CodeRankEval, a benchmark suite for multi-paradigm evaluation, and CodeRankEval-Perturbed for robustness testing against common code flaws. Our empirical study reveals key insights: pairwise ranking yields the highest accuracy but is costly; listwise ranking is the cheapest and performs comparably to pairwise; pointwise ranking generally exhibits lower performance at intermediate cost. In addition, ranking ability correlates positively with generation ability, and models show reasonable robustness to perturbations but may exhibit positional bias. Overall, this work provides valuable resources and insights for understanding and improving LLM-based code ranking evaluation.
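The abstract contrasts three ranking paradigms that differ mainly in how many LLM calls they need and how the answers are aggregated. The following minimal Python sketch is purely illustrative and is not the paper's actual setup: `query_llm`, the prompt wording, and the score aggregation are all assumptions made here to show the call structure of each paradigm.

```python
from itertools import combinations

def query_llm(prompt: str) -> str:
    # Placeholder for a real LLM call; returns a canned answer so the sketch runs.
    return "1"

def pointwise_rank(snippets):
    # Score each snippet independently (one call per snippet), then sort by score.
    scores = [float(query_llm(f"Rate this code from 1 to 10:\n{s}")) for s in snippets]
    return sorted(range(len(snippets)), key=lambda i: scores[i], reverse=True)

def pairwise_rank(snippets):
    # Compare every pair and count wins: O(n^2) calls, accurate but costly.
    wins = [0] * len(snippets)
    for i, j in combinations(range(len(snippets)), 2):
        answer = query_llm(
            f"Which code is better, 1 or 2?\nCode 1:\n{snippets[i]}\nCode 2:\n{snippets[j]}"
        )
        wins[i if answer.strip() == "1" else j] += 1
    return sorted(range(len(snippets)), key=lambda i: wins[i], reverse=True)

def listwise_rank(snippets):
    # Ask for a complete ordering in a single call: the cheapest paradigm.
    listing = "\n".join(f"[{i}] {s}" for i, s in enumerate(snippets))
    answer = query_llm(f"Rank these snippets from best to worst by index:\n{listing}")
    return [int(tok) for tok in answer.replace(",", " ").split() if tok.isdigit()]

if __name__ == "__main__":
    candidates = ["def add(a, b): return a + b", "def add(a, b): return a - b"]
    print(pointwise_rank(candidates), pairwise_rank(candidates), listwise_rank(candidates))
```

The call counts (n for pointwise, n(n-1)/2 for pairwise, 1 for listwise) are what underlie the accuracy/cost trade-off summarized in the abstract.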
