CodeRankEval: Benchmarking and Analyzing LLM Performance for Code Ranking
Li-Guo Chen, Zheng Xiao, Yi-Jiang Xu, Rui-Chuan An, Xin Wang, Yang-Ning Li, Ying-Hui Li, Yi-Dong Wang, Zheng-Ran Zeng, Qing Gao, Shi-Kun Zhang
Abstract
Large language models (LLMs) are increasingly applied across diverse software engineering tasks. Consequently, their ability to rank code by quality is crucial for applications such as selecting optimal solutions and aiding code review. However, evaluating this essential code-ranking capability is hampered by a lack of benchmarks covering diverse ranking paradigms and robustness testing. To address this, we introduce CodeRankEval, a benchmark suite for multi-paradigm evaluation, and CodeRankEval-Perturbed for robustness testing against common code flaws. Our empirical study reveals key insights: pairwise ranking yields the highest accuracy but is costly; listwise ranking is the cheapest and performs comparably to pairwise; and pointwise ranking generally exhibits lower performance at intermediate cost. In addition, ranking ability correlates positively with generation ability, and models show reasonable robustness to perturbations but can exhibit positional bias. Overall, this work provides valuable resources and insights for understanding and improving LLM-based code ranking evaluation.
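To make the paradigm comparison in the abstract concrete, the following minimal sketch shows how pointwise, pairwise, and listwise ranking prompts might be structured; the `query_llm` function, prompt wording, and scoring logic are hypothetical placeholders, not the prompts used in CodeRankEval.

```python
# Hypothetical sketch of the three ranking paradigms compared in this work.
# `query_llm` stands in for any chat-completion call; prompts are illustrative only.
from itertools import combinations
from typing import Callable, List


def pointwise_rank(snippets: List[str], query_llm: Callable[[str], str]) -> List[int]:
    """Score each candidate independently, then sort by score (n LLM calls)."""
    scores = []
    for code in snippets:
        reply = query_llm(f"Rate the quality of this code from 1 to 10:\n{code}")
        scores.append(float(reply.strip()))
    return sorted(range(len(snippets)), key=lambda i: scores[i], reverse=True)


def pairwise_rank(snippets: List[str], query_llm: Callable[[str], str]) -> List[int]:
    """Compare every pair of candidates; rank by win count (O(n^2) LLM calls)."""
    wins = [0] * len(snippets)
    for i, j in combinations(range(len(snippets)), 2):
        reply = query_llm(
            "Which code is better, A or B? Answer 'A' or 'B'.\n"
            f"A:\n{snippets[i]}\nB:\n{snippets[j]}"
        )
        wins[i if reply.strip().upper().startswith("A") else j] += 1
    return sorted(range(len(snippets)), key=lambda i: wins[i], reverse=True)


def listwise_rank(snippets: List[str], query_llm: Callable[[str], str]) -> List[int]:
    """Show all candidates at once and ask for a full ordering (a single LLM call)."""
    listing = "\n".join(f"[{i}]\n{code}" for i, code in enumerate(snippets))
    reply = query_llm(
        f"Order these code snippets from best to worst, returning the indices:\n{listing}"
    )
    return [int(tok) for tok in reply.replace(",", " ").split() if tok.isdigit()]
```

The call counts in this sketch reflect the cost ordering reported above: pairwise comparisons grow quadratically with the number of candidates, pointwise scoring grows linearly, and listwise ranking uses a single prompt.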