Chen LG, Xiao Z, Xu YJ et al. CodeRankEval: Benchmarking and analyzing LLM performance for code ranking. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY, 40(5): 1220−1233, Sept. 2025. DOI: 10.1007/s11390-025-5514-9

CodeRankEval: Benchmarking and Analyzing LLM Performance for Code Ranking

Large language models (LLMs) are increasingly applied across diverse software engineering tasks. Consequently, their ability to rank code quality effectively is crucial for applications such as selecting optimal solutions and aiding code review. However, evaluation of this essential code ranking capability is hampered by a lack of benchmarks covering diverse paradigms and robustness testing. To address this, we introduce CodeRankEval, a benchmark suite for multi-paradigm evaluation, and CodeRankEval-Perturbed for robustness testing against common code flaws. Our empirical study reveals key insights: pairwise ranking yields the highest accuracy but is costly; listwise ranking is the cheapest and performs comparably to pairwise; pointwise ranking generally exhibits lower performance at intermediate cost. In addition, ranking ability correlates positively with generation ability, and models show reasonable robustness to perturbations but may exhibit positional bias. Overall, this work provides valuable resources and insights for understanding and improving LLM-based code ranking evaluation.
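The abstract contrasts three ranking paradigms that differ mainly in how many LLM calls they need and how the answers are aggregated. The following minimal Python sketch is purely illustrative and is not the paper's actual setup: `query_llm`, the prompt wording, and the score aggregation are all assumptions made here to show the call structure of each paradigm.

```python
from itertools import combinations

def query_llm(prompt: str) -> str:
    # Placeholder for a real LLM call; returns a canned answer so the sketch runs.
    return "1"

def pointwise_rank(snippets):
    # Score each snippet independently (one call per snippet), then sort by score.
    scores = [float(query_llm(f"Rate this code from 1 to 10:\n{s}")) for s in snippets]
    return sorted(range(len(snippets)), key=lambda i: scores[i], reverse=True)

def pairwise_rank(snippets):
    # Compare every pair and count wins: O(n^2) calls, accurate but costly.
    wins = [0] * len(snippets)
    for i, j in combinations(range(len(snippets)), 2):
        answer = query_llm(
            f"Which code is better, 1 or 2?\nCode 1:\n{snippets[i]}\nCode 2:\n{snippets[j]}"
        )
        wins[i if answer.strip() == "1" else j] += 1
    return sorted(range(len(snippets)), key=lambda i: wins[i], reverse=True)

def listwise_rank(snippets):
    # Ask for a complete ordering in a single call: the cheapest paradigm.
    listing = "\n".join(f"[{i}] {s}" for i, s in enumerate(snippets))
    answer = query_llm(f"Rank these snippets from best to worst by index:\n{listing}")
    return [int(tok) for tok in answer.replace(",", " ").split() if tok.isdigit()]

if __name__ == "__main__":
    candidates = ["def add(a, b): return a + b", "def add(a, b): return a - b"]
    print(pointwise_rank(candidates), pairwise_rank(candidates), listwise_rank(candidates))
```

The call counts (n for pointwise, n(n-1)/2 for pairwise, 1 for listwise) are what underlie the accuracy/cost trade-off summarized in the abstract.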
