CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects
September 18, 2025
Authors: Hanyang Guo, Xunjin Zheng, Zihan Liao, Hang Yu, Peng DI, Ziyin Zhang, Hong-Ning Dai
cs.AI
Abstract
Automated code review (CR) is a key application for Large Language Models (LLMs), but progress is hampered by a "reality gap": existing benchmarks evaluate models on isolated sub-tasks using simplified, context-poor data. This fails to reflect the holistic, context-rich nature of real-world CR. To bridge this gap, we introduce CodeFuse-CR-Bench, the first comprehensiveness-aware benchmark for repository-level CR evaluation. CodeFuse-CR-Bench comprises 601 high-quality instances from 70 Python projects covering nine Pull-Request (PR) problem domains, where each instance provides rich, multi-faceted context, including the associated issue, PR details, and repository state, enabling end-to-end evaluation. Beyond superficial metrics, we also propose a novel evaluation framework that combines rule-based checks for location and syntax with model-based judgments of review quality. We present the first large-scale assessment of state-of-the-art LLMs on this comprehensive CR task. Our results establish crucial baselines and reveal that (1) no single LLM dominates all aspects of CR; (2) Gemini 2.5 Pro achieves the highest comprehensive performance; and (3) different LLMs exhibit varying robustness to redundant context. These findings highlight the necessity of holistic, multi-dimensional evaluation and provide actionable insights for advancing truly intelligent and practical CR assistants.
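The abstract does not specify how the rule-based and model-based components are combined. Purely as a hedged illustration of one plausible design, the sketch below gates an LLM-judge quality score behind two rule checks: the review comment must target a changed line, and any suggested replacement code must parse as valid Python. All names here (`rule_based_checks`, `combined_score`, the `judge_score` input) are assumptions for illustration, not the paper's actual framework.

```python
import ast


def rule_based_checks(review_comment: dict, changed_lines: set) -> dict:
    """Rule-based portion (illustrative): check that the comment points at a
    line actually modified by the PR, and that any suggested replacement
    snippet is syntactically valid Python."""
    location_ok = review_comment["line"] in changed_lines

    suggestion = review_comment.get("suggestion")
    if suggestion is None:
        # No code suggestion attached: nothing to syntax-check.
        syntax_ok = True
    else:
        try:
            ast.parse(suggestion)
            syntax_ok = True
        except SyntaxError:
            syntax_ok = False

    return {"location_ok": location_ok, "syntax_ok": syntax_ok}


def combined_score(review_comment: dict, changed_lines: set,
                   judge_score: float) -> float:
    """Combine rules with a model-based judgment (here a precomputed
    judge_score in [0, 1]): a comment failing either rule scores zero,
    otherwise the judge's quality score is returned unchanged."""
    checks = rule_based_checks(review_comment, changed_lines)
    if not (checks["location_ok"] and checks["syntax_ok"]):
        return 0.0
    return judge_score
```

A comment on line 10 of the diff with a parseable suggestion keeps its judge score, while one aimed at an untouched line, or carrying unparseable code, is zeroed out regardless of how fluent it reads. This is one way a benchmark could keep "superficially plausible" reviews from scoring well.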