CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects

September 18, 2025
Authors: Hanyang Guo, Xunjin Zheng, Zihan Liao, Hang Yu, Peng DI, Ziyin Zhang, Hong-Ning Dai
cs.AI

Abstract

Automated code review (CR) is a key application for Large Language Models (LLMs), but progress is hampered by a "reality gap": existing benchmarks evaluate models on isolated sub-tasks using simplified, context-poor data. This fails to reflect the holistic, context-rich nature of real-world CR. To bridge this gap, we introduce CodeFuse-CR-Bench, the first comprehensiveness-aware benchmark for repository-level CR evaluation. CodeFuse-CR-Bench comprises 601 high-quality instances from 70 Python projects covering nine Pull-Request (PR) problem domains, where each instance provides rich, multi-faceted context, including the associated issue, PR details, and repository state, enabling end-to-end evaluation. Beyond superficial metrics, we propose a novel evaluation framework that combines rule-based checks for location and syntax with model-based judgments of review quality. We present the first large-scale assessment of state-of-the-art LLMs on this comprehensive CR task. Our results establish crucial baselines and reveal that (1) no single LLM dominates all aspects of CR; (2) Gemini 2.5 Pro achieves the highest overall performance; and (3) different LLMs exhibit varying robustness to redundant context. These findings highlight the necessity of holistic, multi-dimensional evaluation and provide actionable insights for advancing truly intelligent and practical CR assistants.
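
To make the setup concrete, the sketch below models, in Python (the language the benchmark targets), what a repository-level CR instance and the two-stage evaluation described above could look like. All names here (`CRInstance`, `rule_based_checks`, `model_based_judgment`, `evaluate`) and the field layout are hypothetical illustrations, not the benchmark's actual schema or scoring code.

```python
# Minimal sketch of a CR-Bench-style instance and two-stage evaluation.
# Everything below is an assumed, illustrative design; the paper defines
# the real data schema and scoring procedure.
from dataclasses import dataclass, field

@dataclass
class CRInstance:
    """One benchmark instance: a PR plus its multi-faceted context."""
    repo: str            # e.g. "org/project"
    commit: str          # repository state the PR applies to
    issue_text: str      # associated issue description
    pr_title: str
    pr_diff: str         # unified diff under review
    domain: str          # one of the nine PR problem domains
    gold_comments: list = field(default_factory=list)  # reference review comments

def rule_based_checks(comment: dict, instance: CRInstance) -> bool:
    """Deterministic gate: the comment must target a line that appears in
    the diff, and any suggested Python snippet must at least parse."""
    in_diff = comment["line"] in instance.pr_diff
    snippet = comment.get("suggestion", "")
    try:
        compile(snippet, "<suggestion>", "exec")
        syntactic = True
    except SyntaxError:
        syntactic = False
    return in_diff and syntactic

def model_based_judgment(comment: dict, instance: CRInstance) -> float:
    """Stub for an LLM-as-judge call scoring review quality in [0, 1];
    a real implementation would prompt a judge model with the diff, the
    prediction, and the reference comments."""
    return 1.0 if instance.gold_comments else 0.0

def evaluate(comment: dict, instance: CRInstance) -> float:
    """Rule-based checks filter out malformed reviews; only survivors
    receive a model-based quality score."""
    if not rule_based_checks(comment, instance):
        return 0.0
    return model_based_judgment(comment, instance)

if __name__ == "__main__":
    inst = CRInstance(repo="org/project", commit="abc123",
                      issue_text="compute() may return None",
                      pr_title="Fix parser", pr_diff="+x = compute()\n",
                      domain="bugfix",
                      gold_comments=["Consider handling a None result"])
    pred = {"line": "x = compute()", "suggestion": "x = compute() or 0"}
    print(evaluate(pred, inst))  # 1.0: passes the gate, judged against gold
```

The two-stage split mirrors the abstract's framing: cheap rule-based checks reject location and syntax failures outright, so the more expensive model-based judge is only consulted for well-formed reviews.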