AACR-Bench: Evaluating Automatic Code Review with Holistic Repository-Level Context
January 27, 2026
Authors: Lei Zhang, Yongda Yu, Minghui Yu, Xinxin Guo, Zhengqi Zhuang, Guoping Rong, Dong Shao, Haifeng Shen, Hongyu Kuang, Zhengfeng Li, Boge Wang, Guoan Zhang, Bangyu Xiang, Xiaobin Xu
cs.AI
Abstract
High-quality evaluation benchmarks are pivotal for deploying Large Language Models (LLMs) in Automated Code Review (ACR). However, existing benchmarks suffer from two critical limitations: first, the lack of multi-language support in repository-level contexts, which restricts the generalizability of evaluation results; second, the reliance on noisy, incomplete ground truth derived from raw Pull Request (PR) comments, which constrains the scope of issue detection. To address these challenges, we introduce AACR-Bench, a comprehensive benchmark that provides full cross-file context across multiple programming languages. Unlike traditional datasets, AACR-Bench employs an "AI-assisted, Expert-verified" annotation pipeline to uncover latent defects often overlooked in original PRs, resulting in a 285% increase in defect coverage. Extensive evaluations of mainstream LLMs on AACR-Bench reveal that previous assessments may have misjudged, or only partially captured, model capabilities due to data limitations. Our work establishes a more rigorous standard for ACR evaluation and offers new insights into LLM-based ACR: the granularity/level of context and the choice of retrieval method significantly impact ACR performance, and this influence varies with the LLM, the programming language, and the LLM usage paradigm (e.g., whether an agent architecture is employed). The code, data, and other artifacts of our evaluation set are available at https://github.com/alibaba/aacr-bench.
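To make the abstract's point about context granularity concrete, the sketch below assembles review prompts for a diff hunk at three increasingly broad granularities (hunk-only, whole file, crude repository-wide retrieval). This is a minimal illustration under stated assumptions, not AACR-Bench's actual tooling: the names `DiffHunk`, `retrieve_context`, and `review_prompt`, and the string-matching retrieval heuristic, are hypothetical.

```python
# Minimal sketch (not the AACR-Bench API): how the granularity of
# repository-level context supplied to an LLM reviewer might vary.
from dataclasses import dataclass
from pathlib import Path


@dataclass
class DiffHunk:
    file: str        # path of the changed file within the repository
    start_line: int  # first line of the changed region
    end_line: int    # last line of the changed region
    patch: str       # unified-diff text of the change


def retrieve_context(repo_root: Path, hunk: DiffHunk, granularity: str = "file") -> str:
    """Return repository context for a hunk at the requested granularity.

    "hunk" -> only the changed lines themselves (no extra context)
    "file" -> the full content of the changed file
    "repo" -> every same-suffix file that textually mentions the changed
              file's stem; a crude stand-in for real cross-file retrieval.
    Any other value falls back to the repo-wide behavior.
    """
    changed = repo_root / hunk.file
    if granularity == "hunk":
        return hunk.patch
    if granularity == "file":
        return changed.read_text(encoding="utf-8", errors="ignore")
    related = []
    for path in repo_root.rglob("*"):
        if path.is_file() and path.suffix == changed.suffix:
            text = path.read_text(encoding="utf-8", errors="ignore")
            if changed.stem in text:
                related.append(f"### {path.relative_to(repo_root)}\n{text}")
    return "\n\n".join(related)


def review_prompt(hunk: DiffHunk, context: str) -> str:
    """Assemble a simple review prompt from the diff and the retrieved context."""
    return (
        "You are a code reviewer. Identify defects in the following change.\n\n"
        f"Repository context:\n{context}\n\n"
        f"Diff under review:\n{hunk.patch}\n"
    )
```

In a real evaluation, the repo-wide branch would be replaced by a proper retrieval method (e.g., dependency analysis or embedding search), which is exactly the design dimension the benchmark's findings suggest matters per model, language, and usage paradigm.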