Agentic Rubrics as Contextual Verifiers for SWE Agents
January 7, 2026
Authors: Mohit Raghavendra, Anisha Gunjal, Bing Liu, Yunzhong He
cs.AI
Abstract
Verification is critical for improving agents: it provides the reward signal for Reinforcement Learning and enables inference-time gains through Test-Time Scaling (TTS). Despite its importance, verification in software engineering (SWE) agent settings often relies on code execution, which can be difficult to scale due to environment setup overhead. Scalable alternatives such as patch classifiers and heuristic methods exist, but they are less grounded in codebase context and harder to interpret. To address this, we explore Agentic Rubrics: an expert agent interacts with the repository to create a context-grounded rubric checklist, and candidate patches are then scored against it without requiring test execution. On SWE-Bench Verified under parallel TTS evaluation, Agentic Rubrics achieve a score of 54.2% with Qwen3-Coder-30B-A3B and 40.6% with Qwen3-32B, at least a +3.5 percentage-point gain over the strongest baseline in our comparison set. We further analyze rubric behavior, showing that rubric scores are consistent with ground-truth tests while also flagging issues that tests do not capture. Our ablations show that agentic context gathering is essential for producing codebase-specific, unambiguous criteria. Together, these results suggest that Agentic Rubrics provide an efficient, scalable, and granular verification signal for SWE agents.
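To make the verification loop concrete, the sketch below shows one way rubric-based best-of-n patch selection under parallel TTS could be wired up. All names here (`Criterion`, `Rubric`, `judge_criterion`, `select_best_of_n`) are illustrative assumptions rather than the paper's actual API: the rubric checklist is assumed to have been produced beforehand by the expert agent exploring the repository, and each criterion is assumed to be judged by a model from the patch diff and repo context, with no test execution.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    description: str  # a codebase-specific, unambiguous check (hypothetical)
    weight: float     # relative importance of this check


@dataclass
class Rubric:
    criteria: list[Criterion]


def judge_criterion(criterion: Criterion, patch_diff: str) -> bool:
    """Placeholder for an LLM judge: given the criterion text, the candidate
    patch diff, and relevant repository context, decide whether the criterion
    is satisfied. No tests are run at any point."""
    raise NotImplementedError


def score_patch(rubric: Rubric, patch_diff: str) -> float:
    """Weighted fraction of rubric criteria that the patch satisfies."""
    total = sum(c.weight for c in rubric.criteria)
    met = sum(c.weight for c in rubric.criteria
              if judge_criterion(c, patch_diff))
    return met / total if total else 0.0


def select_best_of_n(rubric: Rubric, candidate_patches: list[str]) -> str:
    """Parallel TTS: n candidate patches are sampled independently, and the
    one with the highest rubric score is selected as the final answer."""
    return max(candidate_patches, key=lambda p: score_patch(rubric, p))
```

Because scoring reduces to per-criterion judgments over a fixed checklist, the signal is both granular (each failed criterion names a concrete issue) and cheap to scale across candidates, which is what distinguishes this setup from execution-based verification.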