Agentic Rubrics as Contextual Verifiers for SWE Agents

January 7, 2026
Authors: Mohit Raghavendra, Anisha Gunjal, Bing Liu, Yunzhong He
cs.AI

Abstract

Verification is critical for improving agents: it provides the reward signal for Reinforcement Learning and enables inference-time gains through Test-Time Scaling (TTS). Despite its importance, verification in software engineering (SWE) agent settings often relies on code execution, which can be difficult to scale due to environment setup overhead. Scalable alternatives such as patch classifiers and heuristic methods exist, but they are less grounded in codebase context and harder to interpret. To this end, we explore Agentic Rubrics: an expert agent interacts with the repository to create a context-grounded rubric checklist, and candidate patches are then scored against it without requiring test execution. On SWE-Bench Verified under parallel TTS evaluation, Agentic Rubrics achieve a score of 54.2% with Qwen3-Coder-30B-A3B and 40.6% with Qwen3-32B, with at least a +3.5 percentage-point gain over the strongest baseline in our comparison set. We further analyze rubric behavior, showing that rubric scores are consistent with ground-truth tests while also flagging issues that tests do not capture. Our ablations show that agentic context gathering is essential for producing codebase-specific, unambiguous criteria. Together, these results suggest that Agentic Rubrics provide an efficient, scalable, and granular verification signal for SWE agents.
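
To make the verification loop concrete, below is a minimal Python sketch of the two scalable pieces the abstract describes: scoring a candidate patch against a rubric checklist without executing tests, and using those scores for parallel test-time selection (best-of-n). All names here (`RubricItem`, `score_patch`, `select_best_patch`, and the `judge` callable standing in for an LLM criterion check) are illustrative assumptions, not the paper's actual interface, and the sketch omits the expert agent that first explores the repository to write the rubric.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RubricItem:
    """One checklist criterion produced by the rubric-writer agent, e.g.
    'The patch changes the default in config.parse_timeout, not only its
    docstring.' (hypothetical example)"""
    criterion: str
    weight: float = 1.0


def score_patch(
    patch: str,
    rubric: list[RubricItem],
    judge: Callable[[str, str], bool],
) -> float:
    """Score a candidate patch against the rubric without running tests.

    `judge(criterion, patch)` is a stand-in for one LLM query that decides
    whether the patch satisfies a single criterion. Returns the weighted
    fraction of satisfied criteria, in [0, 1].
    """
    total = sum(item.weight for item in rubric)
    if total == 0:
        return 0.0
    passed = sum(item.weight for item in rubric if judge(item.criterion, patch))
    return passed / total


def select_best_patch(
    patches: list[str],
    rubric: list[RubricItem],
    judge: Callable[[str, str], bool],
) -> str:
    """Parallel test-time scaling: given n independently sampled candidate
    patches, keep the one with the highest rubric score."""
    return max(patches, key=lambda p: score_patch(p, rubric, judge))


if __name__ == "__main__":
    # Toy demo with a string-matching judge; a real system would ask an LLM.
    rubric = [
        RubricItem("mentions parse_timeout", weight=2.0),
        RubricItem("adds a regression guard"),
    ]
    patches = ["fix parse_timeout default", "unrelated refactor"]

    def judge(criterion: str, patch: str) -> bool:
        # Trivial stand-in: check whether the criterion's key token appears.
        return criterion.split()[-1] in patch

    print(select_best_patch(patches, rubric, judge))
```

Keeping the judge as a plain callable leaves the sketch backend-agnostic; the appeal of the design is that each criterion check is a single cheap model query over a rubric item and the candidate diff, rather than an environment build plus test run.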