SWE-RM: Execution-free Feedback For Software Engineering Agents
December 26, 2025
Authors: KaShun Shum, Binyuan Hui, Jiawei Chen, Lei Zhang, X. W., Jiaxi Yang, Yuzhen Huang, Junyang Lin, Junxian He
cs.AI
Abstract
Execution-based feedback such as unit testing is widely used in the development of coding agents through test-time scaling (TTS) and reinforcement learning (RL). This paradigm requires scalable and reliable collection of unit test cases to provide accurate feedback, yet the resulting feedback is often sparse and cannot effectively distinguish among trajectories that all succeed or all fail. In contrast, execution-free feedback from reward models can provide more fine-grained signals without depending on unit test cases. Despite this potential, execution-free feedback for realistic software engineering (SWE) agents remains underexplored. While aiming to develop versatile reward models that are effective across both TTS and RL, we observe that two verifiers with nearly identical TTS performance can nevertheless yield very different results in RL. Intuitively, TTS primarily reflects a model's ability to select the best trajectory, and this ability does not necessarily generalize to RL. To address this limitation, we identify two additional aspects that are crucial for RL training: classification accuracy and calibration. We then conduct comprehensive controlled experiments to investigate how to train a robust reward model that performs well across these metrics. In particular, we analyze the impact of factors such as training data scale, policy mixtures, and data source composition. Guided by these investigations, we introduce SWE-RM, an accurate and robust reward model that adopts a mixture-of-experts architecture with 30B total parameters and 3B activated during inference. SWE-RM substantially improves SWE agents in both TTS and RL. For example, with TTS on SWE-Bench Verified it raises the accuracy of Qwen3-Coder-Flash from 51.6% to 62.0% and that of Qwen3-Coder-Max from 67.0% to 74.6%, achieving new state-of-the-art performance among open-source models.
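As a rough illustration of the setting described above, the Python sketch below shows best-of-n trajectory selection driven by a scalar reward model (the TTS use case) and an expected-calibration-error (ECE) computation of the kind commonly used to assess whether a reward model's scores are reliable as probabilities. The function names and the `reward_fn` interface are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def best_of_n(trajectories, reward_fn):
    """Test-time scaling as best-of-n selection: return the candidate
    trajectory the reward model scores highest, with no test execution."""
    scores = [reward_fn(t) for t in trajectories]
    return trajectories[int(np.argmax(scores))]

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin predicted success probabilities and average the gap
    between mean confidence and empirical success rate per bin,
    weighted by the fraction of samples falling in each bin."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Include the right edge only for the final bin.
        mask = (probs >= lo) & ((probs <= hi) if hi == 1.0 else (probs < hi))
        if mask.any():
            ece += mask.mean() * abs(probs[mask].mean() - labels[mask].mean())
    return ece

# Hypothetical usage: rm_score stands in for the reward model's predicted
# probability that a trajectory resolves the issue.
# best = best_of_n(candidate_trajectories, rm_score)
# ece  = expected_calibration_error(predicted_probs, resolved_labels)
```

In this framing, classification accuracy captures whether the model's scores separate resolved from unresolved trajectories, while calibration (e.g., ECE) captures whether those scores can be trusted as probabilities when used directly as RL rewards.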