

SWE-RM: Execution-free Feedback For Software Engineering Agents

December 26, 2025
Authors: KaShun Shum, Binyuan Hui, Jiawei Chen, Lei Zhang, X. W., Jiaxi Yang, Yuzhen Huang, Junyang Lin, Junxian He
cs.AI

Abstract

Execution-based feedback like unit testing is widely used in the development of coding agents through test-time scaling (TTS) and reinforcement learning (RL). This paradigm requires scalable and reliable collection of unit test cases to provide accurate feedback, yet the resulting feedback is often sparse and cannot effectively distinguish between trajectories that both succeed or both fail. In contrast, execution-free feedback from reward models can provide more fine-grained signals without depending on unit test cases. Despite this potential, execution-free feedback for realistic software engineering (SWE) agents remains underexplored. While aiming to develop versatile reward models that are effective across both TTS and RL, we observe that two verifiers with nearly identical TTS performance can nevertheless yield very different results in RL. Intuitively, TTS primarily reflects the model's ability to select the best trajectory, but this ability does not necessarily generalize to RL. To address this limitation, we identify two additional aspects that are crucial for RL training: classification accuracy and calibration. We then conduct comprehensive controlled experiments to investigate how to train a robust reward model that performs well across these metrics. In particular, we analyze the impact of various factors such as training data scale, policy mixtures, and data source composition. Guided by these investigations, we introduce SWE-RM, an accurate and robust reward model adopting a mixture-of-experts architecture with 30B total parameters and 3B activated during inference. SWE-RM substantially improves SWE agents on both TTS and RL performance. For example, it increases the accuracy of Qwen3-Coder-Flash from 51.6% to 62.0%, and that of Qwen3-Coder-Max from 67.0% to 74.6%, on SWE-Bench Verified using TTS, achieving new state-of-the-art performance among open-source models.
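To make the three evaluation axes concrete, here is a minimal Python sketch (illustrative only, not code from the paper): given one scalar verifier score per candidate trajectory, it performs best-of-N selection as used in test-time scaling, and computes the two RL-relevant quantities the abstract highlights, thresholded classification accuracy and expected calibration error (ECE), against unit-test outcomes. The function names, the 0.5 decision threshold, and the assumption that scores lie in [0, 1] are our own choices for illustration.

    def best_of_n(trajectories, scores):
        """Test-time scaling: return the trajectory with the highest reward-model score."""
        best_idx = max(range(len(scores)), key=lambda i: scores[i])
        return trajectories[best_idx]

    def classification_accuracy(scores, labels, threshold=0.5):
        """Fraction of trajectories whose thresholded score matches the resolved/unresolved label."""
        preds = [1 if s >= threshold else 0 for s in scores]
        return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)

    def expected_calibration_error(scores, labels, n_bins=10):
        """ECE: bin scores, then average the gap between mean score and empirical success rate."""
        bins = [[] for _ in range(n_bins)]
        for s, y in zip(scores, labels):
            idx = min(int(s * n_bins), n_bins - 1)  # assumes scores in [0, 1]
            bins[idx].append((s, y))
        total = len(scores)
        ece = 0.0
        for b in bins:
            if not b:
                continue
            avg_conf = sum(s for s, _ in b) / len(b)
            avg_acc = sum(y for _, y in b) / len(b)
            ece += (len(b) / total) * abs(avg_conf - avg_acc)
        return ece

    # Toy usage with made-up scores and unit-test labels (1 = patch resolved the issue).
    trajs = ["patch_a", "patch_b", "patch_c", "patch_d"]
    scores = [0.91, 0.35, 0.72, 0.08]
    labels = [1, 0, 1, 1]
    print(best_of_n(trajs, scores))                         # -> "patch_a"
    print(classification_accuracy(scores, labels))          # -> 0.75
    print(round(expected_calibration_error(scores, labels), 3))  # -> ~0.41

In this framing, TTS quality depends only on the ranking induced by best_of_n, whereas RL training also cares about how well the raw scores track the true resolution rate, which is what the accuracy and ECE measurements capture; this is one way to see why two verifiers can tie on TTS yet diverge in RL.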