SWE-RM: Execution-free Feedback For Software Engineering Agents

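Abstract

Execution-based feedback such as unit testing is widely used to develop coding agents through test-time scaling (TTS) and reinforcement learning (RL). This paradigm requires scalable and reliable collection of unit test cases to provide accurate feedback, and the resulting feedback is often sparse: it cannot effectively distinguish between trajectories that are both successful or both unsuccessful. In contrast, execution-free feedback from reward models can provide finer-grained signals without depending on unit test cases. Despite this potential, execution-free feedback for realistic software engineering (SWE) agents remains underexplored. While aiming to develop versatile reward models that are effective across both TTS and RL, we observe that two verifiers with nearly identical TTS performance can nevertheless yield very different results in RL. Intuitively, TTS primarily reflects a model's ability to select the best trajectory, and this ability does not necessarily generalize to RL. To address this limitation, we identify two additional aspects that are crucial for RL training: classification accuracy and calibration. We then conduct comprehensive controlled experiments to investigate how to train a robust reward model that performs well across these metrics. In particular, we analyze the impact of factors such as training data scale, policy mixtures, and data source composition. Guided by these investigations, we introduce SWE-RM, an accurate and robust reward model that adopts a mixture-of-experts architecture with 30B total parameters, of which 3B are activated during inference. SWE-RM substantially improves SWE agents in both TTS and RL performance. For example, on SWE-Bench Verified using TTS, it increases the accuracy of Qwen3-Coder-Flash from 51.6% to 62.0% and that of Qwen3-Coder-Max from 67.0% to 74.6%, achieving new state-of-the-art performance among open-source models.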
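The abstract names three properties of a verifier (best-trajectory selection for TTS, classification accuracy, and calibration) without stating how each is measured. The following is a minimal sketch of one common way to compute them, not the method used for SWE-RM. It assumes the verifier emits a score in [0, 1] per trajectory, uses a 0.5 decision threshold, and measures calibration with an equal-width-bin expected calibration error. All function names and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple

Scores = Sequence[Sequence[float]]   # scores[task][trajectory], assumed in [0, 1]
Labels = Sequence[Sequence[bool]]    # resolved[task][trajectory], True if the issue was fixed


def _flatten(scores: Scores, resolved: Labels) -> List[Tuple[float, bool]]:
    """Pair every trajectory score with its ground-truth outcome."""
    return [(s, r) for ts, tr in zip(scores, resolved) for s, r in zip(ts, tr)]


def best_of_k_accuracy(scores: Scores, resolved: Labels) -> float:
    """TTS-style metric: per task, pick the top-scored trajectory and check it."""
    hits = 0
    for task_scores, task_resolved in zip(scores, resolved):
        best = max(range(len(task_scores)), key=task_scores.__getitem__)
        hits += task_resolved[best]
    return hits / len(scores)


def classification_accuracy(scores: Scores, resolved: Labels,
                            threshold: float = 0.5) -> float:
    """Fraction of trajectories whose thresholded score matches the outcome."""
    flat = _flatten(scores, resolved)
    return sum((s >= threshold) == r for s, r in flat) / len(flat)


def expected_calibration_error(scores: Scores, resolved: Labels,
                               n_bins: int = 10) -> float:
    """Weighted gap between mean score and empirical success rate per bin."""
    flat = _flatten(scores, resolved)
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for s, r in flat:
        bins[min(int(s * n_bins), n_bins - 1)].append((s, r))
    ece = 0.0
    for b in bins:
        if b:
            mean_score = sum(s for s, _ in b) / len(b)
            success_rate = sum(r for _, r in b) / len(b)
            ece += len(b) / len(flat) * abs(mean_score - success_rate)
    return ece


# Toy data (made up): two tasks, two trajectories each.
resolved = [[True, False], [True, False]]
verifier_a = [[0.90, 0.10], [0.80, 0.20]]   # ranks correctly, scores are meaningful
verifier_b = [[0.45, 0.40], [0.99, 0.95]]   # ranks correctly, scores are not

for name, v in (("A", verifier_a), ("B", verifier_b)):
    print(name,
          best_of_k_accuracy(v, resolved),
          classification_accuracy(v, resolved),
          round(expected_calibration_error(v, resolved), 4))
```

In this toy setup both verifiers always rank the resolved trajectory first, so their best-of-k selection is identical, yet verifier B misclassifies half of the trajectories at the 0.5 threshold and is less well calibrated. This mirrors the observation above that similar TTS performance does not imply similar usefulness as an RL reward.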

