SWE-RM: Execution-free Feedback For Software Engineering Agents

Abstract

Execution-based feedback such as unit testing is widely used to develop coding agents through test-time scaling (TTS) and reinforcement learning (RL). This paradigm requires scalable and reliable collection of unit test cases to provide accurate feedback, and the resulting feedback is often sparse: it cannot effectively distinguish between two trajectories that both succeed or both fail. In contrast, execution-free feedback from reward models can provide more fine-grained signals without depending on unit test cases. Despite this potential, execution-free feedback for realistic software engineering (SWE) agents remains underexplored. In aiming to develop versatile reward models that are effective across both TTS and RL, however, we observe that two verifiers with nearly identical TTS performance can yield very different results in RL. Intuitively, TTS primarily reflects the model's ability to select the best trajectory, but this ability does not necessarily generalize to RL. To address this limitation, we identify two additional aspects that are crucial for RL training: classification accuracy and calibration. We then conduct comprehensive controlled experiments to investigate how to train a robust reward model that performs well across these metrics. In particular, we analyze the impact of factors such as training data scale, policy mixtures, and data source composition. Guided by these investigations, we introduce SWE-RM, an accurate and robust reward model that adopts a mixture-of-experts architecture with 30B total parameters, of which 3B are activated during inference. SWE-RM substantially improves SWE agents on both TTS and RL performance. For example, using TTS on SWE-Bench Verified, it increases the accuracy of Qwen3-Coder-Flash from 51.6% to 62.0% and that of Qwen3-Coder-Max from 67.0% to 74.6%, achieving new state-of-the-art performance among open-source models.
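As a rough illustration of why TTS performance alone may not predict RL usefulness, the sketch below computes the three quantities named above for two hypothetical verifiers. It assumes standard definitions that the abstract does not spell out: best-of-n selection for TTS, thresholded accuracy for classification, and expected calibration error (ECE) for calibration. The function names, scores, and labels are invented for illustration and are not taken from the paper.

```python
from typing import Sequence


def best_of_n(scores: Sequence[float]) -> int:
    """TTS-style selection: index of the highest-scoring trajectory."""
    return max(range(len(scores)), key=lambda i: scores[i])


def classification_accuracy(
    scores: Sequence[float], labels: Sequence[int], threshold: float = 0.5
) -> float:
    """Fraction of trajectories whose thresholded score matches the label."""
    correct = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(scores)


def expected_calibration_error(
    scores: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """ECE: bin-weighted gap between mean score and empirical success rate."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes to the last bin
        bins[idx].append((s, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_score = sum(s for s, _ in b) / len(b)
        success_rate = sum(y for _, y in b) / len(b)
        ece += len(b) / len(scores) * abs(mean_score - success_rate)
    return ece


# Hypothetical data: 1 = trajectory resolved the issue, 0 = it did not.
labels = [1, 0, 0, 1]
verifier_a = [0.90, 0.20, 0.10, 0.80]  # separates successes from failures
verifier_b = [0.90, 0.85, 0.80, 0.88]  # same top pick, but scores everything high

for name, scores in (("A", verifier_a), ("B", verifier_b)):
    print(
        name,
        "picks", best_of_n(scores),
        "acc", round(classification_accuracy(scores, labels), 3),
        "ece", round(expected_calibration_error(scores, labels), 3),
    )
```

Both verifiers select the same resolved trajectory, so they look identical under best-of-n selection, yet verifier B misclassifies the failed trajectories and is far less calibrated. If the scores were used directly as RL rewards, B would reward failures almost as much as successes.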
