Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning

Abstract

Research on applying Reinforcement Learning (RL) to Large Language Models (LLMs) has focused mostly on single-turn problems, such as mathematical reasoning or single-shot code generation. While these problems can be viewed as token-level multi-turn MDPs (Markov decision processes), this view corresponds to a degenerate case of multi-turn interaction in which the environment provides no feedback. This contrasts with many real-world domains, such as software engineering (SWE), which require rich multi-turn interactions with a stateful environment that responds to each action with a non-trivial observation. To bridge this gap, we demonstrate the successful application of RL to this general regime. Using a modified Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) algorithm, we train an agent based on Qwen2.5-72B-Instruct to solve real-world software engineering tasks. Our approach increases the agent's success rate on the SWE-bench Verified benchmark from 20% for a rejection fine-tuned baseline to 39%, without relying on any teacher models. On SWE-rebench, our agent matches or outperforms leading open-weight models such as DeepSeek-V3-0324 and Qwen3-235B-A22B when all models use identical scaffolding, offering a viable path toward building more capable autonomous agents for complex real-world problems based on open models.
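The contrast between the degenerate single-turn case and a stateful multi-turn environment can be made concrete with a toy sketch. Everything below is a hypothetical illustration, not the authors' code or scaffolding: the names `NoFeedbackEnv`, `ToyRepoEnv`, `rollout`, and `scripted_policy` are assumptions introduced here, and the "policy" is a hand-written rule standing in for an LLM.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, float, bool]  # (observation, reward, done)


class NoFeedbackEnv:
    """Degenerate case: one action, an empty observation, then the episode ends."""

    def reset(self) -> str:
        return "Solve: 2 + 2"

    def step(self, action: str) -> Step:
        reward = 1.0 if action.strip() == "4" else 0.0
        return "", reward, True  # nothing to observe; terminal immediately


class ToyRepoEnv:
    """Stateful toy environment: actions change state and return observations."""

    def __init__(self) -> None:
        self.fixed = False

    def reset(self) -> str:
        self.fixed = False
        return "Issue: test_add fails"

    def step(self, action: str) -> Step:
        if action == "run_tests":
            return ("1 passed" if self.fixed else "1 failed"), 0.0, False
        if action == "apply_patch":
            self.fixed = True  # the environment state changes here
            return "patch applied", 0.0, False
        if action == "submit":
            return "done", (1.0 if self.fixed else 0.0), True
        return "unknown command", 0.0, False


def rollout(env, policy: Callable[[List[str]], str], max_turns: int = 8):
    """Run one episode; the policy sees the full history of actions and observations."""
    history = [env.reset()]
    total_reward = 0.0
    for _ in range(max_turns):
        action = policy(history)
        observation, reward, done = env.step(action)
        history += [action, observation]
        total_reward += reward
        if done:
            break
    return history, total_reward


def scripted_policy(history: List[str]) -> str:
    """Hand-written stand-in for an agent: reacts to the latest observation."""
    next_action = {
        "Issue: test_add fails": "run_tests",
        "1 failed": "apply_patch",
        "patch applied": "run_tests",
        "1 passed": "submit",
    }
    return next_action.get(history[-1], "submit")


single_history, single_reward = rollout(NoFeedbackEnv(), lambda history: "4")
multi_history, multi_reward = rollout(ToyRepoEnv(), scripted_policy)
```

In the first rollout the history never contains a useful observation, so the whole episode reduces to generating one response. In the second, each decision depends on what the environment returned, and the reward arrives only at submission, which is the regime the abstract describes.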

