Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning
August 5, 2025
Authors: Alexander Golubev, Maria Trofimova, Sergei Polezhaev, Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Sergey Abramov, Andrei Andriushchenko, Filipp Fisin, Sergei Skvortsov, Boris Yangel
cs.AI
Abstract
Research on applications of Reinforcement Learning (RL) to Large Language
Models (LLMs) has mostly been focused on single-turn problems, such as
mathematical reasoning or single-shot code generation. While these problems can
be viewed as token-level multi-turn MDPs, this view corresponds to a degenerate
case of multi-turn interaction where the environment provides no feedback. This
contrasts with many real-world domains, such as software engineering (SWE),
which require rich multi-turn interactions with a stateful environment that
responds to each action with a non-trivial observation.
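The contrast above can be made concrete with a toy interaction loop. The sketch below is illustrative only: the environment, policy, and action names are hypothetical stand-ins, not the paper's actual agent scaffolding. In the single-turn regime the model emits tokens and the episode ends; in the multi-turn regime each action returns a non-trivial observation that conditions the next action.

```python
class ToyShellEnv:
    """Toy stand-in for a stateful SWE environment: every action
    receives a non-trivial observation in response."""
    def __init__(self):
        self.history = []

    def reset(self):
        self.history = []
        return "repo cloned"

    def step(self, action):
        self.history.append(action)
        # The observation depends on the environment's state so far.
        obs = f"ran {action!r} ({len(self.history)} commands so far)"
        done = action == "submit"
        reward = 1.0 if done else 0.0
        return obs, reward, done

def multi_turn_episode(policy, env, max_turns=10):
    """Stateful multi-turn loop: the agent conditions each action on
    all prior observations, unlike single-turn generation where the
    environment provides no feedback."""
    trajectory = []
    obs = env.reset()
    for _ in range(max_turns):
        action = policy(obs, trajectory)
        obs, reward, done = env.step(action)
        trajectory.append((action, obs, reward))
        if done:
            break
    return trajectory

# Trivial scripted policy, purely for illustration.
def scripted_policy(obs, trajectory):
    return "run tests" if len(trajectory) < 2 else "submit"

trajectory = multi_turn_episode(scripted_policy, ToyShellEnv())
```

Here the episode is a genuine MDP: the reward arrives only at the end, and intermediate observations carry state the agent must use, which is what distinguishes this regime from single-shot generation.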
To bridge this gap, we demonstrate the successful application of RL to this
general regime. Using a modified Decoupled Advantage Policy Optimization (DAPO)
algorithm, we train an agent based on Qwen2.5-72B-Instruct to solve real-world
software engineering tasks. Our approach increases the agent's success rate on
the SWE-bench Verified benchmark from a 20% rejection fine-tuned baseline to
39%, without relying on any teacher models. On SWE-rebench, our agent matches
or outperforms leading open-weight models such as DeepSeek-V3-0324 and
Qwen3-235B-A22B using an identical scaffolding, offering a viable path toward
building more capable autonomous agents for complex real-world problems based
on open models.
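Two ingredients common to DAPO-style training can be sketched briefly: group-relative advantages (each sampled trajectory's reward is normalized against its group) and dynamic sampling (groups whose rewards are all identical are discarded, since they contribute no gradient signal). This is a minimal sketch of those two pieces under the stated assumptions, not the paper's modified algorithm; function names are hypothetical.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage estimate: normalize each trajectory's
    scalar reward by its group's mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def keep_group(rewards):
    """Dynamic-sampling filter: drop groups whose rewards are all
    identical (all-pass or all-fail), as their advantages are zero."""
    return len(set(rewards)) > 1
```

For example, a group of four rollouts where one patch passes the tests (`[1.0, 0.0, 0.0, 0.0]`) yields a positive advantage for the passing rollout and negative advantages for the rest, while an all-fail group would be filtered out before the policy update.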