A Practitioner's Guide to Multi-turn Agentic Reinforcement Learning
October 1, 2025
Authors: Ruiyi Wang, Prithviraj Ammanabrolu
cs.AI
Abstract
We study what actually works and what doesn't for training large language
models as agents via multi-turn reinforcement learning. Despite rapid progress,
existing frameworks and definitions are fragmented, and there is no systematic
formulation or analysis of which design choices matter across tasks. We address
this gap by first breaking down the design space into three inter-related
pillars -- environment, reward, and policy -- and empirically derive a recipe
for training LLM agents in situated textual domains. In particular, we test
TextWorld and ALFWorld, popular domains for testing situated embodied
reasoning, as well as SWE-Gym for more software engineering style tasks. (i)
For the environment, we analyze the impacts of task complexity in terms of
sizes of the state and action spaces as well as optimal solution length,
finding that even simple environments within a domain can provide signal on how
well an agent can generalize to more complex tasks. (ii) For the reward, we
ablate relative reward sparsity, observing that while dense turn-level rewards
accelerate training, performance and stability are highly dependent on the
choice of RL algorithm. (iii) For the agent's policy, we explore the
interplay between reward sparsity and biased (PPO, GRPO) and unbiased (RLOO)
policy gradient methods in addition to showing how to find the optimal
Supervised Fine-tuning (SFT) to RL training ratio given a fixed budget. We
distill these findings into a training recipe that guides co-design across the
three pillars, facilitating research and practical efforts in multi-turn
agentic RL. Code: https://github.com/pearls-lab/meow-tea-taro
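To make the reward-sparsity and advantage-estimator contrasts concrete, here is a minimal sketch (not the paper's implementation) of the two ideas the abstract names: sparse episode-level versus dense turn-level returns, and an unbiased RLOO leave-one-out baseline versus GRPO-style group normalization. The function names and reward values are hypothetical; only the standard RLOO and GRPO advantage formulas are assumed.

```python
import numpy as np

def rloo_advantages(returns: np.ndarray) -> np.ndarray:
    """Unbiased leave-one-out baseline (RLOO): each rollout's return is
    compared against the mean of the *other* rollouts in its group."""
    k = returns.shape[0]
    baseline = (returns.sum() - returns) / (k - 1)
    return returns - baseline

def grpo_advantages(returns: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style group normalization: subtract the group mean and divide by
    the group std. The std division biases the gradient estimate but often
    stabilizes updates."""
    return (returns - returns.mean()) / (returns.std() + eps)

def episode_return(turn_rewards: list[float], terminal_reward: float,
                   dense: bool = True) -> float:
    """Hypothetical reward schemes for one multi-turn episode:
    - sparse: only the terminal success/failure signal is used;
    - dense: intermediate turn-level rewards are summed in as well."""
    if dense:
        return sum(turn_rewards) + terminal_reward
    return terminal_reward

if __name__ == "__main__":
    # A group of 4 rollouts sampled for the same task prompt.
    rollout_turn_rewards = [
        [0.1, 0.0, 0.2],   # partial progress, then success
        [0.0, 0.0, 0.0],   # no progress
        [0.1, 0.1, 0.0],   # partial progress, no success
        [0.0, 0.2, 0.1],   # success
    ]
    terminal = [1.0, 0.0, 0.0, 1.0]

    for dense in (False, True):
        returns = np.array([
            episode_return(tr, tm, dense=dense)
            for tr, tm in zip(rollout_turn_rewards, terminal)
        ])
        print(f"dense={dense}  returns={returns}")
        print("  RLOO advantages:", rloo_advantages(returns))
        print("  GRPO advantages:", grpo_advantages(returns))
```

Running the sketch with dense versus sparse returns shows how turn-level shaping changes the spread of returns within a group, which in turn changes the advantages each estimator feeds into the policy-gradient update; this is the interplay the paper ablates across PPO, GRPO, and RLOO.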