
Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs

December 18, 2025
Authors: Junbo Li, Peng Zhou, Rui Meng, Meet P. Vadera, Lihong Li, Yang Li
cs.AI

Abstract

Reinforcement learning (RL) has re-emerged as a natural approach for training interactive LLM agents in real-world environments. However, directly applying the widely used Group Relative Policy Optimization (GRPO) algorithm to multi-turn tasks exposes notable limitations, particularly in scenarios requiring long-horizon reasoning. To address these challenges, we investigate more stable and effective advantage estimation strategies, especially for multi-turn settings. We first explore Proximal Policy Optimization (PPO) as an alternative and find it to be more robust than GRPO. To further enhance PPO in multi-turn scenarios, we introduce turn-PPO, a variant that operates on a turn-level MDP formulation, as opposed to the commonly used token-level MDP. Our results on the WebShop and Sokoban datasets demonstrate the effectiveness of turn-PPO, both with and without long reasoning components.
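To make the turn-level formulation concrete, the sketch below illustrates one plausible reading of turn-level advantage estimation combined with a standard PPO clipped objective: generalized advantage estimation (GAE) is computed over per-turn rewards and critic values rather than per-token, and each turn's advantage is broadcast to the tokens generated in that turn. The function names, the GAE-over-turns choice, and the toy data are assumptions for exposition only, not the authors' released implementation.

```python
# Hypothetical sketch (not the paper's code): turn-level advantage estimation
# for PPO in a multi-turn setting. Assumes one scalar reward and one critic
# value per turn; advantages are estimated over turns and then broadcast to
# that turn's action tokens for the usual PPO clipped surrogate.

import torch


def turn_level_gae(turn_rewards, turn_values, gamma=0.99, lam=0.95):
    """GAE computed over turns instead of tokens.

    turn_rewards: (T,) reward observed at the end of each turn.
    turn_values:  (T,) critic estimate for the state at the start of each turn.
    """
    T = turn_rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = turn_values[t + 1] if t + 1 < T else 0.0  # terminal turn
        delta = turn_rewards[t] + gamma * next_value - turn_values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages


def ppo_clipped_loss(logp_new, logp_old, token_advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate over action tokens."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * token_advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * token_advantages
    return -torch.min(unclipped, clipped).mean()


if __name__ == "__main__":
    # Toy episode: 3 turns with 4, 2, and 3 generated action tokens each.
    tokens_per_turn = [4, 2, 3]
    turn_rewards = torch.tensor([0.0, 0.0, 1.0])   # sparse terminal reward
    turn_values = torch.tensor([0.3, 0.5, 0.8])    # per-turn critic outputs

    adv_turn = turn_level_gae(turn_rewards, turn_values)
    # Broadcast each turn's advantage to all tokens generated in that turn.
    adv_token = torch.cat([a.repeat(n) for a, n in zip(adv_turn, tokens_per_turn)])

    total_tokens = sum(tokens_per_turn)
    logp_old = torch.randn(total_tokens)
    logp_new = logp_old + 0.05 * torch.randn(total_tokens)
    print("turn advantages:", adv_turn)
    print("PPO loss:", ppo_clipped_loss(logp_new, logp_old, adv_token).item())
```

The design point this sketch is meant to convey is the contrast with a token-level MDP: the critic and the advantage signal live at the granularity of a turn, which shortens the effective credit-assignment horizon in long multi-turn episodes, while the policy update itself still operates on token log-probabilities.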