Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs
December 18, 2025
Authors: Junbo Li, Peng Zhou, Rui Meng, Meet P. Vadera, Lihong Li, Yang Li
cs.AI
Abstract
Reinforcement learning (RL) has re-emerged as a natural approach for training interactive LLM agents in real-world environments. However, directly applying the widely used Group Relative Policy Optimization (GRPO) algorithm to multi-turn tasks exposes notable limitations, particularly in scenarios requiring long-horizon reasoning. To address these challenges, we investigate more stable and effective advantage estimation strategies, especially for multi-turn settings. We first explore Proximal Policy Optimization (PPO) as an alternative and find it to be more robust than GRPO. To further enhance PPO in multi-turn scenarios, we introduce turn-PPO, a variant that operates on a turn-level MDP formulation, as opposed to the commonly used token-level MDP. Our results on the WebShop and Sokoban datasets demonstrate the effectiveness of turn-PPO, both with and without long reasoning components.
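The abstract does not spell out how turn-level advantages are computed. As a rough illustration of the idea only, the sketch below applies standard generalized advantage estimation (GAE) over a turn-level MDP, where each turn of the multi-turn episode, rather than each generated token, is one timestep. The function name, the gamma/lambda values, and the example numbers are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def turn_level_gae(turn_rewards, turn_values, gamma=0.99, lam=0.95):
    """GAE over a turn-level MDP: each turn is one timestep.

    turn_rewards: per-turn scalar rewards, length T
    turn_values:  critic value estimates, length T + 1
                  (last entry is the bootstrap value after the final turn)
    """
    T = len(turn_rewards)
    advantages = np.zeros(T)
    gae = 0.0
    # Standard backward GAE recursion, indexed by turn instead of token.
    for t in reversed(range(T)):
        delta = turn_rewards[t] + gamma * turn_values[t + 1] - turn_values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    returns = advantages + np.asarray(turn_values[:T])
    return advantages, returns

# Example: a 3-turn episode with a sparse terminal reward (WebShop-style).
adv, ret = turn_level_gae(
    turn_rewards=[0.0, 0.0, 1.0],
    turn_values=[0.2, 0.4, 0.7, 0.0],  # bootstrap value of 0 after the last turn
)
# One plausible use (an assumption, not stated in the abstract): broadcast
# adv[t] to every token generated in turn t when forming the clipped PPO loss.
```

In a token-level formulation, the same recursion would instead run over every generated token, which is the contrast the abstract draws between turn-PPO and the commonly used token-level MDP.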