SeeUPO: Sequence-Level Agentic-RL with Convergence Guarantees
February 6, 2026
Authors: Tianyi Hu, Qingxu Fu, Yanxi Chen, Zhaoyang Liu, Bolin Ding
cs.AI
Abstract
Reinforcement learning (RL) has emerged as the predominant paradigm for training large language model (LLM)-based AI agents. However, existing backbone RL algorithms lack verified convergence guarantees in agentic scenarios, especially in multi-turn settings, which can lead to training instability and failure to converge to optimal policies.
In this paper, we systematically analyze how different combinations of policy update mechanisms and advantage estimation methods affect convergence properties in single-turn and multi-turn scenarios. We find that REINFORCE with Group Relative Advantage Estimation (GRAE) converges to the global optimum in the undiscounted setting, but combining PPO with GRAE breaks PPO's original monotonic-improvement property. Furthermore, we show that mainstream backbone RL algorithms cannot simultaneously be critic-free and retain convergence guarantees in multi-turn scenarios.
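The abstract does not define GRAE beyond its name; assuming it follows the standard group-relative scheme (standardizing each rollout's reward against the group sampled for the same prompt, as in GRPO-style methods), a minimal sketch might look like:

```python
import statistics

def group_relative_advantages(rewards):
    """Group Relative Advantage Estimation (GRAE), sketched under the
    assumption that it matches the common group-relative recipe:
    each trajectory's advantage is its reward standardized against the
    mean and std of the group of rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # Degenerate group: all rewards identical, no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

Note that the advantages sum to zero within each group, which is what makes the estimator "relative": a trajectory is only credited for outperforming its peers on the same prompt.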
To address this, we propose SeeUPO (Sequence-level Sequential Update Policy Optimization), a critic-free approach with convergence guarantees for multi-turn interactions. SeeUPO models multi-turn interaction as a sequence of multi-agent bandit problems executed in order. By updating the policy turn by turn in reverse execution order, it ensures monotonic improvement and, via backward induction, convergence to the globally optimal solution.
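The update schedule described above can be sketched as follows. The names `seeupo_update_schedule` and the `update_turn_policy` callback are hypothetical illustrations of the reverse-order, turn-by-turn loop; the actual per-turn optimization step is not specified in the abstract:

```python
def seeupo_update_schedule(num_turns):
    """Turns are updated in reverse execution order (last turn first),
    mirroring backward induction: once the policies for turns t+1..T
    are fixed, turn t reduces to a bandit problem with a well-defined
    return, so it can be optimized in isolation."""
    return list(range(num_turns - 1, -1, -1))

def train_seeupo(num_turns, update_turn_policy):
    # update_turn_policy(t) stands in for one sequence-level
    # policy-optimization step applied to turn t only (hypothetical
    # callback; the paper's concrete update rule is not given here).
    for t in seeupo_update_schedule(num_turns):
        update_turn_policy(t)
```

The design choice to fix later turns before optimizing earlier ones is what yields the monotonic-improvement argument: each update can only improve the expected return of the remaining (already-fixed) suffix.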
Experiments on AppWorld and BFCL v4 demonstrate SeeUPO's substantial improvements over existing backbone algorithms: relative gains of 43.3%-54.6% on Qwen3-14B and 24.1%-41.9% on Qwen2.5-14B (averaged across benchmarks), along with superior training stability.