SeeUPO: Sequence-Level Agentic-RL with Convergence Guarantees

Abstract

Reinforcement learning (RL) has emerged as the predominant paradigm for training large language model (LLM)-based AI agents. However, existing backbone RL algorithms lack verified convergence guarantees in agentic scenarios, especially in multi-turn settings, which can lead to training instability and failure to converge to optimal policies. In this paper, we systematically analyze how different combinations of policy update mechanisms and advantage estimation methods affect convergence properties in single-turn and multi-turn scenarios. We find that REINFORCE with Group Relative Advantage Estimation (GRAE) can converge to the global optimum under undiscounted conditions, but that combining PPO with GRAE breaks PPO's original monotonic improvement property. Furthermore, we demonstrate that mainstream backbone RL algorithms cannot be critic-free and retain convergence guarantees at the same time in multi-turn scenarios. To address this, we propose SeeUPO (Sequence-level Sequential Update Policy Optimization), a critic-free approach with convergence guarantees for multi-turn interactions. SeeUPO models multi-turn interaction as sequentially executed multi-agent bandit problems. By updating the policy turn by turn in reverse execution order, it ensures monotonic improvement and convergence to the globally optimal solution via backward induction. Experiments on AppWorld and BFCL v4 demonstrate substantial improvements over existing backbone algorithms: relative gains of 43.3%-54.6% on Qwen3-14B and 24.1%-41.9% on Qwen2.5-14B (averaged across benchmarks), along with superior training stability.
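To make two of the ideas named in the abstract concrete, the following is a minimal toy sketch and not the paper's implementation. It assumes that a group-relative advantage is simply each sampled reward minus its group's mean (the abstract does not give the exact GRAE formula), and it illustrates reverse-order updating with exact backward induction on a hypothetical two-turn reward table rather than with gradient-based policy updates. All function names and the reward values are illustrative assumptions.

```python
from statistics import mean


def group_relative_advantages(rewards):
    """Assumed form: each reward minus the mean reward of its sampled group.

    No critic is needed because the group mean acts as the baseline.
    """
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def solve_in_reverse_turn_order(reward):
    """Backward induction on a two-turn episode with a terminal reward.

    reward[a0][a1] is the reward after taking a0 at turn 0 and a1 at turn 1.
    The last turn is optimized first, then the first turn is optimized
    against the already-fixed last-turn policy.
    """
    # Turn 1 (last turn): best second action for every possible first action.
    turn1_policy = [max(range(len(row)), key=row.__getitem__) for row in reward]
    # Turn 0: value of each first action when turn 1 follows turn1_policy.
    values = [reward[a0][turn1_policy[a0]] for a0 in range(len(reward))]
    turn0_action = max(range(len(values)), key=values.__getitem__)
    return turn0_action, turn1_policy, values[turn0_action]


# Hypothetical reward table: rows index the turn-0 action, columns the turn-1 action.
TOY_REWARD = [
    [1.0, 0.0],
    [0.0, 3.0],
]

if __name__ == "__main__":
    print(group_relative_advantages([1.0, 0.0, 3.0, 0.0]))
    print(solve_in_reverse_turn_order(TOY_REWARD))
```

In this toy table the best joint choice needs both turns to pick action 1. Fixing the last turn first means the first turn is evaluated against an already-optimal continuation, which is the intuition behind updating in reverse execution order.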
