T^2PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning

Abstract

Recent progress in multi-turn reinforcement learning (RL) has significantly improved the performance of reasoning LLMs on complex interactive tasks. Despite advances in stabilization techniques such as fine-grained credit assignment and trajectory filtering, instability remains pervasive and often leads to training collapse. We argue that this instability stems from inefficient exploration in multi-turn settings, where policies keep generating low-information actions that neither reduce uncertainty nor advance task progress. To address this issue, we propose Token- and Turn-level Policy Optimization (T^2PO), an uncertainty-aware framework that explicitly controls exploration at fine-grained levels. At the token level, T^2PO monitors uncertainty dynamics and triggers a thinking intervention once the marginal uncertainty change falls below a threshold. At the turn level, T^2PO identifies interactions with negligible exploration progress and dynamically resamples such turns to avoid wasted rollouts. We evaluate T^2PO in diverse environments, including WebShop, ALFWorld, and Search QA, and demonstrate substantial gains in training stability and task performance through better exploration efficiency. Code is available at: https://github.com/WillDreamer/T2PO.
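
The two control points described in the abstract can be illustrated with a minimal sketch. The abstract does not say how uncertainty or exploration progress is measured, so this example assumes Shannon entropy of the next-token distribution as the uncertainty signal and a caller-supplied progress score per turn. All function names, parameters, and thresholds here are hypothetical and are not taken from the authors' code.

```python
import math
from typing import Callable, Optional, Sequence, Tuple


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution.

    Assumption: entropy stands in for the paper's uncertainty measure.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def first_stall_index(entropies: Sequence[float], threshold: float) -> Optional[int]:
    """Token-level check (hypothetical): return the first index t at which the
    marginal uncertainty change |H_t - H_{t-1}| falls below `threshold`.

    A caller could trigger a thinking intervention at that position.
    Returns None if the change never drops below the threshold.
    """
    for t in range(1, len(entropies)):
        if abs(entropies[t] - entropies[t - 1]) < threshold:
            return t
    return None


def sample_turn_with_resampling(
    sample_turn: Callable[[], Tuple[str, float]],
    min_progress: float,
    max_resamples: int = 2,
) -> Tuple[str, float, int]:
    """Turn-level check (hypothetical): draw a turn and resample it while its
    progress score stays below `min_progress`, up to `max_resamples` extra draws.

    `sample_turn` returns (action, progress_score); how progress is scored is
    left to the caller. Returns the best action seen, its score, and the
    total number of draws.
    """
    best = sample_turn()
    draws = 1
    while best[1] < min_progress and draws <= max_resamples:
        candidate = sample_turn()
        draws += 1
        if candidate[1] > best[1]:
            best = candidate
    return best[0], best[1], draws
```

In this sketch, `first_stall_index([2.0, 1.5, 1.45, 1.0], threshold=0.1)` flags index 2, because the entropy changes by only 0.05 between positions 1 and 2. Bounding the number of resamples keeps the rollout cost predictable even when no candidate turn reaches the progress floor.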
