A^2TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

Abstract

Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, which makes it difficult to evaluate the contribution of individual tool calls within multi-turn interactions. Existing approaches to such process credit assignment depend either on separate external process reward models, which introduce additional cost, or on tree-based structural rollouts, which merely redistribute the outcome signal while constraining trajectory diversity. A promising alternative uses the per-turn change in the policy's predicted probability of the ground truth, termed Information Gain (IG), as an intrinsic process signal that requires no external evaluator. However, prior work that uses IG signals within the RL training loop faces three systematic challenges: normalizing across turns that face heterogeneous positional contexts can distort the relative standing of individual turns; accumulating a variable number of terms causes advantage magnitudes to drift with trajectory depth; and a fixed clipping range governs policy updates identically for turns with vastly different IG signals. In this paper, we propose A^2TGPO (Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping), which retains IG as the intrinsic signal but redesigns how it is normalized, accumulated, and consumed: (i) turn-group normalization normalizes IG within each (prompt, turn-index) group so that each turn is compared only against peers at the same interaction depth; (ii) variance-rescaled discounted accumulation divides the cumulative normalized IG by the square root of the number of accumulated terms to keep advantage magnitudes comparable across turn positions; and (iii) adaptive turn-level clipping modulates each turn's clipping range based on its normalized IG, widening the update region for informative turns and narrowing it for uninformative ones.
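
The three components named in the abstract can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything concrete here is an assumption made for illustration: the function names, the z-score form of the normalization, accumulation from the current turn to the end of the trajectory, the discount factor `gamma`, and the tanh-based rule (with hypothetical parameters `base_eps` and `alpha`) that maps normalized IG to a clip range. It should not be read as the paper's actual implementation.

```python
import math
from statistics import mean, pstdev


def turn_group_normalize(ig, eps=1e-8):
    """Normalize IG within each (prompt, turn-index) group.

    `ig` holds the rollouts sampled for ONE prompt; ig[i][t] is the IG of
    turn t in rollout i. Rollouts may have different lengths. Each value is
    z-scored only against values at the same turn index (assumed form).
    """
    max_turns = max(len(rollout) for rollout in ig)
    out = [[0.0] * len(rollout) for rollout in ig]
    for t in range(max_turns):
        members = [i for i, rollout in enumerate(ig) if len(rollout) > t]
        values = [ig[i][t] for i in members]
        mu, sd = mean(values), pstdev(values)
        for i in members:
            out[i][t] = (ig[i][t] - mu) / (sd + eps)
    return out


def rescaled_discounted_sum(z, gamma=1.0):
    """Discounted sum of normalized IG from turn t onward, divided by the
    square root of the number of accumulated terms.

    Assumption: accumulation runs from the current turn to the last turn.
    """
    num_turns = len(z)
    advantages = []
    for t in range(num_turns):
        n_terms = num_turns - t
        total = sum(gamma ** (k - t) * z[k] for k in range(t, num_turns))
        advantages.append(total / math.sqrt(n_terms))
    return advantages


def adaptive_clip_range(z_t, base_eps=0.2, alpha=0.5):
    """Per-turn clip range: wider for high normalized IG, narrower for low.

    Assumption: "informative" means high normalized IG, and the modulation is
    base_eps * (1 + alpha * tanh(z_t)), which stays positive for 0 <= alpha < 1.
    """
    return base_eps * (1.0 + alpha * math.tanh(z_t))


# Hypothetical IG values for two rollouts of the same prompt.
ig = [[0.1, 0.3], [0.3, 0.1, 0.2]]
z = turn_group_normalize(ig)
advantages = [rescaled_discounted_sum(rollout) for rollout in z]
clip_ranges = [[adaptive_clip_range(v) for v in rollout] for rollout in z]
```

Dividing by the square root of the term count reflects the observation that a sum of n roughly independent, unit-variance terms has standard deviation near sqrt(n), so the rescaled sum keeps a comparable scale whether a turn has one or many turns after it.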
