A^2TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

Abstract

Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, which makes it difficult to evaluate the contribution of individual tool calls within multi-turn interactions. Existing approaches to this process credit assignment problem either depend on separate external process reward models, which add computational overhead, or use tree-structured rollouts, which merely redistribute the outcome signal while constraining trajectory diversity. A promising alternative uses the per-turn change in the policy's predicted probability of the ground truth, termed Information Gain (IG), as an intrinsic process signal that requires no external evaluator. However, prior work that uses IG signals within the RL training loop faces three systematic challenges: normalizing across turns that face heterogeneous positional contexts can distort the relative standing of individual turns; accumulating a variable number of terms causes advantage magnitudes to drift with trajectory depth; and a fixed clipping range governs policy updates identically for turns with vastly different IG signals. In this paper, we propose A^2TGPO (Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping), which retains IG as the intrinsic signal but redesigns how it is normalized, accumulated, and consumed: (i) turn-group normalization, which normalizes IG within each (prompt, turn-index) group so that each turn is compared only against peers at the same interaction depth; (ii) variance-rescaled discounted accumulation, which divides the cumulative normalized IG by the square root of the number of accumulated terms to keep advantage magnitudes comparable across turn positions; and (iii) adaptive turn-level clipping, which modulates each turn's clipping range based on its normalized IG, widening the update region for informative turns and narrowing it for uninformative ones.
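The three components can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no exact formulas, so the function names, the direction of accumulation (the current turn plus later turns), the `tanh` modulation, and the defaults `gamma`, `base_eps`, and `scale` are all assumptions chosen for illustration only.

```python
import math
from statistics import fmean, pstdev


def turn_group_normalize(ig_per_rollout, eps=1e-8):
    """Normalize IG within each turn index across rollouts of ONE prompt.

    ig_per_rollout[i][t] is the IG of turn t in rollout i; rollouts may
    have different numbers of turns.
    """
    max_turns = max(len(r) for r in ig_per_rollout)
    normalized = [[0.0] * len(r) for r in ig_per_rollout]
    for t in range(max_turns):
        # Peers are the rollouts that actually reached turn t.
        members = [i for i, r in enumerate(ig_per_rollout) if len(r) > t]
        values = [ig_per_rollout[i][t] for i in members]
        mean, std = fmean(values), pstdev(values)
        for i in members:
            normalized[i][t] = (ig_per_rollout[i][t] - mean) / (std + eps)
    return normalized


def variance_rescaled_accumulate(z, gamma=1.0):
    """Discounted sum of normalized IG, divided by sqrt(number of terms).

    Assumption: the sum runs over the current turn and all later turns.
    """
    advantages = [0.0] * len(z)
    running = 0.0
    for t in reversed(range(len(z))):
        running = z[t] + gamma * running
        n_terms = len(z) - t  # how many terms were accumulated at turn t
        advantages[t] = running / math.sqrt(n_terms)
    return advantages


def adaptive_clip_range(z_t, base_eps=0.2, scale=0.5):
    """Hypothetical modulation: wider clip range for high normalized IG.

    With 0 <= scale < 1 the result stays strictly positive.
    """
    return base_eps * (1.0 + scale * math.tanh(z_t))


# Two rollouts of the same prompt with different numbers of turns.
ig = [[0.1, 0.3], [0.3, 0.1, 0.2]]
z = turn_group_normalize(ig)
adv = [variance_rescaled_accumulate(row) for row in z]
clip = [[adaptive_clip_range(v) for v in row] for row in z]
```

In this sketch a turn with no peers at its depth (the third turn of the second rollout) receives a normalized IG of zero, so it contributes nothing to the advantage and keeps the base clipping range. How the method actually treats such singleton groups is not stated in the abstract.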
