A^2TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

May 7, 2026
作者: Dingwei Chen, Zefang Zong, Zhipeng Ma, Leo Luo, Yang Li, Chengming Li, Peng Chen, Jie Jiang
cs.AI

Abstract

Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, making it difficult to evaluate the contribution of individual tool calls within multi-turn interactions. Existing approaches to this process credit-assignment problem either depend on separate external process reward models, which introduce additional computational cost, or on tree-structured rollouts, which merely redistribute the outcome signal while constraining trajectory diversity. A promising alternative leverages the per-turn change in the policy's predicted probability of the ground-truth answer, termed Information Gain (IG), as an intrinsic process signal that requires no external evaluator. However, prior work that uses IG signals inside the RL training loop faces three systematic challenges: normalizing across turns with heterogeneous positional contexts can distort the relative standing of individual turns; accumulating a variable number of terms causes advantage magnitudes to drift with trajectory depth; and a fixed clipping range governs policy updates identically for turns with vastly different IG signals. In this paper, we propose A^2TGPO (Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping), which retains IG as the intrinsic signal but redesigns how it is normalized, accumulated, and consumed: (i) turn-group normalization normalizes IG within each (prompt, turn-index) group, so that each turn is compared only against peers at the same interaction depth; (ii) variance-rescaled discounted accumulation divides the cumulative normalized IG by the square root of the number of accumulated terms, keeping advantage magnitudes comparable across turn positions; and (iii) adaptive turn-level clipping modulates each turn's clipping range based on its normalized IG, widening the update region for informative turns and narrowing it for uninformative ones.
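
To make the signal pipeline concrete, below is a minimal Python sketch of the three-stage advantage computation the abstract describes: per-turn Information Gain, turn-group normalization, and variance-rescaled discounted accumulation. The function names, the probe that yields the ground-truth probabilities, the discount factor, and the return-to-go direction of the accumulation are assumptions on our part; the abstract specifies only the qualitative design.

```python
import math

def information_gain(gt_probs):
    """IG_t: per-turn change in the policy's predicted probability of
    the ground-truth answer. gt_probs[t] is p_theta(y* | history up to
    turn t), with gt_probs[0] the probability before any tool call.
    (How this probability is probed is not specified in the abstract.)
    """
    return [gt_probs[t] - gt_probs[t - 1] for t in range(1, len(gt_probs))]

def turn_group_normalize(ig_by_rollout):
    """Turn-group normalization: standardize IG within each
    (prompt, turn-index) group, i.e. across the rollouts sampled for
    the same prompt, so each turn is compared only against peers at
    the same interaction depth."""
    max_turns = max(len(r) for r in ig_by_rollout)
    normed = [[0.0] * len(r) for r in ig_by_rollout]
    for t in range(max_turns):
        group = [r[t] for r in ig_by_rollout if t < len(r)]
        mu = sum(group) / len(group)
        sd = math.sqrt(sum((g - mu) ** 2 for g in group) / len(group)) + 1e-8
        for i, r in enumerate(ig_by_rollout):
            if t < len(r):
                normed[i][t] = (r[t] - mu) / sd
    return normed

def turn_advantages(normed_ig, gamma=0.95):
    """Variance-rescaled discounted accumulation: a discounted sum of
    normalized IG (assumed here to run over future turns, return-to-go
    style), divided by sqrt(#accumulated terms) so that advantage
    magnitudes stay comparable across turn positions."""
    T = len(normed_ig)
    return [
        sum(gamma ** (k - t) * normed_ig[k] for k in range(t, T)) / math.sqrt(T - t)
        for t in range(T)
    ]

# Example: three rollouts of the same prompt, with 3, 2, and 3 turns.
probs = [[0.1, 0.4, 0.45, 0.9], [0.1, 0.2, 0.15], [0.1, 0.5, 0.3, 0.7]]
ig = [information_gain(p) for p in probs]
adv = [turn_advantages(n) for n in turn_group_normalize(ig)]
```

The sqrt divisor matters because trajectories have variable length: without it, early turns (which accumulate many terms) would systematically carry larger-magnitude advantages than late turns.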
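Adaptive turn-level clipping is described only qualitatively (a wider trust region for informative turns, a narrower one for uninformative turns), so the PyTorch sketch below picks one plausible modulation; `eps_base`, `alpha`, and the tanh shape are illustrative choices, not the paper's formula.

```python
import torch

def adaptive_clip_loss(logp_new, logp_old, adv, normed_ig,
                       eps_base=0.2, alpha=0.5):
    """PPO-style clipped surrogate with a per-turn clipping range.
    All arguments are 1-D tensors with one entry per turn.

    Hypothetical modulation: eps_t equals eps_base when |normalized IG|
    is at one standard deviation, widens above that (informative turns)
    and narrows below it (uninformative turns).
    """
    ratio = torch.exp(logp_new - logp_old)
    eps_t = eps_base * (1.0 + alpha * torch.tanh(normed_ig.abs() - 1.0))
    clipped = torch.clamp(ratio, 1.0 - eps_t, 1.0 + eps_t)
    # Standard pessimistic PPO objective, per turn, then averaged.
    return -torch.min(ratio * adv, clipped * adv).mean()
```

Because eps_t is computed per turn, an informative turn can move the policy further in a single update while uninformative turns stay close to the old policy, matching the behavior the abstract describes.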