Beyond Uniform Token-Level Trust Regions in Reinforcement Learning for Large Language Models

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for improving the reasoning ability of large language models (LLMs). However, existing PPO-style trust-region mechanisms remain position-agnostic: they enforce the same threshold on every token independently. This pointwise treatment conflicts with autoregressive generation in two ways. First, uniform thresholds ignore autoregressive asymmetry. Early deviations compound into sequence-level drift, so a static threshold under-regulates early divergence while over-constraining late-stage exploration. Second, evaluating token-level divergence in isolation overlooks cumulative prefix drift: each token receives the same divergence allowance regardless of how far its conditioning history has already deviated from the rollout policy. To address these limitations, we propose CPPO (Cumulative Prefix-divergence Policy Optimization), a token-level masking rule that aligns updates with a finite-horizon policy-improvement bound through two coupled mechanisms. First, a position-weighted threshold imposes stricter limits at early positions, whose effects persist longer, and relaxes constraints for late-stage tokens. Second, a cumulative prefix budget tracks historical deviations and dynamically restricts further token-level deviation to prevent errors from compounding along the prefix. Empirically, CPPO enhances training stability and significantly improves reasoning accuracy across a range of model scales.
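
To make the two mechanisms concrete, the following is a minimal illustrative sketch, not the paper's actual rule. The abstract gives no formulas, so everything here is an assumption: the function name `cppo_style_mask`, the use of the absolute log-probability ratio as the per-token divergence, the linear threshold schedule from `eps_early` to `eps_late`, the fixed `prefix_budget`, and all default values are hypothetical choices made only to show how a position-weighted threshold and a cumulative prefix budget could combine into one token-level mask.

```python
import numpy as np


def cppo_style_mask(log_ratio, eps_early=0.1, eps_late=0.3, prefix_budget=1.0):
    """Illustrative keep-mask for one sampled sequence (hypothetical form).

    log_ratio[t] is log pi_new(token_t | prefix) - log pi_rollout(token_t | prefix).
    Returns (keep, eps, prefix_drift), each of length T.
    """
    log_ratio = np.asarray(log_ratio, dtype=float)
    T = log_ratio.shape[0]

    # Assumption: |log ratio| stands in for the per-token divergence.
    div = np.abs(log_ratio)

    # Mechanism 1 (assumed linear schedule): tighter threshold at early
    # positions, looser threshold at late positions.
    frac = np.arange(T) / max(T - 1, 1)
    eps = eps_early + (eps_late - eps_early) * frac

    # Mechanism 2: drift accumulated strictly before each position
    # (exclusive cumulative sum), and the budget that remains after it.
    prefix_drift = np.cumsum(div) - div
    remaining = np.maximum(prefix_budget - prefix_drift, 0.0)

    # A token stays in the update only if its own divergence fits both the
    # position-dependent threshold and the remaining prefix budget.
    keep = div <= np.minimum(eps, remaining)
    return keep, eps, prefix_drift


if __name__ == "__main__":
    ratios = [0.2, 0.05, 0.05, 0.2]
    keep, eps, drift = cppo_style_mask(ratios)
    for t, (k, e, d) in enumerate(zip(keep, eps, drift)):
        print(f"t={t} threshold={e:.3f} prefix_drift={d:.3f} keep={bool(k)}")
```

In this toy setup the same deviation of 0.2 is masked at the first position but kept at the last, which mirrors the asymmetry described above, and shrinking `prefix_budget` causes later tokens to be masked once the prefix has drifted far enough. The actual thresholds and budget in CPPO are derived from the policy-improvement bound in the paper and may take a different form.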