Beyond Uniform Token-Level Trust Regions in LLM Reinforcement Learning

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for improving LLM reasoning. However, existing PPO-style trust-region mechanisms remain position-agnostic: they enforce a uniform threshold on every token independently. This pointwise treatment conflicts with autoregressive generation in two critical ways. First, uniform thresholds ignore autoregressive asymmetry. Early-stage deviations produce compounding sequence-level drift, so a static threshold under-constrains early divergence while over-constraining late-stage exploration. Second, evaluating token-level divergence in isolation overlooks cumulative prefix drift, granting the same divergence allowance regardless of how far the conditioning history has already deviated from the rollout policy. To address this limitation, we propose CPPO (Cumulative Prefix-divergence Policy Optimization), a token-level masking rule that aligns updates with a finite-horizon policy-improvement bound via two coupled mechanisms. The first is a position-weighted threshold that imposes stricter limits at early positions, whose effects persist longer, and relaxes constraints for late-stage tokens. The second is a cumulative prefix budget that tracks historical deviations and dynamically restricts further token-level deviation to prevent compounding errors along the prefix. Empirically, CPPO enhances training stability and significantly improves reasoning accuracy across various model scales.
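
To make the two coupled mechanisms concrete, the following is a minimal sketch of what such a token-level masking rule could look like. The abstract does not give the actual formulas, so everything here is an assumption for illustration only: the function name `cppo_style_mask`, the absolute log-ratio as the deviation measure, the linear threshold schedule, and the parameters `base_eps` and `budget` are hypothetical and should not be read as the method's definition.

```python
def cppo_style_mask(logp_new, logp_old, base_eps=0.2, budget=1.0):
    """Return a 0/1 keep-mask for the tokens of one sampled sequence.

    Hypothetical rule (NOT taken from the paper):
      * deviation d_t = |log pi_new(y_t) - log pi_old(y_t)|
      * position-weighted threshold eps_t = base_eps * (t + 1) / T,
        so early tokens get a tighter limit than late tokens
      * cumulative prefix budget: deviation already accumulated over
        tokens 0..t-1 is subtracted from `budget`, and token t is kept
        only if d_t also fits in what remains
    """
    T = len(logp_new)
    mask = []
    prefix_drift = 0.0  # total deviation of the prefix before token t
    for t in range(T):
        d_t = abs(logp_new[t] - logp_old[t])
        eps_t = base_eps * (t + 1) / T               # assumed linear schedule
        remaining = max(budget - prefix_drift, 0.0)  # budget left for token t
        mask.append(1 if d_t <= min(eps_t, remaining) else 0)
        # The prefix has drifted whether or not token t is masked.
        prefix_drift += d_t
    return mask


# Rollout-policy and current-policy log-probabilities of the sampled tokens.
logp_old = [-1.0, -1.0, -1.0, -1.0]
logp_new = [-1.1, -1.0, -1.1, -1.1]

print(cppo_style_mask(logp_new, logp_old))               # [0, 1, 1, 1]
print(cppo_style_mask(logp_new, logp_old, budget=0.25))  # [0, 1, 1, 0]
```

In the first call, the same deviation of about 0.1 is masked at position 0 but kept at positions 2 and 3, which is the position-weighted behavior. In the second call, the tighter budget is nearly exhausted by the time the last token is reached, so that token is masked even though it passes its own positional threshold.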