
Near-Future Policy Optimization

April 22, 2026
作者: Chuanyu Qin, Chenxu Yang, Qingyi Si, Naibin Gu, Dingyu Yao, Zheng Lin, Peng Fu, Nan Duan, Jiaqi Wang
cs.AI

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a core post-training recipe. Introducing suitable off-policy trajectories into on-policy exploration accelerates RLVR convergence and raises the performance ceiling, yet finding a source of such trajectories remains the key challenge. Existing mixed-policy methods either import trajectories from external teachers (high quality but distributionally far) or replay past training trajectories (close but capped in quality), and neither simultaneously satisfies the strong-enough (higher Q, more new knowledge to learn) and close-enough (lower V, more readily absorbed) conditions required to maximize the effective learning signal S = Q/V. We propose Near-Future Policy Optimization (NPO), a simple mixed-policy scheme that learns from a policy's own near-future self: a later checkpoint from the same training run is a natural source of auxiliary trajectories that is both stronger than the current policy and closer than any external source, directly balancing trajectory quality against variance cost. We validate NPO through two manual interventions, early-stage bootstrapping and late-stage plateau breakthrough, and further propose AutoNPO, an adaptive variant that automatically triggers interventions from online training signals and selects the guide checkpoint that maximizes S. On Qwen3-VL-8B-Instruct with GRPO, NPO improves average performance from 57.88 to 62.84, and AutoNPO pushes it to 63.15, raising the final performance ceiling while accelerating convergence.
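The checkpoint-selection criterion above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the Q/V proxies (reward advantage and a distance/variance estimate), and the toy numbers are all hypothetical, chosen only to show how maximizing S = Q/V favors a near-future checkpoint over an external teacher or a past checkpoint.

```python
def effective_signal(quality_gain: float, variance_cost: float,
                     eps: float = 1e-8) -> float:
    """S = Q / V: new knowledge carried per unit of absorption difficulty."""
    return quality_gain / (variance_cost + eps)

def select_guide_checkpoint(candidates):
    """Pick the candidate with the largest S = Q/V.

    `candidates` is a list of (name, Q, V) tuples, where Q proxies the
    reward advantage of a candidate's trajectories over the current
    policy and V proxies the distributional distance between the two.
    """
    best = max(candidates, key=lambda c: effective_signal(c[1], c[2]))
    return best[0]

# Toy values: an external teacher is strong but far (large Q, large V),
# a past checkpoint is close but capped (tiny Q), while a near-future
# checkpoint is both stronger and close, so its S is highest.
candidates = [
    ("external_teacher", 0.9, 3.0),  # S = 0.30
    ("past_checkpoint", 0.1, 0.2),   # S = 0.50
    ("near_future_ckpt", 0.4, 0.5),  # S = 0.80
]
print(select_guide_checkpoint(candidates))  # near_future_ckpt
```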