Near-Future Policy Optimization

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a core post-training recipe. Introducing suitable off-policy trajectories into on-policy exploration accelerates RLVR convergence and raises the performance ceiling, yet finding a source of such trajectories remains the key challenge. Existing mixed-policy methods either import trajectories from external teachers (high-quality but distributionally far) or replay past training trajectories (distributionally close but capped in quality). Neither simultaneously satisfies the two conditions required to maximize the effective learning signal S = Q/V: being "strong enough" (higher Q, more new knowledge to learn) and "close enough" (lower V, more readily absorbed). We propose Near-Future Policy Optimization (NPO), a simple mixed-policy scheme that learns from a policy's own near-future self: a later checkpoint from the same training run is a natural source of auxiliary trajectories that is both stronger than the current policy and closer than any external source, directly balancing trajectory quality against variance cost. We validate NPO through two manual interventions, early-stage bootstrapping and late-stage plateau breakthrough, and further propose AutoNPO, an adaptive variant that automatically triggers interventions from online training signals and selects the guide checkpoint that maximizes S. On Qwen3-VL-8B-Instruct with GRPO, NPO improves average performance from 57.88 to 62.84, and AutoNPO pushes it to 63.15, raising the final performance ceiling while accelerating convergence.
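
To make the selection rule concrete, the following is a minimal sketch of picking a guide checkpoint by maximizing S = Q/V. It is an illustration only: the abstract does not say how Q and V are estimated, so the sketch assumes they are already available as one scalar pair per candidate checkpoint. The names `Candidate`, `learning_signal`, and `select_guide`, the epsilon guard, the Q > 0 filter, and all numbers are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """A hypothetical later checkpoint with pre-computed scores (illustrative)."""
    step: int   # training step at which the checkpoint was saved
    q: float    # assumed quality gain over the current policy (higher is better)
    v: float    # assumed variance cost of learning from it (lower is better)


def learning_signal(c: Candidate, eps: float = 1e-8) -> float:
    """Return S = Q / V; eps is an added guard against division by zero."""
    return c.q / max(c.v, eps)


def select_guide(candidates: Iterable[Candidate]) -> Optional[Candidate]:
    """Return the candidate with the largest S.

    Assumption of this sketch: candidates that are not stronger than the
    current policy (Q <= 0) are skipped. Returns None if nothing qualifies.
    """
    useful = [c for c in candidates if c.q > 0]
    return max(useful, key=learning_signal, default=None)


if __name__ == "__main__":
    # Made-up numbers chosen only to show the quality/closeness trade-off.
    pool = [
        Candidate(step=100, q=0.02, v=0.10),  # close, but barely stronger
        Candidate(step=300, q=0.08, v=0.20),  # balanced
        Candidate(step=900, q=0.12, v=0.60),  # strongest, but far away
    ]
    best = select_guide(pool)
    print(best.step)  # prints 300
```

In this toy pool the strongest checkpoint (step 900) is not selected, because its larger V outweighs its larger Q; the middle checkpoint gives the highest ratio, which mirrors the "strong enough yet close enough" argument above.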