

Near-Future Policy Optimization

April 22, 2026
Authors: Chuanyu Qin, Chenxu Yang, Qingyi Si, Naibin Gu, Dingyu Yao, Zheng Lin, Peng Fu, Nan Duan, Jiaqi Wang
cs.AI

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a core post-training recipe. Introducing suitable off-policy trajectories into on-policy exploration accelerates RLVR convergence and raises the performance ceiling, yet finding a source of such trajectories remains the key challenge. Existing mixed-policy methods either import trajectories from external teachers (high quality but distributionally far) or replay past training trajectories (close but capped in quality); neither simultaneously satisfies the "strong enough" (higher Q, more new knowledge to learn) and "close enough" (lower V, more readily absorbed) conditions required to maximize the effective learning signal S = Q/V. We propose Near-Future Policy Optimization (NPO), a simple mixed-policy scheme that learns from a policy's own near-future self: a later checkpoint from the same training run is a natural source of auxiliary trajectories that is both stronger than the current policy and closer than any external source, directly balancing trajectory quality against variance cost. We validate NPO through two manual interventions, early-stage bootstrapping and late-stage plateau breakthrough, and further propose AutoNPO, an adaptive variant that automatically triggers interventions from online training signals and selects the guide checkpoint that maximizes S. On Qwen3-VL-8B-Instruct with GRPO, NPO improves average performance from 57.88 to 62.84, and AutoNPO pushes it further to 63.15, raising the final performance ceiling while accelerating convergence.
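To make the S = Q/V selection rule concrete, the Python sketch below scores candidate near-future checkpoints by an estimated quality gain Q over a distributional-closeness cost V and picks the maximizer of their ratio. This is a minimal illustration, not the paper's implementation: estimate_quality and estimate_divergence are hypothetical placeholders, with toy curves chosen only so that quality saturates while distributional distance keeps growing, reproducing the trade-off the abstract describes.

    # Minimal sketch of guide-checkpoint selection via S = Q / V.
    # estimate_quality / estimate_divergence are hypothetical stand-ins
    # for the paper's actual online estimators.
    import math
    from dataclasses import dataclass

    @dataclass
    class Checkpoint:
        step: int   # training step at which the checkpoint was saved
        path: str   # location of the saved weights

    def estimate_quality(ckpt: Checkpoint, current_step: int) -> float:
        """Proxy for Q: expected strength gain over the current policy.
        Toy model: gains saturate as checkpoints get farther ahead."""
        delta = ckpt.step - current_step
        return 1.0 - math.exp(-delta / 400.0)

    def estimate_divergence(ckpt: Checkpoint, current_step: int) -> float:
        """Proxy for V: distributional distance (e.g., a KL estimate) to
        the current policy. Toy model: grows linearly with distance."""
        delta = ckpt.step - current_step
        return 1.0 + delta / 800.0

    def select_guide(checkpoints: list[Checkpoint],
                     current_step: int) -> Checkpoint:
        """Among future checkpoints, pick the maximizer of S = Q / V."""
        future = [c for c in checkpoints if c.step > current_step]
        return max(
            future,
            key=lambda c: estimate_quality(c, current_step)
            / estimate_divergence(c, current_step),
        )

    # With these toy curves the middle checkpoint wins: the farthest one
    # is stronger (higher Q) but too far distributionally (higher V).
    ckpts = [Checkpoint(1200, "ckpt_1200"),
             Checkpoint(1600, "ckpt_1600"),
             Checkpoint(2400, "ckpt_2400")]
    print(select_guide(ckpts, current_step=1000).step)  # -> 1600

Under these assumed curves, S is 0.31 at step 1200, 0.44 at step 1600, and 0.35 at step 2400, so the intermediate checkpoint is selected: it captures the abstract's point that the best guide is strong enough to carry new knowledge yet close enough to be absorbed.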