Near-Future Policy Optimization

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a core post-training recipe. Introducing suitable off-policy trajectories into on-policy exploration accelerates RLVR convergence and raises the performance ceiling, yet finding a source of such trajectories remains the key challenge. Existing mixed-policy methods either import trajectories from external teachers (high quality but distributionally far) or replay past training trajectories (distributionally close but capped in quality). Neither simultaneously satisfies the two conditions required to maximize the effective learning signal S = Q/V: "strong enough" (higher Q, more new knowledge to learn) and "close enough" (lower V, more readily absorbed). We propose Near-Future Policy Optimization (NPO), a simple mixed-policy scheme that learns from a policy's own near-future self: a later checkpoint from the same training run is a natural source of auxiliary trajectories that is both stronger than the current policy and closer than any external source, directly balancing trajectory quality against variance cost. We validate NPO through two manual interventions, early-stage bootstrapping and late-stage plateau breakthrough, and further propose AutoNPO, an adaptive variant that automatically triggers interventions from online training signals and selects the guide checkpoint that maximizes S. On Qwen3-VL-8B-Instruct with GRPO, NPO improves average performance from 57.88 to 62.84, and AutoNPO pushes it to 63.15, raising the final performance ceiling while accelerating convergence.
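The selection rule described above (pick the guide checkpoint that maximizes S = Q/V) can be illustrated with a minimal sketch. The abstract does not say how Q and V are estimated, so the `Candidate` fields, the `learning_signal` and `select_guide` helpers, and the numbers below are all hypothetical placeholders, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """A later checkpoint considered as a guide (all fields are hypothetical)."""
    step: int        # training step at which the checkpoint was saved
    quality: float   # Q: assumed estimate of trajectory quality
    variance: float  # V: assumed estimate of the variance cost


def learning_signal(c: Candidate, eps: float = 1e-8) -> float:
    """Effective learning signal S = Q / V, guarded against division by zero."""
    return c.quality / max(c.variance, eps)


def select_guide(candidates: Sequence[Candidate], current_step: int) -> Optional[Candidate]:
    """Return the near-future checkpoint with the largest S, or None if none exists."""
    # Only checkpoints saved after the current step count as "near-future".
    future = [c for c in candidates if c.step > current_step]
    if not future:
        return None
    return max(future, key=learning_signal)


# Illustrative numbers only: a farther checkpoint is stronger (higher Q)
# but also costlier to absorb (higher V).
candidates = [
    Candidate(step=100, quality=0.52, variance=1.00),
    Candidate(step=200, quality=0.70, variance=1.25),
    Candidate(step=400, quality=0.80, variance=2.00),
]
guide = select_guide(candidates, current_step=50)
```

With these made-up values the intermediate checkpoint wins: it is stronger than the nearest one without paying the variance cost of the farthest one, which is the trade-off the ratio S = Q/V is meant to capture.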