Learning to Reason under Off-Policy Guidance

Abstract

Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning (RL) with simple rule-based rewards. However, existing zero-RL approaches are inherently "on-policy": they limit learning to a model's own outputs and therefore fail to acquire reasoning abilities beyond the model's initial capabilities. We introduce LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments zero-RL with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. In particular, we propose policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training. LUFFY achieves an average gain of over +7.0 across six math benchmarks and an advantage of over +6.2 points on out-of-distribution tasks. It also substantially surpasses imitation-based supervised fine-tuning (SFT), particularly in generalization. Analysis shows that LUFFY not only imitates effectively but also explores beyond the demonstrations, offering a scalable path to training generalizable reasoning models with off-policy guidance.
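
The two mechanisms named in the abstract, mixing off-policy demonstrations with on-policy rollouts and shaping the importance weight, can be illustrated with a minimal Python sketch. Everything concrete here is an assumption for illustration and is not stated in the abstract: the group-normalized advantage, the shaping form f(p) = p / (p + gamma), the value gamma = 0.1, and the function names `group_advantages`, `shape`, and `shape_gradient`.

```python
import statistics


def group_advantages(on_policy_rewards, off_policy_rewards):
    """Normalize rewards over one mixed group (assumed GRPO-style).

    On-policy rollouts and off-policy demonstrations for the same prompt
    are pooled, so a correct demonstration gets a positive advantage
    whenever the model's own rollouts mostly fail.
    """
    rewards = list(on_policy_rewards) + list(off_policy_rewards)
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All rewards equal: no learning signal from this group.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


def shape(prob, gamma=0.1):
    """Hypothetical shaping function f(p) = p / (p + gamma).

    `prob` is the current policy's probability of a demonstration token.
    """
    return prob / (prob + gamma)


def shape_gradient(prob, gamma=0.1):
    """Derivative f'(p) = gamma / (p + gamma)^2.

    The derivative is larger for low-probability tokens, so unlikely
    tokens in the demonstration still receive a meaningful update.
    """
    return gamma / (prob + gamma) ** 2


if __name__ == "__main__":
    # Two failed rollouts, one successful rollout, one correct demonstration.
    print(group_advantages([0.0, 0.0, 1.0], [1.0]))
    # Gradient scale for a rare token versus a confident token.
    print(shape_gradient(0.01), shape_gradient(0.9))
```

Under these assumptions, the shaped weight gives more gradient to demonstration tokens the current policy finds unlikely, which is one plausible reading of how shaping could discourage superficial imitation of only the easy tokens.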
