Small Models Are Natural Explorers for Policy-Level Diversity in GRPO

Abstract

We identify a new dimension for enhancing rollout diversity in Group Relative Policy Optimization (GRPO) for LLMs. GRPO relies on diverse rollouts, yet prevailing strategies increase diversity mainly by injecting more token-level randomness, which can introduce step-wise noise and lead to incoherent trajectories. We find that smaller models within the same model family inherently exhibit higher policy-level diversity, as indicated by their pass@k overtaking that of their larger counterparts as the number of samples increases. Unlike token-level noise, this diversity is temporally correlated, preserves logical consistency, and provides structured exploration signals for gradient estimation. We therefore propose S2L-PO (Small-to-Large Policy Optimization), a framework that uses fixed small models as natural explorers to train larger models. To balance exploration and exploitation, we design a progressive annealing strategy that transitions from offline small-model rollouts to the large learner's own sampling. This transition avoids the mid-training performance drop caused by the small model's limited capacity, converges faster, and reaches a higher performance ceiling. S2L-PO improves accuracy on diverse mathematical reasoning benchmarks (e.g., +8.8% on AIME 24 when a 1.7B explorer guides an 8B model) while reducing rollout compute.
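
The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. `pass_at_k` is the standard unbiased pass@k estimator computed from n samples of which c are correct, and `group_relative_advantages` is the usual GRPO-style normalization of rewards within one rollout group. The annealing part is purely an assumption: the abstract does not state the shape of the schedule, so `explorer_share` uses a hypothetical linear decay, and the names `explorer_share`, `split_group`, and `anneal_steps` are invented for illustration.

```python
import math
from statistics import mean, pstdev


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def explorer_share(step: int, anneal_steps: int) -> float:
    """Fraction of a rollout group taken from the fixed small model.

    ASSUMPTION: linear decay from 1.0 to 0.0 over `anneal_steps`;
    the source does not specify the actual schedule.
    """
    if anneal_steps <= 0:
        return 0.0
    return max(0.0, 1.0 - step / anneal_steps)


def split_group(group_size: int, step: int, anneal_steps: int) -> tuple[int, int]:
    """Return (small-model rollouts, learner rollouts) for one prompt."""
    n_small = round(group_size * explorer_share(step, anneal_steps))
    return n_small, group_size - n_small


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Under this assumed schedule, a group of 8 rollouts is drawn entirely from the small explorer at step 0 and entirely from the large learner once `step` reaches `anneal_steps`. Comparing `pass_at_k` curves of a small and a large model as k grows is one way to probe the policy-level diversity claim.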