Smaller Models Are Natural Explorers of Policy-Level Diversity in GRPO

Abstract

We identify a new dimension for improving rollout diversity in Group Relative Policy Optimization (GRPO) for large language models (LLMs). GRPO relies on diverse rollouts, but prevailing strategies raise diversity mainly by injecting more token-level randomness, which can introduce step-wise noise and produce incoherent trajectories. We find that smaller models within the same model family inherently exhibit higher policy-level diversity, as indicated by their pass@k surpassing that of larger counterparts as the number of samples grows. Unlike token-level noise, this diversity is temporally correlated, preserves logical consistency, and provides structured exploration signals for gradient estimation. We therefore propose S2L-PO (Small-to-Large Policy Optimization), a framework that uses fixed small models as natural explorers to train larger models. To balance exploration and exploitation, we design a progressive annealing strategy that transitions smoothly from offline small-model rollouts to the large learner's own sampling. This transition avoids the mid-training performance drop caused by the small model's limited capacity, yielding faster convergence and a higher performance ceiling. S2L-PO improves accuracy on a range of mathematical reasoning benchmarks (e.g., +8.8% on AIME 24 when a 1.7B explorer guides an 8B model) while reducing rollout compute.
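The quantities named in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that the abstract does not confirm: pass@k is computed with the standard unbiased combinatorial estimator, advantages are the usual within-group standardized rewards used in GRPO, and the annealing schedule is linear in the training step. The abstract only says the schedule is progressive, so the linear form and all function names here are hypothetical.

```python
from math import comb
from statistics import mean, pstdev


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k
    # subset of the n samples contains no correct answer.
    return 1.0 - comb(n - c, k) / comb(n, k)


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one rollout group (GRPO-style advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def explorer_rollout_count(step: int, total_steps: int, group_size: int) -> int:
    """Hypothetical linear annealing (an assumption, not the paper's schedule).

    Returns how many rollouts in a group come from the fixed small explorer;
    the remaining group_size - count are sampled from the large learner.
    """
    frac = max(0.0, 1.0 - step / total_steps)
    return round(frac * group_size)
```

As a toy illustration with invented counts (not results from the paper): if one model solves each of two problems in 2 of 16 samples while another solves the first in 12 of 16 and the second in 0 of 16, averaging `pass_at_k` over the two problems favors the second model at k = 1 (0.375 versus 0.125) and the first at k = 8 (about 0.77 versus 0.5). This is the kind of crossover the abstract describes as evidence of higher policy-level diversity.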