Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO
June 2, 2026
Authors: Yiming Ren, Yiran Xu, Zicheng Lin, Chufan Shi, Yukang Chen, Dingdong Wang, Tianhe Wu, Junjie Wang, Yujiu Yang, Yu Qiao, Ruihang Chu
cs.AI
Abstract
We identify a new dimension for enhancing rollout diversity in Group Relative Policy Optimization (GRPO) for LLMs. While GRPO relies on diverse rollouts, prevailing strategies primarily increase diversity by injecting more token-level randomness, which may introduce step-wise noise and lead to incoherent trajectories. We uncover that smaller models within the same model family inherently exhibit higher policy-level diversity, indicated by their superior pass@k relative to larger counterparts as sample counts increase. Unlike token-level noise, this diversity is temporally correlated, preserves logical consistency, and provides structured exploration signals for gradient estimation. We thus propose S2L-PO (Small-to-Large Policy Optimization), a framework that leverages fixed small models as natural explorers to train larger models. To balance exploration and exploitation, we design a progressive annealing strategy that transitions from offline small-model rollouts to the large learner's own sampling. This shift elegantly avoids mid-training performance drops caused by the small model's capacity limits, achieving faster convergence and unlocking a higher performance ceiling. S2L-PO improves accuracy on diverse mathematical reasoning benchmarks (e.g., +8.8% on AIME 24 using a 1.7B explorer to guide the 8B model) while reducing rollout compute.
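Two ideas in the abstract lend themselves to a small illustration: the pass@k metric used as evidence of policy-level diversity, and the progressive shift from small-model rollouts to the learner's own samples. The sketch below is not the authors' implementation. `pass_at_k` uses the standard unbiased combinatorial estimator, which the abstract does not state it uses. The linear schedule in `explorer_fraction`, the `anneal_steps` parameter, and the `split_group` helper are all hypothetical, since the abstract does not specify the form of the annealing.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one correct one.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a hit.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def explorer_fraction(step: int, anneal_steps: int) -> float:
    """Hypothetical linear annealing schedule (an assumption, not from the paper).

    Returns the share of each rollout group drawn from the fixed small
    explorer: 1.0 at step 0, decaying linearly to 0.0 at anneal_steps.
    """
    if anneal_steps <= 0:
        return 0.0
    return max(0.0, 1.0 - step / anneal_steps)


def split_group(group_size: int, step: int, anneal_steps: int) -> tuple[int, int]:
    """Split one GRPO rollout group into (explorer rollouts, learner rollouts)."""
    n_explorer = round(group_size * explorer_fraction(step, anneal_steps))
    return n_explorer, group_size - n_explorer


if __name__ == "__main__":
    # A model with 1 correct answer out of 4 samples.
    print(pass_at_k(n=4, c=1, k=1), pass_at_k(n=4, c=1, k=2))
    # Group composition early, midway, and after annealing finishes.
    for step in (0, 50, 100):
        print(step, split_group(group_size=8, step=step, anneal_steps=100))
```

Under this assumed schedule, a group of 8 rollouts starts entirely from the explorer, is split evenly halfway through annealing, and comes entirely from the learner once `anneal_steps` is reached. Other decay shapes (cosine, step-wise) would fit the abstract's description equally well.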