Smaller Models Are Natural Explorers of Policy-Level Diversity in GRPO

Abstract

We identify a new dimension for increasing rollout diversity in Group Relative Policy Optimization (GRPO) for large language models (LLMs). GRPO relies on diverse rollouts, but prevailing strategies increase diversity mainly by injecting more token-level randomness, which can introduce step-wise noise and produce incoherent trajectories. We find that smaller models within the same model family inherently exhibit higher policy-level diversity: their pass@k overtakes that of larger counterparts as the number of samples grows. Unlike token-level noise, this diversity is temporally correlated, preserves logical consistency, and provides structured exploration signals for gradient estimation. We therefore propose S2L-PO (Small-to-Large Policy Optimization), a framework that uses fixed small models as natural explorers to train larger models. To balance exploration and exploitation, we design a progressive annealing strategy that transitions from offline small-model rollouts to the large learner's own sampling. This transition avoids the mid-training performance drop caused by the small model's limited capacity, yielding faster convergence and a higher performance ceiling. S2L-PO improves accuracy on diverse mathematical reasoning benchmarks (e.g., +8.8% on AIME 24 when a 1.7B explorer guides an 8B model) while reducing rollout compute.
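The three quantities the abstract relies on (pass@k, the annealed mix of explorer and learner rollouts, and the group-relative advantage) can be illustrated with a minimal Python sketch. It rests on several assumptions: `pass_at_k` is the widely used unbiased combinatorial estimator, not necessarily the one used in this work; `explorer_fraction` and `split_group` are hypothetical names, and the linear schedule is only a stand-in because the abstract does not specify the annealing form; `group_relative_advantages` uses the common GRPO normalization with a population standard deviation, which is one of several conventions.

```python
import math
from statistics import mean, pstdev


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def explorer_fraction(step: int, anneal_steps: int) -> float:
    """Share of rollouts taken from the small explorer at a training step.

    ASSUMPTION: linear decay from 1 to 0; the abstract only says the
    transition is progressive.
    """
    if anneal_steps <= 0:
        return 0.0
    return max(0.0, 1.0 - step / anneal_steps)


def split_group(group_size: int, step: int, anneal_steps: int) -> tuple[int, int]:
    """Return (explorer rollouts, learner rollouts) for one GRPO group."""
    n_explorer = round(group_size * explorer_fraction(step, anneal_steps))
    return n_explorer, group_size - n_explorer


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumed convention
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Under these assumptions, a group of 8 rollouts with a 100-step anneal draws all 8 from the explorer at step 0, 4 from each source at step 50, and all 8 from the learner from step 100 onward. The rewards of the mixed group would then be normalized together by `group_relative_advantages`.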