Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO

June 2, 2026
Authors: Yiming Ren, Yiran Xu, Zicheng Lin, Chufan Shi, Yukang Chen, Dingdong Wang, Tianhe Wu, Junjie Wang, Yujiu Yang, Yu Qiao, Ruihang Chu
cs.AI

Abstract

We identify a new dimension for enhancing rollout diversity in Group Relative Policy Optimization (GRPO) for LLMs. While GRPO relies on diverse rollouts, prevailing strategies primarily increase diversity by injecting more token-level randomness, which may introduce step-wise noise and lead to incoherent trajectories. We find that smaller models within the same model family inherently exhibit higher policy-level diversity, as indicated by their pass@k surpassing that of larger counterparts as the number of samples increases. Unlike token-level noise, this diversity is temporally correlated, preserves logical consistency, and provides structured exploration signals for gradient estimation. We therefore propose S2L-PO (Small-to-Large Policy Optimization), a framework that uses fixed small models as natural explorers to train larger models. To balance exploration and exploitation, we design a progressive annealing strategy that transitions from offline small-model rollouts to the large learner's own sampling. This transition avoids the mid-training performance drops caused by the small model's capacity limits, yielding faster convergence and a higher performance ceiling. S2L-PO improves accuracy on diverse mathematical reasoning benchmarks (e.g., +8.8% on AIME 24 using a 1.7B explorer to guide the 8B model) while reducing rollout compute.
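
The three quantities the abstract relies on can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation. `pass_at_k` is the standard unbiased estimator commonly used for pass@k. `group_relative_advantages` is the usual GRPO-style standardization of rewards within one rollout group. `explorer_rollout_count` is a hypothetical linear annealing schedule: the abstract says rollouts shift from the small explorer to the large learner but does not state the schedule, so the linear form, the function name, and its parameters are assumptions.

```python
import math
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Subtract the probability that a random size-k subset has no correct sample.
    # math.comb returns 0 when k > n - c, so the estimate is then exactly 1.0.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize each reward against the mean and std of its own rollout group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def explorer_rollout_count(step: int, anneal_steps: int, group_size: int) -> int:
    """ASSUMPTION: linear annealing, not taken from the paper.

    Returns how many of the group's rollouts are drawn from the fixed small
    explorer at this step; the rest would come from the large learner itself.
    """
    explorer_fraction = max(0.0, 1.0 - step / anneal_steps)
    return round(explorer_fraction * group_size)


if __name__ == "__main__":
    # A model that solves a problem in 2 of 16 samples: pass@1 vs pass@8.
    print(pass_at_k(16, 2, 1), pass_at_k(16, 2, 8))
    # Binary correctness rewards for one group of four rollouts.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
    # Explorer share of an 8-rollout group at the start, midpoint, and end.
    print([explorer_rollout_count(s, 100, 8) for s in (0, 50, 100)])
```

Comparing `pass_at_k` for a small and a large model at growing `k` is one way to reproduce the kind of diversity comparison the abstract describes. When every rollout in a group earns the same reward, the advantages are all zero, which is why GRPO needs the group to be diverse.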