Smaller Models Are Natural Explorers for Policy-Level Diversity in GRPO

Abstract

We identify a new dimension for improving rollout diversity in Group Relative Policy Optimization (GRPO) for large language models (LLMs). GRPO relies on diverse rollouts, yet prevailing strategies increase diversity mainly by injecting more token-level randomness, which can introduce step-wise noise and produce incoherent trajectories. We find that smaller models within the same model family inherently exhibit higher policy-level diversity, as indicated by their superior pass@k relative to larger counterparts as the sample count increases. Unlike token-level noise, this diversity is temporally correlated, preserves logical consistency, and provides structured exploration signals for gradient estimation. We therefore propose S2L-PO (Small-to-Large Policy Optimization), a framework that uses fixed small models as natural explorers to train larger models. To balance exploration and exploitation, we design a progressive annealing strategy that transitions from offline small-model rollouts to the large learner's own sampling. This transition avoids the mid-training performance drops caused by the small model's capacity limits, yields faster convergence, and unlocks a higher performance ceiling. S2L-PO improves accuracy on diverse mathematical reasoning benchmarks (e.g., +8.8% on AIME 24 when a 1.7B explorer guides an 8B model) while reducing rollout compute.
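
The abstract names three ingredients without giving formulas: pass@k as the diversity indicator, group-relative advantages in GRPO, and a progressive annealing from small-model rollouts to the learner's own samples. The Python sketch below illustrates how these pieces might fit together. It is not the authors' implementation. The pass@k function uses the combinatorial unbiased estimator that is common in evaluation code, which the abstract does not confirm; the linear schedule is purely an assumption, since the abstract does not specify the annealing shape; and all function and parameter names (`pass_at_k`, `explorer_fraction`, `split_group`, `anneal_steps`) are hypothetical.

```python
import math
from statistics import mean, pstdev


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct.

    Probability that a random size-k subset of the n samples
    contains at least one correct sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must include a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def explorer_fraction(step: int, anneal_steps: int) -> float:
    """ASSUMED linear schedule: 1.0 at step 0, 0.0 from anneal_steps onward."""
    if anneal_steps <= 0:
        return 0.0
    return max(0.0, 1.0 - step / anneal_steps)


def split_group(group_size: int, step: int, anneal_steps: int) -> tuple[int, int]:
    """Return (small-model rollouts, learner rollouts) for one prompt group."""
    n_small = round(group_size * explorer_fraction(step, anneal_steps))
    return n_small, group_size - n_small


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: standardize rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when all rewards in the group are equal
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Under these assumptions, `split_group(8, 50, 100)` returns `(4, 4)`: halfway through the annealing window, half of each group of 8 rollouts would come from the fixed small explorer and half from the learner. Once `step` reaches `anneal_steps`, every rollout comes from the learner, which matches the abstract's description of ending on the large model's own sampling.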