Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

Abstract

In settings where labeled, verifiable training data is the binding constraint, each checked example should be allocated carefully. The standard practice is to spend this data directly on the model that will be deployed, for example by running GRPO on the deployment student. We argue that this is often an inefficient allocation because it overlooks a reward-density principle: sparse sequence-level reward should train models where exploration is productive, while dense token-level teacher reward should be used where the aim is to compress behavior into a smaller model. In this view, GRPO-style sparse RL and OPD-style dense teacher supervision are not separate recipes; they are different reward-density regimes. The allocation rule is simple: use scarce labeled training data upstream on the strongest model that can turn it into reward-shaped behavior, then transfer that behavior downstream as dense supervision. We evaluate this rule on verifiable math with Qwen3 and Llama models. At a fixed Qwen3-1.7B deployment-student size, an RL-improved 8B teacher distilled through the dense bridge outperforms direct GRPO on the same student, while transfer from the same teacher before RL underperforms. The bridge matters: a forward-KL warmup on teacher rollouts followed by OPD on student rollouts is consistently strongest on MATH before any post-bridge student-side sparse RL, and it also gives the best pre-Stage 3 AIME endpoints for the canonical 8B/14B teachers. The bridge also makes later student-side sparse RL effective: GRPO, which is weak on a cold student, lifts MATH from 75.4% to 78.5% after the bridge and outperforms a matched replay control by 2.8 percentage points. The operational principle is to avoid spending scarce labeled data on the least prepared policy: use sparse reward for teacher-side discovery, dense transfer for student compression, and student-side sparse reward only after the bridge.
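To make the contrast between the two reward-density regimes concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the usual group-relative normalization for GRPO-style advantages, and it assumes that the OPD signal is a per-token reverse KL between student and teacher; the abstract does not state which divergence OPD uses. All function names and the toy logits in the usage note are hypothetical.

```python
import math


def grpo_advantages(rewards, eps=1e-6):
    """Sparse regime: one scalar reward per sampled sequence.

    Each reward is normalized against the other samples for the same
    prompt (assumed group-relative form: subtract mean, divide by std).
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl(p_logits, q_logits):
    """KL(p || q) between two categorical distributions given as logits."""
    lp = log_softmax(p_logits)
    lq = log_softmax(q_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def forward_kl_per_token(teacher_logits, student_logits):
    """Dense regime, warmup: KL(teacher || student) at every position.

    Intended for sequences sampled from the teacher (teacher rollouts).
    """
    return [kl(t, s) for t, s in zip(teacher_logits, student_logits)]


def reverse_kl_per_token(teacher_logits, student_logits):
    """Dense regime, on-policy: KL(student || teacher) at every position.

    Intended for sequences sampled from the student (student rollouts).
    Using reverse KL here is an assumption, not taken from the abstract.
    """
    return [kl(s, t) for t, s in zip(teacher_logits, student_logits)]
```

As a usage illustration, `grpo_advantages([1.0, 0.0, 0.0, 1.0])` yields one signed scalar per sequence, while `grpo_advantages([0.0, 0.0, 0.0, 0.0])` yields all zeros: when every sample in a group fails, the sparse signal carries no gradient information. This is one plausible reading of why sparse RL may be weak on an unprepared student. By contrast, `forward_kl_per_token([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]], [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0]])` returns one value per token position, so the student receives a signal at every position even when no sampled sequence would be scored as correct.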
