Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

Abstract

When labeled, verifiable training data is the binding constraint, each checked example should be allocated carefully. The standard practice is to spend this data directly on the model that will be deployed, for example by running GRPO on the deployment student. We argue that this allocation is often inefficient because it overlooks a reward-density principle: sparse sequence-level reward should train models where exploration is productive, while dense token-level teacher reward should be used where the aim is to compress behavior into a smaller model. In this view, GRPO-style sparse RL and on-policy distillation (OPD)-style dense teacher supervision are not separate recipes; they are different reward-density regimes. The allocation rule is simple: spend scarce labeled training data upstream on the strongest model that can turn it into reward-shaped behavior, then transfer that behavior downstream as dense supervision. We evaluate this rule on verifiable math with Qwen3 and Llama models. At a fixed Qwen3-1.7B deployment-student size, an RL-improved 8B teacher distilled through the dense bridge outperforms direct GRPO on the same student, while transfer from the same teacher before RL underperforms. The bridge matters: a forward-KL warmup on teacher rollouts followed by OPD on student rollouts is consistently strongest on MATH before any post-bridge student-side sparse RL, and also gives the best pre-Stage 3 AIME endpoints for the canonical 8B/14B teachers. The bridge also makes later student-side sparse RL effective: GRPO that is weak on a cold student lifts MATH from 75.4% to 78.5% after the bridge and outperforms a matched replay control by 2.8 points. The operational principle is to avoid spending scarce labeled data on the least prepared policy: use sparse reward for teacher-side discovery, dense transfer for student compression, and student-side sparse reward only after the bridge.
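The contrast between the two reward-density regimes can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the group-normalized advantage, and the per-token log-ratio signal are assumptions chosen to mirror common GRPO-style and on-policy-distillation-style formulations. The sparse signal is one scalar per sampled sequence, while the dense signal is one value per generated token.

```python
from statistics import mean, pstdev


def sparse_group_advantages(rewards, eps=1e-6):
    """Sparse regime (GRPO-style, assumed form): one verifier reward per
    sampled sequence, normalized within the group of samples for a prompt.
    Every token in a sequence would share that sequence's single scalar."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dense_token_signal(student_logprobs, teacher_logprobs):
    """Dense regime (OPD-style, assumed form): one signal per token of a
    student rollout, here log p_teacher(token) - log p_student(token).
    Positive values mean the teacher rates the token higher than the student."""
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("log-prob lists must cover the same tokens")
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]


if __name__ == "__main__":
    # Four rollouts for one prompt, scored 1 (correct) or 0 (incorrect).
    print(sparse_group_advantages([1.0, 0.0, 0.0, 1.0]))
    # A group where every rollout fails gives no learning signal at all.
    print(sparse_group_advantages([0.0, 0.0, 0.0, 0.0]))
    # Per-token log-probs of one student rollout under student and teacher.
    print(dense_token_signal([-2.0, -0.5, -1.0], [-1.0, -0.5, -3.0]))
```

The all-zero group shows why sparse reward can be unproductive on a policy that rarely solves a problem: with identical rewards, every advantage is zero. The dense signal stays informative at every token regardless of whether the rollout was correct.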
