Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
May 12, 2026
Authors: Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard
cs.AI
Abstract
In settings where labeled verifiable training data is the binding constraint, each checked example should be allocated carefully. The standard practice is to use this data directly on the model that will be deployed, for example by running GRPO on the deployment student. We argue that this is often an inefficient allocation because it overlooks a reward-density principle: sparse sequence-level reward should train models where exploration is productive, while dense token-level teacher reward should be used where the aim is to compress behavior into a smaller model. In this view, GRPO-style sparse RL and OPD-style dense teacher supervision are not separate recipes; they are different reward-density regimes. The allocation rule is simple: use scarce labeled training data upstream on the strongest model that can turn it into reward-shaped behavior, then transfer that behavior downstream as dense supervision. We evaluate this rule on verifiable math with Qwen3 and Llama models. At a fixed Qwen3-1.7B deployment-student size, an RL-improved 8B teacher distilled through the dense bridge outperforms direct GRPO on the same student, while transfer from the same teacher before RL underperforms. The bridge is important: a forward-KL warmup on teacher rollouts followed by OPD on student rollouts is consistently strongest on MATH before any post-bridge student-side sparse RL, and also gives the best pre-Stage 3 AIME endpoints for the canonical 8B/14B teachers. The bridge also makes later student-side sparse RL effective: GRPO that is weak on a cold student lifts MATH from 75.4% to 78.5% after the bridge and outperforms a matched replay control by 2.8 points. The operational principle is to avoid using scarce labeled data on the least prepared policy: use sparse reward for teacher-side discovery, dense transfer for student compression, and student-side sparse reward only after the bridge.
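
To make the reward-density contrast concrete, the following is a minimal sketch, not the paper's implementation: it assumes PyTorch, and the function names grpo_advantages and dense_kl_loss are illustrative. The sparse regime produces one verifier-derived scalar per sampled sequence, normalized within its group as in GRPO; the dense regime produces a teacher KL signal at every token of a rollout.

    # Minimal sketch (assumptions labeled in comments), not the paper's code.
    import torch
    import torch.nn.functional as F

    def grpo_advantages(sequence_rewards: torch.Tensor) -> torch.Tensor:
        # Sparse regime: one verifier-derived scalar per sampled sequence,
        # normalized within the sampled group (GRPO-style advantage).
        mean, std = sequence_rewards.mean(), sequence_rewards.std()
        return (sequence_rewards - mean) / (std + 1e-6)

    def dense_kl_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      forward: bool = True) -> torch.Tensor:
        # Dense regime: a teacher KL signal at every token position of a rollout.
        # forward=True is KL(teacher || student), as in the forward-KL warmup on
        # teacher rollouts; forward=False is KL(student || teacher), one plausible
        # choice for OPD on student rollouts (the abstract does not fix the direction).
        t_logp = F.log_softmax(teacher_logits, dim=-1)
        s_logp = F.log_softmax(student_logits, dim=-1)
        if forward:
            kl = (t_logp.exp() * (t_logp - s_logp)).sum(dim=-1)
        else:
            kl = (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1)
        return kl.mean()

    # Toy usage: 4 pass/fail sequence rewards vs. per-token KL on a 32-token rollout.
    rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
    print(grpo_advantages(rewards))            # one learning signal per sequence
    student = torch.randn(32, 50_000)          # (tokens, vocab) logits
    teacher = torch.randn(32, 50_000)
    print(dense_kl_loss(student, teacher))     # one learning signal per token

The sketch is only about the granularity of the signal: GRPO backpropagates a single group-normalized scalar through an entire sequence, whereas the dense bridge supplies a target at every token, which is why the abstract treats the two as regimes of one principle rather than competing recipes.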