CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

Abstract

When a model produces a correct solution under reinforcement learning with verifiable rewards (RLVR), every token receives the same reward signal, regardless of whether it was a decisive reasoning step or grammatical filler. A natural fix is to condition the model on the correct answer as a teacher and identify the tokens it would have generated differently had it known the answer. Prior work shows that this either corrupts training by leaking the answer into the gradient, or produces a weak signal that cannot distinguish decisive steps from filler, since both look equally surprising relative to the model's baseline distribution. We propose Contrastive Evidence Policy Optimization (CEPO), which asks a sharper question at every token: not just "does the correct answer favor this token?" but "does the correct answer favor it while the wrong answer disfavors it?" A token satisfying both conditions is a genuine reasoning step; one satisfying neither is filler. The wrong-answer teacher is constructed from rejected rollouts already in the training batch, incurring no additional sampling cost. We prove that CEPO inherits all structural safety guarantees of the prior state of the art while strictly sharpening credit at decisive tokens, with the improvement vanishing exactly at filler positions. Empirically, CEPO achieves 43.43% and 60.56% average accuracy across five multimodal mathematical reasoning benchmarks at 2B and 4B scale, respectively, versus 41.17% and 57.43% for GRPO under identical training budgets. Distribution-matching self-distillation methods (OPSD, SDPO) fall below the untrained baseline, empirically confirming the information leakage our theory predicts. Our code is available at https://github.com/ahmedheakl/CEPO.
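The contrastive question posed at each token can be illustrated with a small sketch. The abstract does not state CEPO's actual formula, so everything below is an assumption made for illustration: the function name `contrastive_token_weights`, the log-probability-difference form of the evidence, the `tanh` squashing, and the `beta` parameter are all hypothetical and are not taken from the paper or its repository. The sketch assumes we already have, for each sampled token, its log-probability under a teacher conditioned on the correct answer and under a teacher conditioned on a wrong answer from a rejected rollout.

```python
import numpy as np

def contrastive_token_weights(logp_correct, logp_wrong, beta=0.5):
    """Hypothetical per-token credit weights from contrastive evidence.

    logp_correct: log-prob of each sampled token under the teacher
                  conditioned on the correct answer (assumed given).
    logp_wrong:   log-prob of the same tokens under the teacher
                  conditioned on a wrong answer (assumed given).
    beta:         illustrative sharpening strength, must be in [0, 1).
    """
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must be in [0, 1)")
    logp_correct = np.asarray(logp_correct, dtype=float)
    logp_wrong = np.asarray(logp_wrong, dtype=float)
    # Evidence is positive when the correct-answer teacher favors the token
    # and the wrong-answer teacher disfavors it; it is zero when both agree.
    evidence = logp_correct - logp_wrong
    # Bounded weights in (1 - beta, 1 + beta): always positive, so the sign
    # of the sequence-level advantage is never flipped at any token.
    return 1.0 + beta * np.tanh(evidence)

def token_advantages(sequence_advantage, logp_correct, logp_wrong, beta=0.5):
    """Spread one sequence-level advantage over tokens using the weights."""
    weights = contrastive_token_weights(logp_correct, logp_wrong, beta)
    return sequence_advantage * weights

# Toy rollout with three tokens: one decisive step and two filler tokens.
logp_correct = [-0.1, -2.0, -0.5]
logp_wrong = [-3.0, -2.0, -0.5]
weights = contrastive_token_weights(logp_correct, logp_wrong)
advantages = token_advantages(1.0, logp_correct, logp_wrong)
```

In this toy example the first token, which the two teachers disagree on, receives a weight above 1, while the two tokens on which the teachers agree receive a weight of exactly 1 and so fall back to the uniform sequence-level credit. That mirrors, in a simplified way, the property that the sharpening vanishes at filler positions; it makes no claim about the paper's actual guarantees.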