Harnessing Negative Signals: Reinforcement Distillation from Teacher Data for LLM Reasoning
May 30, 2025
Authors: Shuyao Xu, Cheng Peng, Jiangxuan Long, Weidi Xu, Wei Chu, Yuan Qi
cs.AI
Abstract
Recent advances in model distillation demonstrate that data from advanced
reasoning models (e.g., DeepSeek-R1, OpenAI's o1) can effectively transfer
complex reasoning abilities to smaller, efficient student models. However,
standard practices employ rejection sampling, discarding incorrect reasoning
examples -- valuable, yet often underutilized data. This paper addresses the
critical question: How can both positive and negative distilled reasoning
traces be effectively leveraged to maximize LLM reasoning performance in an
offline setting? To this end, we propose Reinforcement Distillation (REDI), a
two-stage framework. Stage 1 learns from positive traces via Supervised
Fine-Tuning (SFT). Stage 2 further refines the model using both positive and
negative traces through our proposed REDI objective. This novel objective is a
simple, reference-free loss function that outperforms established methods like
DPO and SimPO in this distillation context. Our empirical evaluations
demonstrate REDI's superiority over baseline Rejection Sampling SFT or SFT
combined with DPO/SimPO on mathematical reasoning tasks. Notably, the
Qwen-REDI-1.5B model, post-trained on just 131k positive and negative examples
from the open Open-R1 dataset, achieves an 83.1% score on MATH-500 (pass@1).
Its performance matches or surpasses that of DeepSeek-R1-Distill-Qwen-1.5B (a
model post-trained on 800k proprietary examples) across various mathematical
reasoning benchmarks, establishing a new state-of-the-art for 1.5B models
post-trained offline with openly available data.
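
The abstract states only that the Stage-2 REDI objective is a simple, reference-free loss over positive and negative distilled traces (unlike DPO, which requires a reference model). The following is a minimal illustrative sketch of one plausible loss of that kind in PyTorch, not the paper's exact formulation: the function names, the per-token length normalization, and the `alpha` down-weighting of the negative term are assumptions made for the example.

```python
# Illustrative sketch (assumed form): a reference-free pairwise loss that raises the
# likelihood of positive reasoning traces and lowers that of negative ones.
import torch

def sequence_logprob(logits, labels, mask):
    """Mean per-token log-probability of `labels` under `logits`.
    `mask` is a 0/1 float tensor marking response tokens (prompt tokens excluded)."""
    logps = torch.log_softmax(logits, dim=-1)                 # (B, T, V)
    token_logps = logps.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # (B, T)
    return (token_logps * mask).sum(-1) / mask.sum(-1).clamp(min=1)   # (B,)

def redi_style_loss(pos_logits, pos_labels, pos_mask,
                    neg_logits, neg_labels, neg_mask, alpha=0.8):
    """Reference-free Stage-2 loss sketch: maximize log-likelihood of positive traces,
    penalize negative traces with an assumed weight `alpha` < 1 for stability."""
    logp_pos = sequence_logprob(pos_logits, pos_labels, pos_mask)
    logp_neg = sequence_logprob(neg_logits, neg_labels, neg_mask)
    return (-logp_pos + alpha * logp_neg).mean()
```

In a full pipeline this objective would be applied after Stage 1, i.e., after standard SFT on the positive traces alone, mirroring the two-stage recipe described in the abstract.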