Harnessing Negative Signals: Reinforcement Distillation from Teacher Data for LLM Reasoning
May 30, 2025
Authors: Shuyao Xu, Cheng Peng, Jiangxuan Long, Weidi Xu, Wei Chu, Yuan Qi
cs.AI
Abstract
Recent advances in model distillation demonstrate that data from advanced
reasoning models (e.g., DeepSeek-R1, OpenAI's o1) can effectively transfer
complex reasoning abilities to smaller, efficient student models. However,
standard practices employ rejection sampling, discarding incorrect reasoning
examples -- valuable, yet often underutilized data. This paper addresses the
critical question: How can both positive and negative distilled reasoning
traces be effectively leveraged to maximize LLM reasoning performance in an
offline setting? To this end, we propose Reinforcement Distillation (REDI), a
two-stage framework. Stage 1 learns from positive traces via Supervised
Fine-Tuning (SFT). Stage 2 further refines the model using both positive and
negative traces through our proposed REDI objective. This novel objective is a
simple, reference-free loss function that outperforms established methods like
DPO and SimPO in this distillation context. Our empirical evaluations
demonstrate REDI's superiority over the Rejection Sampling SFT baseline and
over SFT combined with DPO/SimPO on mathematical reasoning tasks. Notably, the
Qwen-REDI-1.5B model, post-trained on just 131k positive and negative examples
from the open Open-R1 dataset, achieves an 83.1% score on MATH-500 (pass@1).
Its performance matches or surpasses that of DeepSeek-R1-Distill-Qwen-1.5B (a
model post-trained on 800k proprietary examples) across various mathematical
reasoning benchmarks, establishing a new state-of-the-art for 1.5B models
post-trained offline with openly available data.
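
The abstract leaves the exact form of the Stage-2 objective to the paper body. As a rough, non-authoritative illustration of what a reference-free loss over paired positive and negative distilled traces can look like, the PyTorch-style sketch below maximizes the likelihood of the correct trace while applying a down-weighted penalty to the incorrect one, with no frozen reference model involved. The function names (redi_style_loss, sequence_logprob) and the weighting alpha are assumptions for illustration only, not the paper's definitions.

    # Minimal sketch (NOT the paper's exact REDI objective): a reference-free
    # Stage-2 loss over one (positive, negative) pair of distilled traces.
    import torch
    import torch.nn.functional as F

    def sequence_logprob(logits: torch.Tensor, labels: torch.Tensor,
                         mask: torch.Tensor) -> torch.Tensor:
        """Mean per-token log-probability of `labels` under `logits`.

        logits: (batch, seq_len, vocab); labels, mask: (batch, seq_len).
        """
        mask = mask.float()
        logp = F.log_softmax(logits, dim=-1)
        token_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
        return (token_logp * mask).sum(-1) / mask.sum(-1).clamp(min=1.0)

    def redi_style_loss(pos_logits, pos_labels, pos_mask,
                        neg_logits, neg_labels, neg_mask,
                        alpha: float = 0.5) -> torch.Tensor:
        """Hypothetical reference-free pairwise loss.

        Rewards the correct trace and applies a down-weighted (alpha < 1)
        penalty to the incorrect trace; no reference policy is used.
        """
        logp_pos = sequence_logprob(pos_logits, pos_labels, pos_mask)
        logp_neg = sequence_logprob(neg_logits, neg_labels, neg_mask)
        return (-logp_pos + alpha * logp_neg).mean()

In this assumed form, Stage 1 corresponds to standard SFT on positive traces only, and Stage 2 continues training with the pairwise term above so that incorrect reasoning traces contribute a learning signal rather than being discarded by rejection sampling.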