

Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning

January 14, 2026
Authors: Shaotian Yan, Kaiyuan Liu, Chen Shen, Bing Wang, Sinan Fan, Jun Zhang, Yue Wu, Zheng Wang, Jieping Ye
cs.AI

Abstract

In this report, we introduce DASD-4B-Thinking, a lightweight yet highly capable, fully open-source reasoning model. It achieves SOTA performance among open-source models of comparable scale across challenging benchmarks in mathematics, scientific reasoning, and code generation -- even outperforming several larger models. We begin by critically reexamining a widely adopted distillation paradigm in the community: SFT on teacher-generated responses, also known as sequence-level distillation. Although a series of recent works following this scheme have demonstrated remarkable efficiency and strong empirical performance, they are primarily grounded in the SFT perspective. Consequently, these approaches focus predominantly on designing heuristic rules for SFT data filtering, while largely overlooking the core principle of distillation itself -- enabling the student model to learn the teacher's full output distribution so as to inherit its generalization capability. Specifically, we identify three critical limitations in current practice: i) Inadequate representation of the teacher's sequence-level distribution; ii) Misalignment between the teacher's output distribution and the student's learning capacity; and iii) Exposure bias arising from teacher-forced training versus autoregressive inference. In summary, these shortcomings reflect a systemic absence of explicit teacher-student interaction throughout the distillation process, leaving the essence of distillation underexploited. To address these issues, we propose several methodological innovations that collectively form an enhanced sequence-level distillation training pipeline. Remarkably, DASD-4B-Thinking obtains competitive results using only 448K training samples -- an order of magnitude fewer than those employed by most existing open-source efforts. To support community research, we publicly release our models and the training dataset.
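To make the critiqued baseline concrete, below is a minimal sketch of vanilla sequence-level distillation as described in the abstract: the teacher generates a long-CoT response, and the student is fine-tuned on it with teacher-forced cross-entropy. This is not the paper's enhanced DASD pipeline; the model names, prompt, and decoding settings are placeholders, and the snippet only illustrates the standard SFT-on-teacher-outputs scheme and the train/inference mismatch (exposure bias) the authors point out.

```python
# Minimal sketch of vanilla sequence-level distillation (the baseline the
# abstract critiques), not the paper's proposed pipeline.
# Model names, the prompt, and decoding hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "teacher-model"   # placeholder: a large reasoning model
student_name = "student-model"   # placeholder: a ~4B student model

teacher_tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name, torch_dtype=torch.bfloat16)
student_tok = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name, torch_dtype=torch.bfloat16)

prompt = "Q: If 3x + 5 = 20, what is x?\nA:"

# 1) The teacher produces a long-CoT response by autoregressive sampling.
with torch.no_grad():
    prompt_ids = teacher_tok(prompt, return_tensors="pt").input_ids
    gen_ids = teacher.generate(
        prompt_ids, max_new_tokens=512, do_sample=True, temperature=0.7
    )
teacher_response = teacher_tok.decode(
    gen_ids[0, prompt_ids.shape[1]:], skip_special_tokens=True
)

# 2) The student is trained with teacher forcing: next-token cross-entropy on
#    (prompt + teacher response), with the prompt tokens masked out of the loss.
full_ids = student_tok(prompt + teacher_response, return_tensors="pt").input_ids
prompt_len = student_tok(prompt, return_tensors="pt").input_ids.shape[1]
labels = full_ids.clone()
labels[:, :prompt_len] = -100            # ignore loss on the prompt tokens
loss = student(input_ids=full_ids, labels=labels).loss
loss.backward()                          # one SFT step on the distilled sample

# 3) At test time the student decodes autoregressively, conditioning on its own
#    previous tokens rather than the teacher's -- the exposure bias the abstract
#    identifies as limitation (iii).
```

Note that the student here only ever sees the single sampled teacher sequence, not the teacher's full output distribution, which is precisely the gap (limitations i and ii) the paper's distribution-aligned pipeline targets.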