
Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning

January 14, 2026
Authors: Shaotian Yan, Kaiyuan Liu, Chen Shen, Bing Wang, Sinan Fan, Jun Zhang, Yue Wu, Zheng Wang, Jieping Ye
cs.AI

Abstract

In this report, we introduce DASD-4B-Thinking, a lightweight yet highly capable, fully open-source reasoning model. It achieves SOTA performance among open-source models of comparable scale across challenging benchmarks in mathematics, scientific reasoning, and code generation -- even outperforming several larger models. We begin by critically reexamining a widely adopted distillation paradigm in the community: SFT on teacher-generated responses, also known as sequence-level distillation. Although a series of recent works following this scheme have demonstrated remarkable efficiency and strong empirical performance, they are primarily grounded in the SFT perspective. Consequently, these approaches focus predominantly on designing heuristic rules for SFT data filtering, while largely overlooking the core principle of distillation itself -- enabling the student model to learn the teacher's full output distribution so as to inherit its generalization capability. Specifically, we identify three critical limitations in current practice: i) Inadequate representation of the teacher's sequence-level distribution; ii) Misalignment between the teacher's output distribution and the student's learning capacity; and iii) Exposure bias arising from teacher-forced training versus autoregressive inference. In summary, these shortcomings reflect a systemic absence of explicit teacher-student interaction throughout the distillation process, leaving the essence of distillation underexploited. To address these issues, we propose several methodological innovations that collectively form an enhanced sequence-level distillation training pipeline. Remarkably, DASD-4B-Thinking obtains competitive results using only 448K training samples -- an order of magnitude fewer than those employed by most existing open-source efforts. To support community research, we publicly release our models and the training dataset.
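For readers unfamiliar with the terminology, the contrast at the heart of the abstract can be sketched in standard knowledge-distillation notation (an illustrative formulation, not the paper's own): sequence-level distillation fits the student $p_\theta$ by cross-entropy on sequences $\hat{y}$ sampled from the teacher $p_T$, whereas the distillation ideal the abstract appeals to is matching the teacher's full output distribution.

\[
\mathcal{L}_{\mathrm{seq\text{-}KD}}(\theta) \;=\; -\,\mathbb{E}_{x}\,\mathbb{E}_{\hat{y}\sim p_T(\cdot\mid x)}\sum_{t=1}^{|\hat{y}|}\log p_\theta\!\left(\hat{y}_t \mid x,\hat{y}_{<t}\right)
\qquad\text{vs.}\qquad
\mathcal{L}_{\mathrm{KD}}(\theta) \;=\; \mathbb{E}_{x}\,\mathrm{KL}\!\left(p_T(\cdot\mid x)\,\middle\|\,p_\theta(\cdot\mid x)\right).
\]

With only a handful of sampled sequences per prompt, the left-hand objective is a coarse Monte-Carlo stand-in for the teacher's sequence-level distribution (limitation i), and because training always conditions on teacher-written prefixes $\hat{y}_{<t}$ while inference conditions on the student's own generations, the mismatch described in limitation iii (exposure bias) follows directly.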