Learning from the Self-future: On-policy Self-distillation for dLLMs

June 16, 2026
Authors: Yifu Luo, Zeyu Chen, Haoyu Wang, Xinhao Hu, Yuxuan Zhang, Zhizhou Sha, Shiwei Liu
cs.AI

Abstract

On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric: they inject privileged information through left-to-right prefix conditioning with token-level divergence supervision, a design that fundamentally conflicts with the arbitrary-order generation of dLLMs. We introduce d-OPSD, the first OPSD framework tailored for dLLMs. Our approach makes two core contributions. First, we reframe self-teacher construction by using self-generated answers as suffix conditioning, enabling the student model to learn from "self future-experience" rather than from privileged prefixes. Second, we shift supervision from the token level to the step level, aligning training with the iterative denoising process of dLLMs. Experiments across four reasoning benchmarks show that d-OPSD consistently outperforms RLVR and SFT baselines with superior sample efficiency, reaching comparable performance with only around 10% of the optimization steps required by RLVR, and opens a promising pathway for dLLM post-training. The code is available at https://github.com/xingzhejun/d-OPSD.