Learning from the Self Future: On-Policy Self-Distillation for dLLMs

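Abstract

On-policy self-distillation (OPSD) has been shown to be effective for post-training large language models (LLMs), but its application to diffusion language models (dLLMs) has yet to be explored. Existing OPSD methods are autoregressive by design: they inject privileged information through left-to-right prefix conditioning and token-level divergence supervision, which fundamentally conflicts with the arbitrary-order generation of dLLMs. We propose d-OPSD, the first OPSD framework designed specifically for dLLMs. The method makes two core contributions. First, it redefines self-teacher construction to use self-generated answers as suffix conditioning, so that the student model learns from "self future-experience" rather than from a privileged prefix. Second, it moves supervision from the token level to the step level, so that training matches the iterative denoising mechanism of dLLMs. Experiments on four reasoning benchmarks show that d-OPSD is consistently more sample-efficient than RLVR and SFT baselines, needing only about 10% of the optimization steps of RLVR, and opens a viable path for post-training dLLMs. The code is publicly available at https://github.com/xingzhejun/d-OPSD.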

English

On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric: they inject privileged information via left-to-right prefix conditioning with token-level divergence supervision, a design that fundamentally conflicts with the arbitrary-order generation of dLLMs. We introduce d-OPSD, the first OPSD framework tailored for dLLMs. Our approach makes two core contributions. First, we reframe self-teacher construction by using self-generated answers as suffix conditioning, enabling the student model to learn from "self future-experience" rather than privileged prefixes. Second, we shift supervision from token-level to step-level, aligning training with the iterative denoising process of dLLMs. Experiments across four reasoning benchmarks show that d-OPSD consistently outperforms RLVR and SFT baselines with superior sample efficiency, requiring only around 10% of the optimization steps required by RLVR and opening a promising pathway for dLLM post-training. The code is available at https://github.com/xingzhejun/d-OPSD.
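
The two ideas named in the abstract, suffix conditioning on a self-generated answer and step-level supervision, can be illustrated with a toy sketch. This is only one possible reading of the abstract, not the authors' implementation: the helper names (`build_student_input`, `build_teacher_input`, `step_level_loss`), the `[MASK]` placeholder, the choice of KL(teacher || student), and the averaging over positions revealed in one denoising step are all assumptions made for illustration. The released repository should be consulted for the actual method.

```python
import math

MASK = "[MASK]"  # hypothetical mask token, for illustration only


def build_student_input(prompt, response_len):
    """Student view: the prompt followed by a fully masked response."""
    return list(prompt) + [MASK] * response_len


def build_teacher_input(prompt, response_len, self_answer):
    """Self-teacher view (assumed): same masked response, with the model's
    own previously generated answer appended as a suffix."""
    return list(prompt) + [MASK] * response_len + list(self_answer)


def kl(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def step_level_loss(teacher_probs, student_probs, unmasked_positions):
    """Assumed step-level objective: average KL(teacher || student) over
    the positions revealed in a single denoising step."""
    if not unmasked_positions:
        return 0.0
    total = sum(kl(teacher_probs[i], student_probs[i]) for i in unmasked_positions)
    return total / len(unmasked_positions)


if __name__ == "__main__":
    prompt = ["Q", ":", "2+2"]
    print(build_student_input(prompt, 3))
    print(build_teacher_input(prompt, 3, ["4"]))
    # Toy two-word vocabulary; positions 3 and 4 are revealed in this step.
    teacher = {3: [0.9, 0.1], 4: [0.5, 0.5]}
    student = {3: [0.5, 0.5], 4: [0.5, 0.5]}
    print(step_level_loss(teacher, student, [3, 4]))
```

In this sketch the teacher and student share the same masked response region, so the only difference between them is the appended answer suffix. That is the sense in which the student would be learning from its own "future" output rather than from a privileged prefix.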