Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation
June 4, 2026
Authors: Xingyu Su, Jacob Helwig, Shubham Parashar, Atharv Chagi, Lakshmi Jotsna, Degui Zhi, James Caverlee, Dileep Kalathil, Shuiwang Ji
cs.AI
Abstract
We study the transformation of autoregressive language models (ARLMs) into diffusion language models (DLMs). Rather than pretraining from scratch, prior work replaces the causal attention in an ARLM with bidirectional attention and then trains the resulting model with a DLM objective. However, these approaches incur two distribution shifts. First, switching from a next-token prediction objective to a DLM objective can discard knowledge the ARLM acquired during training. Second, standard DLMs suffer from a train-inference mismatch: the training loss is defined on randomly masked sequences rather than on the trajectories that confidence-based decoding produces at inference. To address both challenges, we introduce the On-Policy Diffusion Language Model (OPDLM), which uses On-Policy Distillation (OPD) for ARLM-to-DLM transformation. Specifically, OPDLM is trained via self-OPD: the student, an ARLM with bidirectional attention, generates its own trajectories, and the teacher, the original frozen ARLM, distills its knowledge by providing target logits on these trajectories. Because training is on-policy, OPDLM eliminates the train-inference mismatch in DLMs, and distilling from the original model improves retention of the ARLM's knowledge. Empirical results show that OPDLM achieves strong performance across a wide variety of tasks while requiring 15x to 7,000x fewer training tokens. OPDLM avoids the prohibitive cost of DLM pretraining and positions DLM transformation as a form of ARLM post-training.
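The self-OPD loop described in the abstract can be illustrated with a toy, gradient-free sketch. It is not the authors' implementation: the lookup-table "models", the `MASK` id, the one-token-per-step unmasking rule, the choice of reverse KL as the distillation loss, and the way the teacher is queried on the completed sequence are all assumptions made for illustration. The sketch shows only the data flow: the student decodes by confidence, and the frozen causal teacher supplies target distributions on the states the student actually visited.

```python
import math
import random

MASK = -1    # hypothetical mask token id
VOCAB = 5    # toy vocabulary size
LENGTH = 6   # toy sequence length


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_table(seed):
    # One row of logits per context token; row 0 stands for MASK / no token.
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(VOCAB)] for _ in range(VOCAB + 1)]


LEFT = toy_table(0)   # shared: the student is initialised from the teacher
RIGHT = toy_table(1)  # extra context the bidirectional student can use


def teacher_logits(seq, i):
    """Toy causal teacher: position i sees only the token to its left."""
    left = seq[i - 1] if i > 0 else MASK
    return LEFT[left + 1]


def student_logits(seq, i):
    """Toy bidirectional student: position i sees both neighbours."""
    left = seq[i - 1] if i > 0 else MASK
    right = seq[i + 1] if i + 1 < len(seq) else MASK
    return [a + b for a, b in zip(LEFT[left + 1], RIGHT[right + 1])]


def confidence_decode(length):
    """Unmask one position per step, always the most confident one."""
    seq = [MASK] * length
    trajectory = []  # (state before the step, position filled, token chosen)
    while MASK in seq:
        best = None
        for i, tok in enumerate(seq):
            if tok != MASK:
                continue
            probs = softmax(student_logits(seq, i))
            token = max(range(VOCAB), key=probs.__getitem__)
            if best is None or probs[token] > best[0]:
                best = (probs[token], i, token)
        _, pos, token = best
        trajectory.append((list(seq), pos, token))
        seq[pos] = token
    return seq, trajectory


def on_policy_distill_loss(final_seq, trajectory):
    """Mean KL(student || teacher) over the states the student visited.

    Assumption: the teacher is evaluated on the completed sequence, so its
    causal prefix at each position contains no masks.
    """
    total = 0.0
    for state, pos, _ in trajectory:
        p_student = softmax(student_logits(state, pos))
        p_teacher = softmax(teacher_logits(final_seq, pos))
        total += sum(p * math.log(p / q) for p, q in zip(p_student, p_teacher))
    return total / len(trajectory)


final_seq, trajectory = confidence_decode(LENGTH)
loss = on_policy_distill_loss(final_seq, trajectory)
print("decoding order:", [pos for _, pos, _ in trajectory])
print("distillation loss:", loss)
```

In real training, the loss would be backpropagated through the student only, with the teacher kept frozen. The contrast with a standard DLM objective lies in where the training states come from: here they are the student's own confidence-ordered decoding states rather than randomly masked copies of a reference sequence.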