

Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed

December 16, 2025
Authors: Yonggan Fu, Lexington Whalen, Zhifan Ye, Xin Dong, Shizhe Diao, Jingyu Liu, Chengyue Wu, Hao Zhang, Enze Xie, Song Han, Maksim Khadkevich, Jan Kautz, Yingyan Celine Lin, Pavlo Molchanov
cs.AI

Abstract

Diffusion language models (dLMs) have emerged as a promising paradigm that enables parallel, non-autoregressive generation, but their learning efficiency lags behind that of autoregressive (AR) language models when trained from scratch. To this end, we study AR-to-dLM conversion to transform pretrained AR models into efficient dLMs that excel in speed while preserving AR models' task accuracy. We achieve this by identifying limitations in the attention patterns and objectives of existing AR-to-dLM methods and then proposing principles and methodologies for more effective AR-to-dLM conversion. Specifically, we first systematically compare different attention patterns and find that maintaining pretrained AR weight distributions is critical for effective AR-to-dLM conversion. As such, we introduce a continuous pretraining scheme with a block-wise attention pattern, which remains causal across blocks while enabling bidirectional modeling within each block. We find that this approach can better preserve pretrained AR models' weight distributions than fully bidirectional modeling, in addition to its known benefit of enabling KV caching, and leads to a win-win in accuracy and efficiency. Second, to mitigate the training-test gap in mask token distributions (uniform vs. highly left-to-right), we propose a position-dependent token masking strategy that assigns higher masking probabilities to later tokens during training to better mimic test-time behavior. Leveraging this framework, we conduct extensive studies of dLMs' attention patterns, training dynamics, and other design choices, providing actionable insights into scalable AR-to-dLM conversion. These studies lead to the Efficient-DLM family, which outperforms state-of-the-art AR models and dLMs, e.g., our Efficient-DLM 8B achieves +5.4%/+2.7% higher accuracy with 4.5x/2.7x higher throughput compared to Dream 7B and Qwen3 4B, respectively.
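To make the block-wise attention pattern concrete, the following is a minimal sketch (not the authors' code) of a mask that is causal across blocks but bidirectional within each block; seq_len and block_size are illustrative parameters, and True marks key positions a query token may attend to.

    import torch

    def block_wise_attention_mask(seq_len: int, block_size: int) -> torch.Tensor:
        # Block index of every position, e.g. block_size=4 -> [0,0,0,0,1,1,1,1,...]
        block_ids = torch.arange(seq_len) // block_size
        # A query in block i may attend to any key in blocks 0..i (causal across blocks),
        # including every key in its own block (bidirectional within the block).
        mask = block_ids.unsqueeze(1) >= block_ids.unsqueeze(0)
        return mask  # shape (seq_len, seq_len), dtype bool

    # Example: 8 tokens, blocks of 4 -> tokens 0-3 attend within their block;
    # tokens 4-7 attend to all 8 positions.
    print(block_wise_attention_mask(8, 4).int())

Because earlier blocks never attend to later ones, their keys and values are fixed once a block is finalized, which is what makes KV caching possible under this pattern.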
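The position-dependent masking strategy can likewise be sketched as below; the linear schedule and the slope parameter are assumptions for illustration, not the paper's exact formula. The idea is to keep the expected overall mask ratio at a sampled noise level t while giving later positions a higher per-token masking probability.

    import torch

    def position_dependent_mask(seq_len: int, t: float, slope: float = 0.5) -> torch.Tensor:
        # Normalized positions in [0, 1].
        pos = torch.arange(seq_len, dtype=torch.float32) / max(seq_len - 1, 1)
        # Linearly increasing weights with mean 1, so the average masking probability stays t.
        weights = 1.0 + slope * (pos - pos.mean())
        probs = (t * weights).clamp(0.0, 1.0)
        # True = token is replaced by the mask token during training.
        return torch.bernoulli(probs).bool()

    # Example: at noise level t=0.5, later tokens are masked more often than earlier ones,
    # mimicking the roughly left-to-right order in which tokens remain masked at test time.
    torch.manual_seed(0)
    print(position_dependent_mask(16, t=0.5))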