

Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed

December 16, 2025
Authors: Yonggan Fu, Lexington Whalen, Zhifan Ye, Xin Dong, Shizhe Diao, Jingyu Liu, Chengyue Wu, Hao Zhang, Enze Xie, Song Han, Maksim Khadkevich, Jan Kautz, Yingyan Celine Lin, Pavlo Molchanov
cs.AI

Abstract

Diffusion language models (dLMs) have emerged as a promising paradigm that enables parallel, non-autoregressive generation, but their learning efficiency lags behind that of autoregressive (AR) language models when trained from scratch. To this end, we study AR-to-dLM conversion to transform pretrained AR models into efficient dLMs that excel in speed while preserving AR models' task accuracy. We achieve this by identifying limitations in the attention patterns and objectives of existing AR-to-dLM methods and then proposing principles and methodologies for more effective AR-to-dLM conversion. Specifically, we first systematically compare different attention patterns and find that maintaining pretrained AR weight distributions is critical for effective AR-to-dLM conversion. As such, we introduce a continuous pretraining scheme with a block-wise attention pattern, which remains causal across blocks while enabling bidirectional modeling within each block. We find that this approach can better preserve pretrained AR models' weight distributions than fully bidirectional modeling, in addition to its known benefit of enabling KV caching, and leads to a win-win in accuracy and efficiency. Second, to mitigate the training-test gap in mask token distributions (uniform vs. highly left-to-right), we propose a position-dependent token masking strategy that assigns higher masking probabilities to later tokens during training to better mimic test-time behavior. Leveraging this framework, we conduct extensive studies of dLMs' attention patterns, training dynamics, and other design choices, providing actionable insights into scalable AR-to-dLM conversion. These studies lead to the Efficient-DLM family, which outperforms state-of-the-art AR models and dLMs, e.g., our Efficient-DLM 8B achieves +5.4%/+2.7% higher accuracy with 4.5x/2.7x higher throughput compared to Dream 7B and Qwen3 4B, respectively.
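The abstract describes attention that is causal across blocks but bidirectional within each block. The snippet below is a minimal sketch of such a mask, not the authors' code; the PyTorch boolean-mask convention and the `block_size` parameter name are illustrative assumptions.

```python
# Sketch of a block-wise attention pattern: causal across blocks,
# fully bidirectional within each block.
import torch

def block_wise_attention_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Return a boolean mask where mask[i, j] is True iff position i may attend to j."""
    positions = torch.arange(seq_len)
    block_ids = positions // block_size  # block index of every position
    # A query in block b may attend to any key in blocks <= b:
    # bidirectional inside its own block, causal toward earlier blocks.
    return block_ids.unsqueeze(1) >= block_ids.unsqueeze(0)

# Example: 8 tokens, block size 4 -> two 4x4 bidirectional blocks, with the
# second block also attending to the entire first block (enabling KV caching
# for completed blocks).
print(block_wise_attention_mask(8, 4).int())
```

The abstract also proposes position-dependent token masking that assigns higher masking probabilities to later tokens. Below is a hedged sketch of one such schedule; the linear ramp and the `p_min`, `p_max`, and `mask_token_id` names are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch of position-dependent masking: later positions are masked with higher
# probability so training better mimics the left-to-right reveal order at test time.
import torch

def position_dependent_mask(tokens: torch.Tensor,
                            mask_token_id: int,
                            p_min: float = 0.1,
                            p_max: float = 0.9) -> torch.Tensor:
    """Mask each position independently with a probability that grows with its index."""
    seq_len = tokens.shape[-1]
    # Masking probability ramps linearly from p_min (first token) to p_max (last token).
    probs = torch.linspace(p_min, p_max, seq_len)
    mask = torch.rand(tokens.shape) < probs  # broadcasts over the batch dimension
    return torch.where(mask, torch.full_like(tokens, mask_token_id), tokens)

# Usage with a dummy batch and a hypothetical mask-token id.
batch = torch.randint(0, 32000, (2, 16))
masked = position_dependent_mask(batch, mask_token_id=32001)
```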