Not All Denoising Steps Are Equal: Model Scheduling for Faster Masked Diffusion Language Models
April 11, 2026
Authors: Ivan Sedykh, Nikita Sorokin, Valentin Malykh
cs.AI
Abstract
Recent advances in masked diffusion language models (MDLMs) narrow the quality gap to autoregressive LMs, but their sampling remains expensive because generation requires many full-sequence denoising passes with a large Transformer and, unlike autoregressive decoding, cannot benefit from KV caching. In this work, we exploit the flexibility of the diffusion framework and study model scheduling, where a smaller MDLM replaces the full model at a subset of denoising steps. Across models trained on OpenWebText and LM1B, we show that early and late denoising steps are substantially more robust to such replacement than middle steps, enabling up to a 17% reduction in FLOPs with only modest degradation in generative perplexity under both unconditional and prefix-conditional generation, while preserving sample diversity. We support these findings with a step-importance analysis based on loss and KL divergence between small and large models across timesteps, as well as an exhaustive search over coarse step segments, both of which consistently identify the middle of the diffusion trajectory as the most sensitive region across datasets. Our results suggest that simple, architecture-agnostic scheduling rules can significantly accelerate MDLM sampling while largely preserving generation quality.
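The scheduling idea described above can be sketched as a simple dispatch rule over the denoising loop: the small MDLM handles the robust early and late segments of the trajectory, while the large model keeps the sensitive middle. This is a minimal illustration, not the paper's actual implementation; the function names, segment fractions, and model interface are all hypothetical, chosen so that roughly 17% of the passes use the small model.

```python
def build_schedule(num_steps, small_frac_early=0.10, small_frac_late=0.07):
    """Return a 'small'/'large' label per denoising step.

    Early and late segments are delegated to the small model; the middle
    of the trajectory (most sensitive, per the paper's analysis) stays
    with the large model. The fractions here are illustrative only.
    """
    n_early = int(num_steps * small_frac_early)
    n_late = int(num_steps * small_frac_late)
    return [
        "small" if t < n_early or t >= num_steps - n_late else "large"
        for t in range(num_steps)
    ]


def sample(x_masked, small_model, large_model, num_steps):
    """Run the reverse diffusion, dispatching each full-sequence
    denoising pass to the scheduled model. Each model is assumed to be
    a callable taking (sequence, timestep) and returning a sequence."""
    schedule = build_schedule(num_steps)
    x = x_masked
    for t, which in enumerate(schedule):
        model = small_model if which == "small" else large_model
        x = model(x, t)  # one full-sequence denoising pass
    return x
```

Because the rule depends only on the step index, it is architecture-agnostic: any pair of compatible MDLMs can be slotted in without retraining, and the FLOP savings follow directly from the fraction of steps routed to the small model.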