Diffusion Language Models Know the Answer Before Decoding
August 27, 2025
Authors: Pengxiang Li, Yefan Zhou, Dilxat Muhtar, Lu Yin, Shilin Yan, Li Shen, Yi Liang, Soroush Vosoughi, Shiwei Liu
cs.AI
Abstract
Diffusion language models (DLMs) have recently emerged as an alternative to
autoregressive approaches, offering parallel sequence generation and flexible
token orders. However, their inference remains slower than that of
autoregressive models, primarily due to the cost of bidirectional attention and
the large number of refinement steps required for high-quality outputs. In this
work, we highlight and leverage an overlooked property of DLMs, early answer
convergence: in many cases, the correct answer can be internally identified at
intermediate steps, well before the final decoding step, under both
semi-autoregressive and random remasking schedules. For example, on GSM8K and MMLU, up to 97% and 99%
of instances, respectively, can be decoded correctly using only half of the
refinement steps. Building on this observation, we introduce Prophet, a
training-free fast decoding paradigm that enables early-commit decoding.
Specifically, Prophet dynamically decides whether to continue refinement or to
go "all-in" (i.e., decode all remaining tokens in one step), using the
confidence gap between the top-2 prediction candidates as the criterion. It
integrates seamlessly into existing DLM implementations, incurs negligible
overhead, and requires no additional training. Empirical evaluations of
LLaDA-8B and Dream-7B across multiple tasks show that Prophet reduces the
number of decoding steps by up to 3.4x while preserving high generation
quality. These results recast DLM decoding as a problem of when to stop
sampling, and demonstrate that early answer convergence provides a simple yet
powerful mechanism for accelerating DLM inference, complementary to existing
speedup techniques. Our code is publicly available at
https://github.com/pixeli99/Prophet.
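To make the decision rule in the abstract concrete, here is a minimal pure-Python sketch of an early-commit test based on the top-1 vs. top-2 probability gap. All names (`should_commit`, `all_in`), the threshold value, and the data layout are illustrative assumptions, not the official Prophet implementation; see the linked repository for the real code.

```python
import math

def softmax(row):
    """Numerically stable softmax over one position's logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def should_commit(logits, masked_positions, gap_threshold=0.9):
    """Hypothetical early-commit criterion in the spirit of Prophet.

    logits: list of per-position logit rows (one row per sequence position)
    masked_positions: indices of tokens still masked at this step
    Returns True only if every remaining masked position has a large
    gap between its top-1 and top-2 predicted probabilities.
    """
    for pos in masked_positions:
        probs = sorted(softmax(logits[pos]), reverse=True)
        if probs[0] - probs[1] < gap_threshold:
            return False  # still uncertain somewhere: keep refining
    return True  # all remaining positions are confident: go "all-in"

def all_in(logits, tokens, masked_positions):
    """One-step decode: fill every remaining masked token with its argmax."""
    out = list(tokens)
    for pos in masked_positions:
        row = logits[pos]
        out[pos] = max(range(len(row)), key=row.__getitem__)
    return out
```

In an actual DLM decoding loop, a check like this would run after each refinement step; once it fires, the remaining masked tokens are filled in a single argmax pass instead of continuing iterative refinement, which is where the step-count savings come from.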