

Diffusion Language Models Know the Answer Before Decoding

August 27, 2025
Authors: Pengxiang Li, Yefan Zhou, Dilxat Muhtar, Lu Yin, Shilin Yan, Li Shen, Yi Liang, Soroush Vosoughi, Shiwei Liu
cs.AI

Abstract

Diffusion language models (DLMs) have recently emerged as an alternative to autoregressive approaches, offering parallel sequence generation and flexible token orders. However, their inference remains slower than that of autoregressive models, primarily due to the cost of bidirectional attention and the large number of refinement steps required for high-quality outputs. In this work, we highlight and leverage an overlooked property of DLMs, early answer convergence: in many cases, the correct answer can be internally identified by the halfway point of refinement, well before the final decoding step, under both semi-autoregressive and random remasking schedules. For example, on GSM8K and MMLU, up to 97% and 99% of instances, respectively, can be decoded correctly using only half of the refinement steps. Building on this observation, we introduce Prophet, a training-free fast decoding paradigm that enables early-commit decoding. Specifically, Prophet dynamically decides whether to continue refinement or to go "all-in" (i.e., decode all remaining tokens in one step), using the confidence gap between the top-2 prediction candidates as the criterion. It integrates seamlessly into existing DLM implementations, incurs negligible overhead, and requires no additional training. Empirical evaluations of LLaDA-8B and Dream-7B across multiple tasks show that Prophet reduces the number of decoding steps by up to 3.4x while preserving high generation quality. These results recast DLM decoding as a problem of when to stop sampling, and demonstrate that early decode convergence provides a simple yet powerful mechanism for accelerating DLM inference, complementary to existing speedup techniques. Our code is publicly available at https://github.com/pixeli99/Prophet.
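
To make the early-commit criterion concrete, below is a minimal PyTorch sketch of the idea described in the abstract: at each refinement step, measure the gap between the top-2 token probabilities at every still-masked position, and if even the least confident position shows a clear winner, decode all remaining tokens in one step. Function and variable names and the threshold `tau` are illustrative assumptions, not taken from the official implementation (see https://github.com/pixeli99/Prophet for the authors' code).

```python
# Sketch of Prophet-style early-commit decoding for a masked diffusion LM.
# Assumes `model(tokens)` returns per-position logits of shape (seq_len, vocab)
# and `tokens` is a 1-D LongTensor containing `mask_id` at undecoded positions.
import torch

@torch.no_grad()
def prophet_decode(model, tokens, mask_id, num_steps, tau=0.5):
    for _ in range(num_steps):
        masked = tokens == mask_id
        if not masked.any():
            break
        probs = model(tokens).softmax(dim=-1)      # (seq_len, vocab)
        top2 = probs.topk(2, dim=-1).values        # (seq_len, 2)
        gap = top2[..., 0] - top2[..., 1]          # top-1 minus top-2 probability
        # Early commit: if every masked position has a confident winner,
        # "go all-in" and decode all remaining tokens in a single step.
        if gap[masked].min() > tau:
            tokens[masked] = probs.argmax(dim=-1)[masked]
            break
        # Otherwise, unmask only the most confident masked position and
        # keep refining (a simple confidence-based schedule for illustration).
        conf = gap.masked_fill(~masked, -1.0)
        pos = conf.argmax()
        tokens[pos] = probs[pos].argmax()
    return tokens
```

In this sketch, the commit test costs one `topk` over logits already computed for refinement, which matches the abstract's claim that the criterion adds negligible overhead; the paper's actual decision rule and threshold schedule may differ.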