Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding
May 28, 2025
Authors: Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, Enze Xie
cs.AI
Abstract
Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. However, the practical inference speed of open-sourced Diffusion LLMs often lags behind autoregressive models due to the lack of Key-Value (KV) Cache and quality degradation when decoding multiple tokens simultaneously. To bridge this gap, we introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop. Additionally, we identify the root cause of generation quality degradation in parallel decoding as the disruption of token dependencies under the conditional independence assumption. To address this, we propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality. Experimental results on LLaDA and Dream models across multiple LLM benchmarks demonstrate up to 27.6× throughput improvement with minimal accuracy loss, closing the performance gap with autoregressive models and paving the way for practical deployment of Diffusion LLMs.
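
As a rough illustration of the confidence-aware selection rule described in the abstract, below is a minimal PyTorch sketch. The `model_logits` stand-in, `MASK_ID`, `CONF_THRESHOLD`, and the `slice`-based block interface are illustrative assumptions rather than the paper's implementation, and the block-wise approximate KV Cache is omitted for brevity.

```python
# Minimal sketch of confidence-aware parallel decoding for a masked diffusion LLM.
# The model interface (`model_logits`), mask id, and threshold are illustrative
# assumptions, not the authors' implementation; the approximate KV Cache is omitted.
import torch

MASK_ID = 0           # hypothetical mask-token id
CONF_THRESHOLD = 0.9  # confidence threshold for committing tokens


def model_logits(tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for a bidirectional diffusion LLM forward pass.

    Returns per-position logits of shape (seq_len, vocab_size).
    """
    vocab_size = 32
    return torch.randn(tokens.shape[0], vocab_size)


def parallel_decode_block(tokens: torch.Tensor, block: slice) -> torch.Tensor:
    """Iteratively fill the masked positions inside `block`.

    Each step predicts all masked positions in parallel, but only commits
    tokens whose max probability exceeds CONF_THRESHOLD; the rest stay masked
    for the next step, which mitigates errors introduced by the conditional
    independence assumption.
    """
    tokens = tokens.clone()
    while (tokens[block] == MASK_ID).any():
        logits = model_logits(tokens)        # one forward pass per refinement step
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)

        masked = tokens == MASK_ID
        masked[: block.start] = False        # only decode inside the current block
        masked[block.stop:] = False

        confident = masked & (conf > CONF_THRESHOLD)
        if not confident.any():
            # Always make progress: fall back to the single most confident masked token.
            idx = torch.where(masked)[0]
            confident[idx[conf[idx].argmax()]] = True

        tokens[confident] = pred[confident]  # commit confident tokens in parallel
    return tokens


# Example: a 16-token prompt followed by a block of 8 masked positions.
prompt = torch.randint(1, 32, (16,))
seq = torch.cat([prompt, torch.full((8,), MASK_ID)])
out = parallel_decode_block(seq, slice(16, 24))
```

The key design choice shown here is that the threshold trades parallelism for fidelity: a higher `CONF_THRESHOLD` commits fewer tokens per step but keeps the decoded tokens closer to the sequential distribution, while the fallback branch guarantees at least one token is committed per step so decoding always terminates.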