Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding
May 28, 2025
作者: Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, Enze Xie
cs.AI
Abstract
Diffusion-based large language models (Diffusion LLMs) have shown promise for
non-autoregressive text generation with parallel decoding capabilities.
However, the practical inference speed of open-sourced Diffusion LLMs often
lags behind autoregressive models due to the lack of Key-Value (KV) Cache and
quality degradation when decoding multiple tokens simultaneously. To bridge
this gap, we introduce a novel block-wise approximate KV Cache mechanism
tailored for bidirectional diffusion models, enabling cache reuse with
negligible performance drop. Additionally, we identify the root cause of
generation quality degradation in parallel decoding as the disruption of token
dependencies under the conditional independence assumption. To address this, we
propose a confidence-aware parallel decoding strategy that selectively decodes
tokens exceeding a confidence threshold, mitigating dependency violations and
maintaining generation quality. Experimental results on LLaDA and Dream models
across multiple LLM benchmarks demonstrate up to 27.6× throughput improvement
with minimal accuracy loss, closing the performance gap with autoregressive
models and paving the way for practical deployment of Diffusion LLMs.
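
The confidence-aware parallel decoding strategy can be sketched compactly: at each step the diffusion LLM scores every masked position in parallel, and only predictions whose confidence (maximum softmax probability) clears a threshold are committed, with a single-token fallback so decoding always makes progress. The snippet below is a minimal illustration under assumed interfaces, not the authors' implementation: the HuggingFace-style model(tokens).logits call, the mask_id sentinel, and the threshold value are placeholders, and the block-wise approximate KV Cache is omitted for brevity.

import torch
import torch.nn.functional as F

@torch.no_grad()
def confidence_aware_parallel_decode(model, tokens, mask_id, threshold=0.9, max_steps=64):
    # tokens: (1, seq_len) LongTensor with mask_id at positions that still need decoding.
    tokens = tokens.clone()
    for _ in range(max_steps):
        masked = tokens == mask_id
        if not masked.any():
            break                                  # every position has been decoded
        logits = model(tokens).logits              # (1, seq_len, vocab); full bidirectional pass
        probs = F.softmax(logits, dim=-1)
        conf, pred = probs.max(dim=-1)             # per-position confidence and greedy prediction

        conf = torch.where(masked, conf, torch.full_like(conf, -1.0))
        accept = masked & (conf > threshold)       # commit only high-confidence positions

        if not accept.any():                       # fallback: decode the single most confident token
            accept = torch.zeros_like(masked)
            accept.view(-1)[conf.argmax()] = True

        tokens[accept] = pred[accept]              # unmask accepted tokens in parallel
    return tokens

Thresholding in this way trades parallelism for fidelity: a higher threshold commits fewer tokens per step, but is less likely to violate dependencies between tokens that are predicted independently within a single step.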