Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding
May 28, 2025
作者: Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, Enze Xie
cs.AI
Abstract
Diffusion-based large language models (Diffusion LLMs) have shown promise for
non-autoregressive text generation with parallel decoding capabilities.
However, the practical inference speed of open-sourced Diffusion LLMs often
lags behind autoregressive models due to the lack of Key-Value (KV) Cache and
quality degradation when decoding multiple tokens simultaneously. To bridge
this gap, we introduce a novel block-wise approximate KV Cache mechanism
tailored for bidirectional diffusion models, enabling cache reuse with
negligible performance drop. Additionally, we identify the root cause of
generation quality degradation in parallel decoding as the disruption of token
dependencies under the conditional independence assumption. To address this, we
propose a confidence-aware parallel decoding strategy that selectively decodes
tokens exceeding a confidence threshold, mitigating dependency violations and
maintaining generation quality. Experimental results on LLaDA and Dream models
across multiple LLM benchmarks demonstrate up to 27.6× throughput improvement
with minimal accuracy loss, closing the performance gap with autoregressive
models and paving the way for practical deployment of Diffusion LLMs.
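
The confidence-aware parallel decoding strategy can be sketched compactly: at each step the diffusion LLM scores every masked position in parallel, and only predictions whose confidence (maximum softmax probability) clears a threshold are committed, with a single-token fallback so decoding always makes progress. The snippet below is a minimal illustration under assumed interfaces, not the authors' implementation: the HuggingFace-style model(tokens).logits call, the mask_id sentinel, and the threshold value are placeholders, and the block-wise approximate KV Cache is omitted for brevity.

import torch
import torch.nn.functional as F

@torch.no_grad()
def confidence_aware_parallel_decode(model, tokens, mask_id, threshold=0.9, max_steps=64):
    # tokens: (1, seq_len) LongTensor with mask_id at positions that still need decoding.
    tokens = tokens.clone()
    for _ in range(max_steps):
        masked = tokens == mask_id
        if not masked.any():
            break                                  # every position has been decoded
        logits = model(tokens).logits              # (1, seq_len, vocab); full bidirectional pass
        probs = F.softmax(logits, dim=-1)
        conf, pred = probs.max(dim=-1)             # per-position confidence and greedy prediction

        conf = torch.where(masked, conf, torch.full_like(conf, -1.0))
        accept = masked & (conf > threshold)       # commit only high-confidence positions

        if not accept.any():                       # fallback: decode the single most confident token
            accept = torch.zeros_like(masked)
            accept.view(-1)[conf.argmax()] = True

        tokens[accept] = pred[accept]              # unmask accepted tokens in parallel
    return tokens

Thresholding in this way trades parallelism for fidelity: a higher threshold commits fewer tokens per step, but is less likely to violate dependencies between tokens that are predicted independently within a single step.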