dKV-Cache: The Cache for Diffusion Language Models
May 21, 2025
Authors: Xinyin Ma, Runpeng Yu, Gongfan Fang, Xinchao Wang
cs.AI
Abstract
Diffusion Language Models (DLMs) have been seen as a promising competitor to
autoregressive language models. However, diffusion language models have long
been constrained by slow inference. A core challenge is that their
non-autoregressive architecture and bidirectional attention preclude the
key-value cache that accelerates decoding. We address this bottleneck by
proposing a KV-cache-like mechanism, delayed KV-Cache, for the denoising
process of DLMs. Our approach is motivated by the observation that different
tokens have distinct representation dynamics throughout the diffusion process.
Accordingly, we propose a delayed and conditioned caching strategy for key and
value states. We design two complementary variants to cache key and value
step-by-step: (1) dKV-Cache-Decode, which provides almost lossless
acceleration, and even improves performance on long sequences, suggesting that
existing DLMs may under-utilise contextual information during inference. (2)
dKV-Cache-Greedy, which applies aggressive caching with a reduced cache lifespan, achieving
higher speed-ups with quadratic time complexity at the cost of some performance
degradation. Overall, dKV-Cache achieves a 2-10x speedup in inference, largely
narrowing the gap between autoregressive models and DLMs. We evaluate dKV-Cache on
several benchmarks, delivering acceleration across general language
understanding, mathematical, and code-generation benchmarks. Experiments
demonstrate that caching can also be used in DLMs, even applied to current DLMs
in a training-free manner.
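To make the delayed-caching idea concrete, below is a minimal, hypothetical Python sketch of how a dKV-Cache-Decode-style denoising loop could look. It is not the authors' implementation: the model interface (`model(x, kv_cache=...)` returning per-position key and value states), the confidence-based unmasking rule, and the one-step delay before caching are assumptions made for illustration, based only on the abstract's description of a delayed, conditioned cache for key/value states of already-decoded tokens.

```python
# Conceptual sketch of a delayed KV-cache in a masked-diffusion decoding loop.
# NOT the paper's code: the model API, the confidence-based unmasking rule, and
# the exact one-step caching delay are illustrative assumptions.

import torch


@torch.no_grad()
def denoise_with_delayed_kv_cache(model, x, mask_id, num_steps, tokens_per_step):
    """x: (batch, seq_len) token ids; generation positions start as `mask_id`."""
    batch, seq_len = x.shape
    kv_cache = {}  # (batch_idx, position) -> (key, value); reused by `model` instead of recomputing
    newly_decoded = torch.zeros(batch, seq_len, dtype=torch.bool, device=x.device)

    for step in range(num_steps):
        # Delayed caching: positions decoded in the *previous* step are cached now,
        # after their representations have been recomputed once with the real token in place.
        cacheable = newly_decoded.clone()

        # Hypothetical model interface: returns logits plus per-position key/value
        # states, and looks up `kv_cache` for positions that are already cached.
        logits, keys, values = model(x, kv_cache=kv_cache)

        for b, p in cacheable.nonzero(as_tuple=False).tolist():
            kv_cache[(b, p)] = (keys[b, p].clone(), values[b, p].clone())

        # Standard masked-diffusion decoding step: commit the most confident masked tokens.
        still_masked = x == mask_id
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        conf = conf.masked_fill(~still_masked, float("-inf"))
        topk = conf.topk(tokens_per_step, dim=-1).indices  # (batch, tokens_per_step)

        newly_decoded = torch.zeros_like(newly_decoded)
        newly_decoded.scatter_(1, topk, True)
        x = torch.where(newly_decoded, pred, x)

    return x
```

The delay is the key design choice suggested by the abstract: a token's representation changes most at the step it is decoded, so caching its key/value states one step later trades a small amount of recomputation for more stable cached states.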