dKV-Cache: The Cache for Diffusion Language Models
May 21, 2025
Authors: Xinyin Ma, Runpeng Yu, Gongfan Fang, Xinchao Wang
cs.AI
Abstract
Diffusion Language Models (DLMs) have been seen as a promising competitor to
autoregressive language models. However, diffusion language models have long
been constrained by slow inference. A core challenge is that their
non-autoregressive architecture and bidirectional attention preclude the
key-value cache that accelerates decoding. We address this bottleneck by
proposing a KV-cache-like mechanism, delayed KV-Cache, for the denoising
process of DLMs. Our approach is motivated by the observation that different
tokens have distinct representation dynamics throughout the diffusion process.
Accordingly, we propose a delayed and conditioned caching strategy for key and
value states. We design two complementary variants to cache key and value
step-by-step: (1) dKV-Cache-Decode, which provides almost lossless
acceleration, and even improves performance on long sequences, suggesting that
existing DLMs may under-utilise contextual information during inference. (2)
dKV-Cache-Greedy, which applies aggressive caching with a reduced cache lifespan, achieving
higher speed-ups with quadratic time complexity at the cost of some performance
degradation. Overall, dKV-Cache achieves a 2-10x speedup in inference, largely
narrowing the gap between autoregressive models and DLMs. We evaluate dKV-Cache on
several benchmarks, delivering acceleration across general language
understanding, mathematical, and code-generation benchmarks. Experiments
demonstrate that caching can also be used in DLMs, even applied to current DLMs
in a training-free manner.
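To make the delayed-caching idea concrete, below is a minimal, hypothetical Python sketch of how a dKV-Cache-Decode-style denoising loop could look. It is not the authors' implementation: the model interface (`model(x, kv_cache=...)` returning per-position key and value states), the confidence-based unmasking rule, and the one-step delay before caching are assumptions made for illustration, based only on the abstract's description of a delayed, conditioned cache for key/value states of already-decoded tokens.

```python
# Conceptual sketch of a delayed KV-cache in a masked-diffusion decoding loop.
# NOT the paper's code: the model API, the confidence-based unmasking rule, and
# the exact one-step caching delay are illustrative assumptions.

import torch


@torch.no_grad()
def denoise_with_delayed_kv_cache(model, x, mask_id, num_steps, tokens_per_step):
    """x: (batch, seq_len) token ids; generation positions start as `mask_id`."""
    batch, seq_len = x.shape
    kv_cache = {}  # (batch_idx, position) -> (key, value); reused by `model` instead of recomputing
    newly_decoded = torch.zeros(batch, seq_len, dtype=torch.bool, device=x.device)

    for step in range(num_steps):
        # Delayed caching: positions decoded in the *previous* step are cached now,
        # after their representations have been recomputed once with the real token in place.
        cacheable = newly_decoded.clone()

        # Hypothetical model interface: returns logits plus per-position key/value
        # states, and looks up `kv_cache` for positions that are already cached.
        logits, keys, values = model(x, kv_cache=kv_cache)

        for b, p in cacheable.nonzero(as_tuple=False).tolist():
            kv_cache[(b, p)] = (keys[b, p].clone(), values[b, p].clone())

        # Standard masked-diffusion decoding step: commit the most confident masked tokens.
        still_masked = x == mask_id
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        conf = conf.masked_fill(~still_masked, float("-inf"))
        topk = conf.topk(tokens_per_step, dim=-1).indices  # (batch, tokens_per_step)

        newly_decoded = torch.zeros_like(newly_decoded)
        newly_decoded.scatter_(1, topk, True)
        x = torch.where(newly_decoded, pred, x)

    return x
```

The delay is the key design choice suggested by the abstract: a token's representation changes most at the step it is decoded, so caching its key/value states one step later trades a small amount of recomputation for more stable cached states.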