dKV-Cache: The Cache for Diffusion Language Models

May 21, 2025
作者: Xinyin Ma, Runpeng Yu, Gongfan Fang, Xinchao Wang
cs.AI

Abstract

Diffusion Language Models (DLMs) have been seen as a promising competitor to autoregressive language models. However, DLMs have long been constrained by slow inference. A core challenge is that their non-autoregressive architecture and bidirectional attention preclude the key-value cache that accelerates decoding. We address this bottleneck by proposing a KV-cache-like mechanism, delayed KV-Cache, for the denoising process of DLMs. Our approach is motivated by the observation that different tokens have distinct representation dynamics throughout the diffusion process. Accordingly, we propose a delayed and conditioned caching strategy for key and value states. We design two complementary variants to cache keys and values step by step: (1) dKV-Cache-Decode, which provides almost lossless acceleration and even improves performance on long sequences, suggesting that existing DLMs may under-utilise contextual information during inference; and (2) dKV-Cache-Greedy, which caches more aggressively with a reduced cache lifespan, achieving higher speed-ups with quadratic time complexity at the cost of some performance degradation. In the end, dKV-Cache achieves a 2-10x inference speedup, largely narrowing the gap between autoregressive models and DLMs. We evaluate dKV-Cache on several benchmarks, delivering acceleration across general language understanding, mathematical reasoning, and code-generation tasks. The experiments demonstrate that caching can also be used in DLMs, even in a training-free manner on top of current DLMs.
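
To make the delayed-caching idea concrete, below is a minimal, hedged sketch of a masked-diffusion decoding loop in which a position's key/value states are frozen and reused starting one step after that position is decoded, while still-masked positions are always recomputed. This is a toy illustration under our own assumptions (single attention head, random projection weights, a norm-based rule for choosing which masked token to decode), not the authors' implementation; names such as `denoise_step` and `attention` are hypothetical.

```python
# Toy sketch of a delayed KV-cache for a masked-diffusion decoding loop.
# NOT the paper's implementation: single attention head, random weights,
# and a norm-based decoding rule are stand-ins chosen for brevity.

import torch

torch.manual_seed(0)

D = 16          # hidden size
L = 8           # sequence length
MASK = -1       # id marking still-masked positions

# Random projections standing in for a trained bidirectional attention layer.
W_q = torch.randn(D, D) / D ** 0.5
W_k = torch.randn(D, D) / D ** 0.5
W_v = torch.randn(D, D) / D ** 0.5


def attention(h, k_cache, v_cache, cached):
    """Bidirectional attention that reuses frozen K/V for cached positions.

    A real implementation would skip recomputing K/V for cached positions
    entirely; here we recompute and then overwrite, to keep the code short.
    """
    q = h @ W_q
    k = torch.where(cached.unsqueeze(-1), k_cache, h @ W_k)
    v = torch.where(cached.unsqueeze(-1), v_cache, h @ W_v)
    attn = torch.softmax(q @ k.T / D ** 0.5, dim=-1)
    return attn @ v, k, v


def denoise_step(tokens, embed):
    """Toy hidden states: token embedding plus an offset for masked slots."""
    return embed[tokens.clamp(min=0)] + (tokens == MASK).unsqueeze(-1).float()


vocab = torch.randn(32, D)                  # toy embedding / output matrix
tokens = torch.full((L,), MASK)             # start from a fully masked sequence
decoded_at = torch.full((L,), -1)           # step at which each token was decoded
cached = torch.zeros(L, dtype=torch.bool)   # positions whose K/V is frozen
k_cache = torch.zeros(L, D)
v_cache = torch.zeros(L, D)

for step in range(L):
    h = denoise_step(tokens, vocab)
    out, k, v = attention(h, k_cache, v_cache, cached)

    # Delayed, conditioned caching: freeze a position's K/V only one step
    # AFTER it was decoded, i.e. from the first pass that already sees its
    # decoded identity; still-masked positions are never cached.
    to_cache = (decoded_at >= 0) & (decoded_at < step) & ~cached
    k_cache[to_cache] = k[to_cache]
    v_cache[to_cache] = v[to_cache]
    cached |= to_cache

    # "Decode" one masked position per step (largest output norm, greedily).
    masked = (tokens == MASK).nonzero(as_tuple=True)[0]
    pick = masked[out[masked].norm(dim=-1).argmax()]
    tokens[pick] = int((vocab @ out[pick]).argmax())
    decoded_at[pick] = step

print("decoded tokens:  ", tokens.tolist())
print("cached positions:", cached.nonzero(as_tuple=True)[0].tolist())
```

Following the abstract's description, a dKV-Cache-Decode-style variant would keep entries cached for the rest of decoding, whereas a dKV-Cache-Greedy-style variant would cache more aggressively but refresh entries on a shorter lifespan; in both cases, fewer positions need their key/value states recomputed at every denoising step.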
