d^2Cache: Accelerating Diffusion-Based LLMs via Dual Adaptive Caching
September 27, 2025
Authors: Yuchu Jiang, Yue Cai, Xiangzhong Luo, Jiale Fu, Jiarui Wang, Chonghan Liu, Xu Yang
cs.AI
Abstract
Diffusion-based large language models (dLLMs), despite their promising
performance, still suffer from inferior inference efficiency. This is because
dLLMs rely on bidirectional attention and cannot directly benefit from the
standard key-value (KV) cache as autoregressive models (ARMs) do. To tackle
this issue, we introduce Dual aDaptive Cache (d^2Cache), which is a
training-free approximate KV cache framework for accelerating dLLM inference.
d^2Cache features a two-stage fine-grained selection strategy to identify
tokens and adaptively update their KV states at each decoding step, while
caching the KV states of the remaining tokens for reuse. Furthermore,
d^2Cache naturally offers a more reliable decoding alternative, which can
enable quasi left-to-right generation and mitigate premature overconfidence in
tokens at the end of the sequence. Extensive experimental results on two
representative dLLMs (i.e., LLaDA and Dream) demonstrate that d^2Cache not
only achieves substantial inference speedups, but also yields consistent
improvements in generation quality. The code is available at
https://github.com/Kamichanw/d2Cache.
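
To make the caching idea concrete, below is a minimal, self-contained sketch of the general pattern the abstract describes: at each decoding step, only a small budget of token positions has its KV states recomputed, while all other positions reuse cached entries, and one token is committed per step. The scoring rule, the `fake_forward` stand-in for the model, and the fixed `BUDGET` are illustrative assumptions, not the paper's actual two-stage fine-grained selection; see the linked repository for the real implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
SEQ_LEN, D_HEAD, BUDGET = 16, 8, 4   # toy sizes, not taken from the paper

def fake_forward(positions):
    """Stand-in for a dLLM forward pass: returns fresh K/V states and
    per-token confidences for the requested positions (random here)."""
    n = len(positions)
    return rng.normal(size=(n, D_HEAD)), rng.normal(size=(n, D_HEAD)), rng.random(n)

# One full initial pass fills the cache; afterwards only selected entries are refreshed.
all_pos = np.arange(SEQ_LEN)
k_cache, v_cache, confidence = fake_forward(all_pos)
masked = np.ones(SEQ_LEN, dtype=bool)          # True = token is still [MASK]

while masked.any():
    # Selection (simplified to one stage): refresh the KV states of the
    # least-confident masked positions, up to a fixed budget; every other
    # position reuses its cached K/V instead of being recomputed.
    cand = all_pos[masked]
    refresh = cand[np.argsort(confidence[cand])[:BUDGET]]
    k_new, v_new, conf_new = fake_forward(refresh)
    k_cache[refresh], v_cache[refresh] = k_new, v_new
    confidence[refresh] = conf_new

    # Commit the single most confident refreshed token this step; biasing the
    # choice toward earlier positions would give the quasi left-to-right
    # behaviour mentioned in the abstract.
    best = refresh[np.argmax(confidence[refresh])]
    masked[best] = False

print("decoded all", SEQ_LEN, "tokens; cached KV shape:", k_cache.shape)
```

The point of the sketch is the asymmetry it encodes: the per-step cost scales with BUDGET rather than with the full sequence length, which is where the speedup over recomputing all bidirectional KV states at every step comes from.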