d^2Cache: Accelerating Diffusion-Based LLMs via Dual Adaptive Caching
September 27, 2025
Authors: Yuchu Jiang, Yue Cai, Xiangzhong Luo, Jiale Fu, Jiarui Wang, Chonghan Liu, Xu Yang
cs.AI
Abstract
Diffusion-based large language models (dLLMs), despite their promising
performance, still suffer from inferior inference efficiency. This is because
dLLMs rely on bidirectional attention and cannot directly benefit from the
standard key-value (KV) cache as autoregressive models (ARMs) do. To tackle
this issue, we introduce Dual aDaptive Cache (d^2Cache), which is a
training-free approximate KV cache framework for accelerating dLLM inference.
d^2Cache features a two-stage fine-grained selection strategy to identify
tokens and adaptively update their KV states at each decoding step, while
caching the KV states of the remaining tokens for reuse. Furthermore,
d^2Cache naturally offers a more reliable decoding alternative, which can
enable quasi left-to-right generation and mitigate premature overconfidence in
tokens at the end of the sequence. Extensive experimental results on two
representative dLLMs (i.e., LLaDA and Dream) demonstrate that d^2Cache not
only achieves substantial inference speedups, but also yields consistent
improvements in generation quality. The code is available at
https://github.com/Kamichanw/d2Cache.
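
To make the caching idea concrete, below is a minimal, self-contained sketch of the general pattern the abstract describes: at each decoding step, only a small budget of token positions has its KV states recomputed, while all other positions reuse cached entries, and one token is committed per step. The scoring rule, the `fake_forward` stand-in for the model, and the fixed `BUDGET` are illustrative assumptions, not the paper's actual two-stage fine-grained selection; see the linked repository for the real implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
SEQ_LEN, D_HEAD, BUDGET = 16, 8, 4   # toy sizes, not taken from the paper

def fake_forward(positions):
    """Stand-in for a dLLM forward pass: returns fresh K/V states and
    per-token confidences for the requested positions (random here)."""
    n = len(positions)
    return rng.normal(size=(n, D_HEAD)), rng.normal(size=(n, D_HEAD)), rng.random(n)

# One full initial pass fills the cache; afterwards only selected entries are refreshed.
all_pos = np.arange(SEQ_LEN)
k_cache, v_cache, confidence = fake_forward(all_pos)
masked = np.ones(SEQ_LEN, dtype=bool)          # True = token is still [MASK]

while masked.any():
    # Selection (simplified to one stage): refresh the KV states of the
    # least-confident masked positions, up to a fixed budget; every other
    # position reuses its cached K/V instead of being recomputed.
    cand = all_pos[masked]
    refresh = cand[np.argsort(confidence[cand])[:BUDGET]]
    k_new, v_new, conf_new = fake_forward(refresh)
    k_cache[refresh], v_cache[refresh] = k_new, v_new
    confidence[refresh] = conf_new

    # Commit the single most confident refreshed token this step; biasing the
    # choice toward earlier positions would give the quasi left-to-right
    # behaviour mentioned in the abstract.
    best = refresh[np.argmax(confidence[refresh])]
    masked[best] = False

print("decoded all", SEQ_LEN, "tokens; cached KV shape:", k_cache.shape)
```

The point of the sketch is the asymmetry it encodes: the per-step cost scales with BUDGET rather than with the full sequence length, which is where the speedup over recomputing all bidirectional KV states at every step comes from.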