

d^2Cache: Accelerating Diffusion-Based LLMs via Dual Adaptive Caching

September 27, 2025
Authors: Yuchu Jiang, Yue Cai, Xiangzhong Luo, Jiale Fu, Jiarui Wang, Chonghan Liu, Xu Yang
cs.AI

Abstract

Diffusion-based large language models (dLLMs), despite their promising performance, still suffer from inferior inference efficiency. This is because dLLMs rely on bidirectional attention and cannot directly benefit from the standard key-value (KV) cache as autoregressive models (ARMs) do. To tackle this issue, we introduce Dual aDaptive Cache (d^2Cache), which is a training-free approximate KV cache framework for accelerating dLLM inference. d^2Cache features a two-stage fine-grained selection strategy to identify tokens and adaptively update their KV states at each decoding step, while caching the KV states of the remaining tokens for reuse. Furthermore, d^2Cache naturally offers a more reliable decoding alternative, which can enable quasi left-to-right generation and mitigate premature overconfidence in tokens at the end of the sequence. Extensive experimental results on two representative dLLMs (i.e., LLaDA and Dream) demonstrate that d^2Cache not only achieves substantial inference speedups, but also yields consistent improvements in generation quality. The code is available at https://github.com/Kamichanw/d2Cache.
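To make the core idea concrete, below is a minimal sketch of an approximate KV cache that, at each decoding step, refreshes the KV states of only a selected subset of tokens and reuses cached states for the rest. Everything here is an illustrative assumption: the names (`select_tokens`, `W_k`, `W_v`, `top_k`) and the change-based scoring rule are stand-ins, not the paper's actual two-stage fine-grained selection strategy, which the abstract does not specify.

```python
import numpy as np

# Hypothetical shapes and linear projections; a real dLLM would use full
# transformer layers and bidirectional attention at every denoising step.
rng = np.random.default_rng(0)
seq_len, d_model, top_k, num_steps = 16, 8, 4, 3
W_k = rng.standard_normal((d_model, d_model))
W_v = rng.standard_normal((d_model, d_model))

# Cached KV states for all tokens, initialized from one full pass.
hidden = rng.standard_normal((seq_len, d_model))
k_cache, v_cache = hidden @ W_k, hidden @ W_v

def select_tokens(scores, k):
    """Pick the k tokens whose cached KV states look most stale.
    This scoring rule is an illustrative placeholder for the paper's
    two-stage selection criterion."""
    return np.argsort(scores)[-k:]

for step in range(num_steps):
    # In a real dLLM step, `hidden` would come from the denoising pass.
    hidden = hidden + 0.1 * rng.standard_normal((seq_len, d_model))
    change_signal = np.linalg.norm(hidden @ W_k - k_cache, axis=-1)

    refresh = select_tokens(change_signal, top_k)
    # Adaptively update the KV states of the selected tokens only ...
    k_cache[refresh] = hidden[refresh] @ W_k
    v_cache[refresh] = hidden[refresh] @ W_v
    # ... while the remaining tokens reuse their cached KV states as-is.
    print(f"step {step}: refreshed tokens {sorted(refresh.tolist())}")
```

The point of the sketch is the cost model: only `top_k` of `seq_len` tokens pay for KV recomputation per step, which is where the claimed speedup over recomputing all KV states under bidirectional attention would come from.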