d^2Cache: Accelerating Diffusion-Based LLMs via Dual Adaptive Caching

Abstract

Diffusion-based large language models (dLLMs), despite their promising performance, still suffer from poor inference efficiency. Because dLLMs rely on bidirectional attention, they cannot directly benefit from the standard key-value (KV) cache that autoregressive models (ARMs) use. To address this, we introduce Dual aDaptive Cache (d^2Cache), a training-free approximate KV cache framework for accelerating dLLM inference. d^2Cache uses a two-stage fine-grained selection strategy to identify tokens and adaptively update their KV states at each decoding step, while caching the KV states of the remaining tokens for reuse. d^2Cache also naturally offers a more reliable decoding alternative that enables quasi left-to-right generation and mitigates premature overconfidence in tokens at the end of the sequence. Extensive experiments on two representative dLLMs (i.e., LLaDA and Dream) demonstrate that d^2Cache not only achieves substantial inference speedups but also yields consistent improvements in generation quality. The code is available at https://github.com/Kamichanw/d2Cache.
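The core idea in the abstract, refreshing KV states only for a selected subset of tokens and reusing cached states for the rest, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the two-stage selection criteria, so `select_top_k`, `refresh_cache`, `toy_kv`, and the score values below are hypothetical placeholders, and each token's KV state is reduced to a pair of floats.

```python
from typing import Callable, Dict, List, Set, Tuple

KV = Tuple[float, float]  # toy stand-in for one token's (key, value) state


def select_top_k(scores: List[float], k: int) -> Set[int]:
    """Hypothetical selection rule (an assumption, not the paper's):
    pick the k positions with the highest score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def refresh_cache(
    cache: Dict[int, KV],
    hidden: List[float],
    selected: Set[int],
    compute_kv: Callable[[float], KV],
) -> int:
    """Recompute KV for selected or not-yet-cached positions; reuse the rest.

    Returns how many positions were recomputed in this step.
    """
    recomputed = 0
    for pos, h in enumerate(hidden):
        if pos in selected or pos not in cache:
            cache[pos] = compute_kv(h)
            recomputed += 1
    return recomputed


def toy_kv(h: float) -> KV:
    """Placeholder projection standing in for the real key/value computation."""
    return (2.0 * h, h + 1.0)


cache: Dict[int, KV] = {}

# Step 1: cold start, so every position is computed once.
hidden = [0.5, 1.0, 1.5, 2.0]
first = refresh_cache(cache, hidden, set(), toy_kv)

# Step 2: hidden states at positions 1 and 3 changed; only those are refreshed.
hidden = [0.5, 1.25, 1.5, 2.5]
scores = [0.0, 0.9, 0.1, 0.8]  # hypothetical per-token importance scores
selected = select_top_k(scores, k=2)
second = refresh_cache(cache, hidden, selected, toy_kv)
```

In this toy run the second step recomputes two of four positions and reuses the other two. The cache is approximate in the same sense as in the abstract: any unselected position whose true state has drifted would keep a stale entry.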
