Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction

Abstract

Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding, but they suffer from prohibitive quadratic computational complexity and memory overhead during inference. Current caching techniques accelerate decoding by storing full-layer states, yet they impose substantial memory usage that limits long-context applications. Our analysis of attention patterns in dLLMs reveals persistent cross-layer sparsity: pivotal tokens remain salient across decoding steps, while low-relevance tokens stay unimportant. This observation motivates selective cache eviction. We propose Sparse-dLLM, the first training-free framework that integrates dynamic cache eviction with sparse attention via delayed bidirectional sparse caching. By leveraging the stability of token saliency over steps, it retains critical tokens and dynamically evicts unimportant prefix/suffix entries using an attention-guided strategy. Extensive experiments on the LLaDA and Dream series demonstrate that Sparse-dLLM achieves up to 10× higher throughput than vanilla dLLMs, with comparable performance and similar peak memory costs, outperforming previous methods in both efficiency and effectiveness.
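
The attention-guided eviction described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how saliency is scored or how the delay is scheduled. The sketch assumes that each cached prefix/suffix token is scored by the total softmax attention it receives from the queries of the block being decoded, and that a fixed fraction of the highest-scoring entries is kept. The names `select_kept_positions`, `evict`, and `keep_ratio` are hypothetical and chosen only for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_kept_positions(queries, cached_keys, keep_ratio=0.5):
    """Return the sorted indices of cache entries to retain.

    queries:     query vectors of the block currently being decoded
    cached_keys: key vectors of the cached prefix/suffix tokens
    keep_ratio:  fraction of cache entries to keep (assumed hyperparameter)
    """
    if not cached_keys:
        return []
    scale = 1.0 / math.sqrt(len(cached_keys[0]))
    importance = [0.0] * len(cached_keys)
    for q in queries:
        # Scaled dot-product attention of one query over all cached keys.
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in cached_keys]
        for i, weight in enumerate(softmax(logits)):
            importance[i] += weight  # accumulate attention mass per token
    n_keep = max(1, math.ceil(keep_ratio * len(cached_keys)))
    ranked = sorted(range(len(cached_keys)),
                    key=lambda i: importance[i], reverse=True)
    # Restore positional order so the retained cache stays aligned.
    return sorted(ranked[:n_keep])


def evict(cache, kept):
    """Drop every cache entry whose index is not in `kept`."""
    return [cache[i] for i in kept]


if __name__ == "__main__":
    queries = [[1.0, 0.0]]
    keys = [[5.0, 0.0], [0.0, 5.0], [4.0, 0.0], [-5.0, 0.0]]
    kept = select_kept_positions(queries, keys, keep_ratio=0.5)
    print(kept, evict(["k0", "k1", "k2", "k3"], kept))
```

Because the abstract reports that token saliency is stable across decoding steps, a selection of this kind could plausibly be computed once and reused for later steps rather than recomputed every step. That reuse policy is likewise an assumption of this sketch.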
