Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction
August 4, 2025
Authors: Yuerong Song, Xiaoran Liu, Ruixiao Li, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu
cs.AI
Abstract
Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding but suffer from prohibitive quadratic computational complexity and memory overhead during inference. Current caching techniques accelerate decoding by storing full-layer states, yet impose substantial memory usage that limits long-context applications. Our analysis of attention patterns in dLLMs reveals persistent cross-layer sparsity, with pivotal tokens remaining salient across decoding steps and low-relevance tokens staying unimportant, motivating selective cache eviction. We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention via delayed bidirectional sparse caching. By leveraging the stability of token saliency over steps, it retains critical tokens and dynamically evicts unimportant prefix/suffix entries using an attention-guided strategy. Extensive experiments on the LLaDA and Dream series demonstrate that Sparse-dLLM achieves up to 10× higher throughput than vanilla dLLMs, with comparable performance and similar peak memory costs, outperforming previous methods in efficiency and effectiveness.
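
To make the attention-guided eviction idea concrete, below is a minimal sketch, assuming a simplified single-head cache and a toy saliency measure (total attention mass each cached token receives at the current step). The function name, tensor shapes, and `keep_ratio` threshold are illustrative assumptions; this is not Sparse-dLLM's actual delayed bidirectional sparse caching implementation.

```python
# Illustrative sketch only: attention-guided cache eviction for a single
# attention head. Saliency is approximated as the total attention mass a
# cached token receives; only the top fraction of tokens is retained.
import torch


def evict_cache(key_cache: torch.Tensor,
                value_cache: torch.Tensor,
                attn_weights: torch.Tensor,
                keep_ratio: float = 0.5):
    """Keep the most salient cached tokens according to attention mass.

    key_cache, value_cache: [num_tokens, head_dim] cached states.
    attn_weights: [num_queries, num_tokens] attention probabilities
        from the current decoding step.
    keep_ratio: fraction of cached tokens to retain (assumed hyperparameter).
    """
    # Saliency of each cached token = total attention it receives.
    saliency = attn_weights.sum(dim=0)                        # [num_tokens]
    num_keep = max(1, int(keep_ratio * saliency.numel()))
    # Select the most salient tokens and keep them in original order.
    keep_idx = saliency.topk(num_keep).indices.sort().values
    return key_cache[keep_idx], value_cache[keep_idx], keep_idx


if __name__ == "__main__":
    torch.manual_seed(0)
    n_tokens, head_dim, n_queries = 16, 8, 4
    k = torch.randn(n_tokens, head_dim)
    v = torch.randn(n_tokens, head_dim)
    attn = torch.softmax(torch.randn(n_queries, n_tokens), dim=-1)
    k_kept, v_kept, idx = evict_cache(k, v, attn, keep_ratio=0.5)
    print("retained token indices:", idx.tolist())
```

In the paper's setting, an eviction step of this kind would be applied to both prefix and suffix cache entries, relying on the observed stability of token saliency across decoding steps so that evicted low-relevance tokens are unlikely to become important later.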