Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction
August 4, 2025
Authors: Yuerong Song, Xiaoran Liu, Ruixiao Li, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu
cs.AI
Abstract
Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and
parallel decoding but suffer from prohibitive quadratic computational
complexity and memory overhead during inference. Current caching techniques
accelerate decoding by storing full-layer states, yet incur substantial memory
usage that limits long-context applications. Our analysis of attention patterns
in dLLMs reveals persistent cross-layer sparsity, with pivotal tokens remaining
salient across decoding steps and low-relevance tokens staying unimportant,
motivating selective cache eviction. We propose Sparse-dLLM, the first
training-free framework integrating dynamic cache eviction with sparse
attention via delayed bidirectional sparse caching. By leveraging the stability
of token saliency over steps, it retains critical tokens and dynamically evicts
unimportant prefix/suffix entries using an attention-guided strategy. Extensive
experiments on the LLaDA and Dream series demonstrate that Sparse-dLLM achieves
up to 10× higher throughput than vanilla dLLMs, with comparable performance
and similar peak memory costs, outperforming previous methods in efficiency and
effectiveness.
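
To make the attention-guided eviction idea concrete, below is a minimal sketch, not the authors' implementation: the function name `evict_cache_entries`, the `keep_ratio` parameter, and the tensor shapes are all assumptions. It scores each cached token by the attention mass it receives (averaged over heads and query positions), keeps the most salient entries, and drops the rest from the key/value cache.

```python
import torch

def evict_cache_entries(key_cache, value_cache, attn_weights, keep_ratio=0.5):
    """Hypothetical attention-guided cache eviction sketch.

    key_cache, value_cache: [batch, heads, seq_len, head_dim]
    attn_weights:           [batch, heads, num_queries, seq_len]
    """
    # Saliency of each cached token: the attention it receives,
    # averaged over heads and query positions.
    token_saliency = attn_weights.mean(dim=(1, 2))            # [batch, seq_len]

    seq_len = key_cache.size(2)
    num_keep = max(1, int(seq_len * keep_ratio))

    # Keep the most salient tokens, preserving their original order.
    top_idx = token_saliency.topk(num_keep, dim=-1).indices   # [batch, num_keep]
    top_idx, _ = top_idx.sort(dim=-1)

    # Gather the retained entries from both caches.
    gather_idx = top_idx[:, None, :, None].expand(
        -1, key_cache.size(1), -1, key_cache.size(3)
    )
    pruned_keys = key_cache.gather(2, gather_idx)
    pruned_values = value_cache.gather(2, gather_idx)
    return pruned_keys, pruned_values, top_idx


if __name__ == "__main__":
    batch, heads, seq_len, head_dim, num_queries = 1, 8, 64, 32, 4
    k = torch.randn(batch, heads, seq_len, head_dim)
    v = torch.randn(batch, heads, seq_len, head_dim)
    attn = torch.softmax(torch.randn(batch, heads, num_queries, seq_len), dim=-1)
    k_small, v_small, kept = evict_cache_entries(k, v, attn, keep_ratio=0.25)
    print(k_small.shape)  # torch.Size([1, 8, 16, 32])
```

Since the paper reports that token saliency stays stable across decoding steps, an eviction pass of this kind would only need to run periodically rather than after every step, which is what keeps the overhead low relative to recomputing full-layer states.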