
Attention Is All You Need for KV Cache in Diffusion LLMs

October 16, 2025
Authors: Quan Nguyen-Tri, Mukul Ranjan, Zhiqiang Shen
cs.AI

Abstract

This work studies how to adaptively recompute key-value (KV) caches for diffusion large language models (DLMs) to maximize prediction accuracy while minimizing decoding latency. Prior methods' decoders recompute QKV for all tokens at every denoising step and layer, despite KV states changing little across most steps, especially in shallow layers, leading to substantial redundancy. We make three observations: (1) distant MASK tokens primarily act as a length bias and can be cached block-wise beyond the active prediction window; (2) KV dynamics increase with depth, suggesting that selective refresh starting from deeper layers is sufficient; and (3) the most-attended token exhibits the smallest KV drift, providing a conservative lower bound on cache change for other tokens. Building on these, we propose Elastic-Cache, a training-free, architecture-agnostic strategy that jointly decides when to refresh (via an attention-aware drift test on the most-attended token) and where to refresh (via a depth-aware schedule that recomputes from a chosen layer onward while reusing shallow-layer caches and off-window MASK caches). Unlike fixed-period schemes, Elastic-Cache performs adaptive, layer-aware cache updates for diffusion LLMs, reducing redundant computation and accelerating decoding with negligible loss in generation quality. Experiments on LLaDA-Instruct, LLaDA-1.5, and LLaDA-V across mathematical reasoning and code generation tasks demonstrate consistent speedups: 8.7× on GSM8K (256 tokens), 45.1× on longer sequences, and 4.8× on HumanEval, while consistently maintaining higher accuracy than the baseline. Our method achieves significantly higher throughput (6.8× on GSM8K) than existing confidence-based approaches while preserving generation quality, enabling practical deployment of diffusion LLMs.
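
To make the "when" and "where" decisions concrete, below is a minimal, runnable sketch of the refresh logic as described in the abstract. It is an illustration under assumed names (DRIFT_THRESHOLD, REFRESH_FROM_LAYER) and toy tensors standing in for real decoder activations, not the authors' implementation.

```python
# Illustrative sketch of the Elastic-Cache decision rule, NOT the authors' code.
# Toy tensors stand in for per-layer KV states of a diffusion LLM decoder.
import torch

torch.manual_seed(0)
NUM_LAYERS, SEQ, DIM = 8, 16, 32
DRIFT_THRESHOLD = 0.05      # hypothetical tolerance for relative KV drift
REFRESH_FROM_LAYER = 4      # hypothetical depth from which deep layers refresh

# Cached KV from the previous denoising step and freshly projected KV for the
# current step, one tensor per layer; drift grows with depth (observation 2).
cached_kv = [torch.randn(SEQ, DIM) for _ in range(NUM_LAYERS)]
new_kv = [kv + 0.01 * l * torch.randn(SEQ, DIM) for l, kv in enumerate(cached_kv)]

# Attention weights over decoded tokens (averaged over heads/queries); the
# most-attended token drifts least (observation 3), so it serves as a probe.
attn = torch.softmax(torch.randn(SEQ), dim=-1)
star = int(attn.argmax())

def drift(a: torch.Tensor, b: torch.Tensor) -> float:
    """Relative KV change of one token between two denoising steps."""
    return ((a - b).norm() / (b.norm() + 1e-8)).item()

# "When" to refresh: if even the most-attended token's KV at the chosen deep
# layer moved past the threshold, other tokens' caches are assumed stale too.
refresh = drift(new_kv[REFRESH_FROM_LAYER][star],
                cached_kv[REFRESH_FROM_LAYER][star]) > DRIFT_THRESHOLD

# "Where" to refresh: shallow layers always reuse their caches; deep layers
# recompute KV only when the drift test fires. Off-window MASK blocks would
# likewise keep their cached KV.
for l in range(NUM_LAYERS):
    if refresh and l >= REFRESH_FROM_LAYER:
        cached_kv[l] = new_kv[l]   # refresh deep-layer cache
    # else: reuse cached_kv[l] unchanged

print(f"most-attended token: {star}, deep layers refreshed: {refresh}")
```

In this sketch the probe token acts as a conservative trigger: because its KV drift lower-bounds that of other tokens, skipping the refresh when the probe stays below the threshold is the safe direction of error.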