Attention Is All You Need for KV Cache in Diffusion LLMs
October 16, 2025
Authors: Quan Nguyen-Tri, Mukul Ranjan, Zhiqiang Shen
cs.AI
Abstract
This work studies how to adaptively recompute key-value (KV) caches for
diffusion large language models (DLMs) to maximize prediction accuracy while
minimizing decoding latency. Prior methods' decoders recompute QKV for all
tokens at every denoising step and layer, despite KV states changing little
across most steps, especially in shallow layers, leading to substantial
redundancy. We make three observations: (1) distant {bf MASK} tokens
primarily act as a length-bias and can be cached block-wise beyond the active
prediction window; (2) KV dynamics increase with depth, suggesting that
selective refresh starting from deeper layers is sufficient; and (3) the
most-attended token exhibits the smallest KV drift, providing a conservative
lower bound on cache change for other tokens. Building on these, we propose
Elastic-Cache, a training-free, architecture-agnostic strategy that
jointly decides when to refresh (via an attention-aware drift test on the
most-attended token) and where to refresh (via a depth-aware schedule that
recomputes from a chosen layer onward while reusing shallow-layer caches and
off-window MASK caches). Unlike fixed-period schemes, Elastic-Cache performs
adaptive, layer-aware cache updates for diffusion LLMs, reducing redundant
computation and accelerating decoding with negligible loss in generation
quality. Experiments on LLaDA-Instruct, LLaDA-1.5, and LLaDA-V across
mathematical reasoning and code generation tasks demonstrate consistent
speedups: 8.7× on GSM8K (256 tokens), 45.1× on longer sequences,
and 4.8× on HumanEval, while consistently maintaining higher accuracy
than the baseline. Our method achieves significantly higher throughput
(6.8× on GSM8K) than existing confidence-based approaches while
preserving generation quality, enabling practical deployment of diffusion LLMs.
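
The two decisions described in the abstract, when to refresh and where to refresh, can be sketched compactly. The following is a minimal illustrative Python sketch, not the authors' released implementation: the tensor shapes, the drift threshold `tau`, and the helper names `should_refresh` and `refresh_plan` are assumptions introduced here for illustration.

```python
import torch


def should_refresh(attn_weights, k_cached, k_probe, tau=0.02):
    """Attention-aware drift test ("when" to refresh).

    attn_weights: [heads, queries, keys] attention map from the previous step.
    k_cached:     [keys, dim] cached key states for already-decoded tokens.
    k_probe:      [keys, dim] freshly recomputed key states (in practice only
                  the most-attended token would need recomputation).
    Returns (refresh, star): whether to refresh and the most-attended index.
    Shapes and the threshold value are illustrative assumptions.
    """
    star = attn_weights.sum(dim=(0, 1)).argmax().item()   # most-attended token
    rel_drift = torch.norm(k_probe[star] - k_cached[star]) / (
        torch.norm(k_cached[star]) + 1e-6
    )
    # Small drift on the most-attended token conservatively bounds the drift of
    # the other cached tokens, so the whole cache can be reused this step.
    return bool(rel_drift > tau), star


def refresh_plan(num_layers, refresh, start_layer):
    """Depth-aware schedule ("where" to refresh): keep shallow-layer caches and
    off-window MASK caches, recompute KV only from `start_layer` onward."""
    if not refresh:
        return []                                   # reuse every layer's cache
    return list(range(start_layer, num_layers))     # refresh deep layers only
```

Under this reading, each denoising step would first run `should_refresh` on the probe token; if it returns False, all cached KV (including block-wise MASK caches outside the active prediction window) are reused, and otherwise only the deeper layers returned by `refresh_plan` are recomputed.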