LeanK: Learnable K Cache Channel Pruning for Efficient Decoding
August 4, 2025
Authors: Yike Zhang, Zhiyuan He, Huiqiang Jiang, Chengruidong Zhang, Yuqing Yang, Jianyong Wang, Lili Qiu
cs.AI
Abstract
Large language models (LLMs) enable long-context tasks but face efficiency challenges due to the growing key-value (KV) cache. We propose LeanK, a learning-based method that prunes unimportant key (K) cache channels by leveraging static channel sparsity. With a novel two-stage training process, LeanK learns a channel-wise static mask that satisfies a specific sparsity ratio and hardware alignment requirements. LeanK reduces GPU memory and accelerates decoding without sacrificing accuracy. Experiments demonstrate up to 70% K cache and 16%-18% V cache memory reduction. A custom decoding kernel enables a 1.3x speedup for attention computation. We also provide insights into model channels and attention heads during long-context inference by analyzing the learned importance distribution. Our code is available at https://aka.ms/LeanK.
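
To make the channel-pruning idea concrete, below is a minimal PyTorch sketch of decoding-step attention with a static, channel-wise K mask. It is illustrative only: the mask here is random, whereas LeanK learns it through its two-stage training process, and names such as `channel_mask`, `keep_ratio`, and the per-head Python loop are assumptions of this sketch, not LeanK's actual implementation (which relies on a learned mask and a custom decoding kernel).

```python
import torch

# Illustrative shapes: batch B, heads H, cached sequence length S, head dim D.
B, H, S, D = 1, 8, 1024, 128
keep_ratio = 0.3  # keep ~30% of K channels, i.e. roughly the 70% pruning the paper reports

k_cache = torch.randn(B, H, S, D)   # key cache (in practice only kept channels would be stored)
v_cache = torch.randn(B, H, S, D)   # value cache, untouched in this sketch
q = torch.randn(B, H, 1, D)         # query for the current decoding step

# A static, channel-wise boolean mask per head (True = keep the channel).
# LeanK learns this mask offline; here it is random purely for demonstration.
num_keep = int(D * keep_ratio)
channel_mask = torch.zeros(H, D, dtype=torch.bool)
for h in range(H):
    channel_mask[h, torch.randperm(D)[:num_keep]] = True

# Decoding-step attention using only the kept K channels of each head.
outputs = []
for h in range(H):
    mask = channel_mask[h]
    k_h = k_cache[:, h][..., mask]        # (B, S, num_keep): pruned K for this head
    q_h = q[:, h][..., mask]              # (B, 1, num_keep): query pruned to match
    scores = q_h @ k_h.transpose(-1, -2) / D**0.5   # (B, 1, S); original 1/sqrt(D) scaling kept
    attn = torch.softmax(scores, dim=-1)
    outputs.append(attn @ v_cache[:, h])  # (B, 1, D): V cache is not channel-pruned here

out = torch.stack(outputs, dim=1)         # (B, H, 1, D) attention output
print(out.shape)
```

In this toy setup the only saving comes from storing and reading fewer K channels per head; the reported end-to-end memory and 1.3x attention speedups additionally depend on LeanK's hardware-aligned mask and custom kernel.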