LeanK: Learnable K Cache Channel Pruning for Efficient Decoding
August 4, 2025
Authors: Yike Zhang, Zhiyuan He, Huiqiang Jiang, Chengruidong Zhang, Yuqing Yang, Jianyong Wang, Lili Qiu
cs.AI
Abstract
Large language models (LLMs) enable long-context tasks but face efficiency challenges due to the growing key-value (KV) cache. We propose LeanK, a learning-based method that prunes unimportant key (K) cache channels by leveraging static channel sparsity. With a novel two-stage training process, LeanK learns a channel-wise static mask that satisfies a specific sparsity ratio and hardware alignment requirements. LeanK reduces GPU memory and accelerates decoding without sacrificing accuracy. Experiments demonstrate up to 70% K cache and 16%-18% V cache memory reduction. A custom decoding kernel enables a 1.3x speedup for attention computation. We also provide insights into model channels and attention heads during long-context inference by analyzing the learned importance distribution. Our code is available at https://aka.ms/LeanK.
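
To make the channel-pruning idea concrete, below is a minimal PyTorch sketch of decoding-step attention with a static, channel-wise K mask. It is illustrative only: the mask here is random, whereas LeanK learns it through its two-stage training process, and names such as `channel_mask`, `keep_ratio`, and the per-head Python loop are assumptions of this sketch, not LeanK's actual implementation (which relies on a learned mask and a custom decoding kernel).

```python
import torch

# Illustrative shapes: batch B, heads H, cached sequence length S, head dim D.
B, H, S, D = 1, 8, 1024, 128
keep_ratio = 0.3  # keep ~30% of K channels, i.e. roughly the 70% pruning the paper reports

k_cache = torch.randn(B, H, S, D)   # key cache (in practice only kept channels would be stored)
v_cache = torch.randn(B, H, S, D)   # value cache, untouched in this sketch
q = torch.randn(B, H, 1, D)         # query for the current decoding step

# A static, channel-wise boolean mask per head (True = keep the channel).
# LeanK learns this mask offline; here it is random purely for demonstration.
num_keep = int(D * keep_ratio)
channel_mask = torch.zeros(H, D, dtype=torch.bool)
for h in range(H):
    channel_mask[h, torch.randperm(D)[:num_keep]] = True

# Decoding-step attention using only the kept K channels of each head.
outputs = []
for h in range(H):
    mask = channel_mask[h]
    k_h = k_cache[:, h][..., mask]        # (B, S, num_keep): pruned K for this head
    q_h = q[:, h][..., mask]              # (B, 1, num_keep): query pruned to match
    scores = q_h @ k_h.transpose(-1, -2) / D**0.5   # (B, 1, S); original 1/sqrt(D) scaling kept
    attn = torch.softmax(scores, dim=-1)
    outputs.append(attn @ v_cache[:, h])  # (B, 1, D): V cache is not channel-pruned here

out = torch.stack(outputs, dim=1)         # (B, H, 1, D) attention output
print(out.shape)
```

In this toy setup the only saving comes from storing and reading fewer K channels per head; the reported end-to-end memory and 1.3x attention speedups additionally depend on LeanK's hardware-aligned mask and custom kernel.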