
Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction

January 25, 2026
Authors: Jang-Hyun Kim, Dongyoon Han, Sangdoo Yun
cs.AI

Abstract

Efficient key-value (KV) cache management is crucial for the practical deployment of large language models (LLMs), yet existing compression techniques often incur a trade-off between performance degradation and computational overhead. We propose a novel gating-based KV cache eviction method for frozen-weight LLMs that achieves high compression ratios with negligible computational cost. Our approach introduces lightweight sink-attention gating modules to identify and retain critical KV pairs, and integrates seamlessly into both the prefill and decoding stages. The proposed gate training algorithm relies on forward passes of an LLM, avoiding expensive backpropagation, while achieving strong task generalization through a task-agnostic reconstruction objective. Extensive experiments across the Qwen2.5-1M, Qwen3, and Gemma3 families show that our method maintains near-lossless performance while evicting up to 70% of the KV cache. The results are consistent across a wide range of tasks, including long-context understanding, code comprehension, and mathematical reasoning, demonstrating the generality of our approach.
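The abstract does not spell out the gating computation, but the core idea it describes, scoring cached KV pairs with a lightweight sink-attention gate and evicting the lowest-scoring fraction, can be illustrated with a minimal sketch. Everything below (the tensor shapes, the `gate_scores` function, per-head top-k selection, and the 30% keep ratio) is an illustrative assumption, not the paper's actual implementation.

```python
import torch

def gate_scores(keys: torch.Tensor, sink_queries: torch.Tensor) -> torch.Tensor:
    """Score each cached key with a lightweight sink-attention gate.

    keys:         (num_heads, seq_len, head_dim) cached keys of one layer
    sink_queries: (num_heads, num_sinks, head_dim) learned gate parameters
                  (hypothetical; the gate design is not given in the abstract)
    Returns a (num_heads, seq_len) importance score per cached position.
    """
    # Attention of the learned sink queries over every cached key,
    # averaged across sinks to give one score per KV position.
    logits = torch.einsum("hsd,htd->hst", sink_queries, keys) / keys.shape[-1] ** 0.5
    return logits.softmax(dim=-1).mean(dim=1)  # (num_heads, seq_len)

def evict_kv(keys, values, sink_queries, keep_ratio=0.3):
    """Keep the top `keep_ratio` fraction of KV pairs per head (e.g. evict ~70%)."""
    scores = gate_scores(keys, sink_queries)
    k = max(1, int(keys.shape[1] * keep_ratio))
    idx = scores.topk(k, dim=-1).indices.sort(dim=-1).values  # keep original order
    gather = idx.unsqueeze(-1).expand(-1, -1, keys.shape[-1])
    return keys.gather(1, gather), values.gather(1, gather)
```

In the paper's setting, the gate parameters would be trained with forward passes only under a task-agnostic reconstruction objective and applied during both prefill and decoding; here `sink_queries` is just a placeholder tensor standing in for that trained module.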