Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction
January 25, 2026
Authors: Jang-Hyun Kim, Dongyoon Han, Sangdoo Yun
cs.AI
Abstract
Efficient key-value (KV) cache management is crucial for the practical deployment of large language models (LLMs), yet existing compression techniques often force a trade-off between performance degradation and computational overhead. We propose a novel gating-based KV cache eviction method for frozen-weight LLMs that achieves high compression ratios at negligible computational cost. Our approach introduces lightweight sink-attention gating modules that identify and retain critical KV pairs, and it integrates seamlessly into both the prefill and decoding stages. The proposed gate training algorithm relies solely on forward passes of the LLM, avoiding expensive backpropagation, while achieving strong task generalization through a task-agnostic reconstruction objective. Extensive experiments across the Qwen2.5-1M, Qwen3, and Gemma3 families show that our method maintains near-lossless performance while evicting up to 70% of the KV cache. The results are consistent across a wide range of tasks, including long-context understanding, code comprehension, and mathematical reasoning, demonstrating the generality of our approach.
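The abstract does not spell out the gate design, but a minimal sketch of gating-based KV eviction might look as follows, assuming a PyTorch setting. The names GatingModule and evict_kv, the single-sink-query scoring rule, and the keep_ratio parameter are illustrative assumptions, not the paper's actual method.

```python
import torch
import torch.nn as nn


class GatingModule(nn.Module):
    """Lightweight gate that scores each cached KV pair for one head.

    Hypothetical design: a single learned "sink" query attends over the
    cached keys, and the resulting attention weights act as importance
    scores. Only this small module would be trained; the LLM's own
    weights stay frozen.
    """

    def __init__(self, head_dim: int):
        super().__init__()
        self.sink_query = nn.Parameter(torch.randn(head_dim) * head_dim**-0.5)

    def forward(self, keys: torch.Tensor) -> torch.Tensor:
        # keys: (seq_len, head_dim) -> importance scores: (seq_len,)
        logits = keys @ self.sink_query * keys.shape[-1] ** -0.5
        return torch.softmax(logits, dim=-1)


def evict_kv(keys, values, gate, keep_ratio=0.3):
    """Keep the top `keep_ratio` fraction of KV pairs by gate score.

    keep_ratio=0.3 corresponds to evicting 70% of the cache, the
    compression level cited in the abstract.
    """
    scores = gate(keys)
    k = max(1, int(keep_ratio * keys.shape[0]))
    idx = scores.topk(k).indices.sort().values  # keep sequence order
    return keys[idx], values[idx]


# Usage: score a 1024-token cache for a single head and keep 30% of it.
if __name__ == "__main__":
    seq_len, head_dim = 1024, 128
    keys = torch.randn(seq_len, head_dim)
    values = torch.randn(seq_len, head_dim)
    gate = GatingModule(head_dim)
    small_k, small_v = evict_kv(keys, values, gate)
    print(small_k.shape, small_v.shape)  # torch.Size([307, 128]) each
```

Re-sorting the surviving indices keeps the compacted cache in sequence order, so decoding can append new KV pairs to it as usual.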