Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
October 9, 2025
Authors: Wenjie Du, Li Jiang, Keda Tao, Xue Liu, Huan Wang
cs.AI
Abstract
Reasoning large language models exhibit complex reasoning behaviors through
extended chain-of-thought generation, creating unprecedented Key-Value (KV)
cache overhead during the decoding phase. Existing KV cache compression methods
underperform on reasoning models: token-dropping methods break reasoning
integrity by discarding critical information, while head-reallocating methods
mistakenly compress reasoning-critical heads since they are designed for
retrieval tasks, resulting in significant performance degradation as
compression rates increase. We hypothesize that KV heads exhibit functional
heterogeneity in reasoning models: some heads are critical for chain-of-thought
consistency, while others are compressible. To validate and exploit this
insight, we propose RLKV, a novel reasoning-critical head identification
framework, which uses reinforcement learning to directly optimize the
relationship between each head's cache usage and reasoning quality. As RLKV
produces rewards from actual generated samples during training, it naturally
identifies heads relevant to reasoning behaviors. We then allocate a full KV
cache to these heads while applying a compressed, constant-size KV cache to the others for
efficient inference. Our experiments reveal that only a small fraction of
attention heads is essential for reasoning, enabling our KV compression
approach to outperform baseline methods while achieving a 20-50% cache reduction
with near-lossless performance compared to uncompressed results.
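
To make the caching scheme concrete, below is a minimal illustrative sketch (not the paper's implementation) of the hybrid per-head allocation the abstract describes: heads identified as reasoning-critical keep their full key-value history, while all other heads keep only a constant-size cache (a few leading "sink" tokens plus a recent window). The class name HybridHeadKVCache and the parameters critical_heads, sink, and window are assumptions made for illustration.

    # Illustrative sketch of per-head hybrid KV caching (assumed names/parameters).
    import torch

    class HybridHeadKVCache:
        def __init__(self, num_heads, critical_heads, sink=4, window=128):
            self.num_heads = num_heads
            self.critical = set(critical_heads)   # heads that keep the full cache
            self.sink = sink                      # always-kept leading tokens
            self.window = window                  # recent tokens kept per compressed head
            # One (keys, values) buffer per head; each tensor is [seq_len, head_dim].
            self.keys = [None] * num_heads
            self.values = [None] * num_heads

        def append(self, head, k, v):
            """Append one decoding step's key/value (shape [1, head_dim]) for a head,
            then truncate non-critical heads to a constant-size cache."""
            self.keys[head] = k if self.keys[head] is None else torch.cat([self.keys[head], k])
            self.values[head] = v if self.values[head] is None else torch.cat([self.values[head], v])
            if head not in self.critical:
                ks, vs = self.keys[head], self.values[head]
                if ks.shape[0] > self.sink + self.window:
                    self.keys[head] = torch.cat([ks[: self.sink], ks[-self.window:]])
                    self.values[head] = torch.cat([vs[: self.sink], vs[-self.window:]])

        def get(self, head):
            return self.keys[head], self.values[head]

    # Toy usage: 8 heads, heads 0 and 3 treated as reasoning-critical.
    cache = HybridHeadKVCache(num_heads=8, critical_heads=[0, 3], sink=2, window=4)
    for step in range(10):
        for h in range(8):
            cache.append(h, torch.randn(1, 64), torch.randn(1, 64))
    k0, _ = cache.get(0)   # full history: 10 entries
    k1, _ = cache.get(1)   # constant cache: sink (2) + window (4) = 6 entries
    print(k0.shape[0], k1.shape[0])

In this sketch the memory for non-critical heads stays bounded regardless of generation length, which is the source of the cache reduction; how the critical heads are actually identified (the RL-based part of RLKV) is not reproduced here.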