
Which Heads Matter for Reasoning? RL-Guided KV Cache Compression

October 9, 2025
Authors: Wenjie Du, Li Jiang, Keda Tao, Xue Liu, Huan Wang
cs.AI

Abstract

Reasoning large language models exhibit complex reasoning behaviors through extended chain-of-thought generation, creating unprecedented Key-Value (KV) cache overhead during the decoding phase. Existing KV cache compression methods underperform on reasoning models: token-dropping methods break reasoning integrity by discarding critical information, while head-reallocating methods, designed for retrieval tasks, mistakenly compress reasoning-critical heads, causing significant performance degradation as compression rates increase. We hypothesize that KV heads exhibit functional heterogeneity in reasoning models: some heads are critical for chain-of-thought consistency, while others are compressible. To validate and exploit this insight, we propose RLKV, a novel reasoning-critical head identification framework that uses reinforcement learning to directly optimize the relationship between each head's cache usage and reasoning quality. Because RLKV derives rewards from samples actually generated during training, it naturally identifies heads relevant to reasoning behaviors. We then allocate full KV cache to these heads while applying a compressed, constant-size KV cache to the others for efficient inference. Our experiments reveal that only a small fraction of attention heads is essential for reasoning, enabling our KV compression approach to outperform baseline methods while achieving 20-50% cache reduction with near-lossless performance relative to uncompressed results.
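The allocation scheme the abstract describes (full KV cache for reasoning-critical heads, a constant-size budget for the rest) can be pictured with a minimal sketch. Everything below is a hypothetical illustration, not the paper's implementation: the class name, the window and sink parameters, and the per-head tensor layout are all assumptions; only the full-versus-constant split per head comes from the abstract.

```python
# Minimal sketch of a hybrid per-head KV cache: heads flagged as
# reasoning-critical (e.g., by a procedure like RLKV) keep their full
# history, while all other heads keep a constant-size cache. The "sink"
# tokens and recency window are a common compression heuristic, assumed
# here for illustration only.
import torch

class HybridKVCache:
    def __init__(self, num_heads, critical_heads, window=128, sink=4):
        self.critical = set(critical_heads)  # heads that keep full KV cache
        self.window = window                 # recency budget for other heads
        self.sink = sink                     # initial tokens always retained
        self.keys = [None] * num_heads       # per-head keys, shape (seq, head_dim)
        self.values = [None] * num_heads     # per-head values, same shape

    def append(self, head, k, v):
        """Append one decoding step's (k, v) for a head, then evict if over budget."""
        if self.keys[head] is None:
            self.keys[head], self.values[head] = k, v
        else:
            self.keys[head] = torch.cat([self.keys[head], k], dim=0)
            self.values[head] = torch.cat([self.values[head], v], dim=0)
        # Compressible heads hold at most sink + window entries; critical
        # heads are never evicted.
        if head not in self.critical and self.keys[head].shape[0] > self.sink + self.window:
            self.keys[head] = torch.cat(
                [self.keys[head][: self.sink], self.keys[head][-self.window:]], dim=0)
            self.values[head] = torch.cat(
                [self.values[head][: self.sink], self.values[head][-self.window:]], dim=0)
```

Under this sketch, decoding-time memory for a compressible head is bounded by sink + window entries regardless of generation length, which is where the 20-50% cache reduction would come from when only a small fraction of heads is kept at full size.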