Can LLMs Maintain Fundamental Abilities under KV Cache Compression?

Abstract

This paper investigates an under-explored challenge in large language models (LLMs): the impact of KV cache compression methods on LLMs' fundamental capabilities. While existing methods achieve impressive compression ratios on long-context benchmarks, their effects on core model capabilities remain understudied. We present a comprehensive empirical study evaluating prominent KV cache compression methods across diverse tasks, spanning world knowledge, commonsense reasoning, arithmetic reasoning, code generation, safety, and long-context understanding and generation. Our analysis reveals that the performance degradation caused by KV cache compression is task-specific. Arithmetic reasoning tasks prove particularly sensitive to aggressive compression, with performance drops of 17.4%-43.3% depending on the method. Notably, the DeepSeek R1 Distill model tolerates compression better than instruction-tuned models, showing only 9.67%-25.53% performance degradation. Based on our analysis of attention patterns and cross-task compression performance, we propose ShotKV, a novel compression approach that handles the prefill and decoding phases separately while maintaining shot-level semantic coherence. Empirical results show that ShotKV achieves 9%-18% performance improvements on long-context generation tasks under aggressive compression ratios.
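To make the idea of shot-level coherence concrete, the following is a minimal sketch of keeping or evicting whole few-shot examples from a prefill KV cache under a token budget. It is not the ShotKV algorithm itself, which the abstract does not specify: the function name `select_shots`, the per-token importance scores, the mean-score ranking rule, and the greedy fill are all assumptions made for illustration, and the decoding phase is not modeled.

```python
from typing import List, Sequence, Tuple


def select_shots(
    scores: Sequence[float],
    shots: Sequence[Tuple[int, int]],
    budget: int,
) -> List[int]:
    """Return the token positions to keep, evicting whole shots at a time.

    scores: hypothetical per-token importance (e.g. aggregated attention).
    shots:  non-empty half-open (start, end) token ranges, one per example.
    budget: maximum number of token positions to keep.
    """
    # Rank shots by mean token score (an assumed scoring rule).
    ranked = sorted(
        shots,
        key=lambda s: sum(scores[s[0]:s[1]]) / (s[1] - s[0]),
        reverse=True,
    )
    kept: List[int] = []
    for start, end in ranked:
        # A shot is kept entirely or dropped entirely, never split.
        if len(kept) + (end - start) <= budget:
            kept.extend(range(start, end))
    return sorted(kept)


# Three shots over eight tokens; a budget of six keeps the two best shots.
scores = [0.1, 0.1, 0.9, 0.8, 0.7, 0.2, 0.3, 0.2]
shots = [(0, 2), (2, 5), (5, 8)]
print(select_shots(scores, shots, budget=6))  # [2, 3, 4, 5, 6, 7]
```

Token-level eviction with the same scores could keep fragments of several shots; selecting at shot granularity trades some budget efficiency for intact examples.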
