ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference

Abstract

To reduce the memory cost of long-context inference with large language models (LLMs), many recent works focus on compressing the key-value (KV) cache of different tokens. However, we identify that previous KV cache compression methods measure token importance individually, neglecting the dependencies between tokens that characterize real-world language. In light of this, we introduce ChunkKV, which groups the tokens in a chunk as the basic compression unit, retaining the most informative semantic chunks while discarding the less important ones. Furthermore, observing that ChunkKV exhibits higher similarity in the preserved indices across different layers, we propose layer-wise index reuse to further reduce computational overhead. We evaluate ChunkKV on cutting-edge long-context benchmarks, including LongBench and Needle-In-A-HayStack, as well as the GSM8K and JailbreakV in-context learning benchmarks. Our experiments with instruction-tuned and multi-step reasoning (O1 and R1) LLMs achieve up to 10% performance improvement under aggressive compression ratios compared to existing methods.
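The two ideas in the abstract, chunk-level selection and layer-wise index reuse, can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say how chunks are scored, so the sketch assumes each token already has an importance score (for example, an aggregated attention weight) and that a chunk's score is the sum of its token scores. The function names `select_chunk_indices` and `compress_layers` and the parameters `chunk_size`, `keep_chunks`, and `reuse_depth` are hypothetical.

```python
from typing import List


def select_chunk_indices(
    token_scores: List[float], chunk_size: int, keep_chunks: int
) -> List[int]:
    """Return the sorted token indices kept after chunk-level selection.

    Assumption: token_scores holds one importance score per token, and a
    chunk's score is the sum of its token scores.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Split token positions into contiguous, fixed-size chunks.
    chunks = [
        list(range(start, min(start + chunk_size, len(token_scores))))
        for start in range(0, len(token_scores), chunk_size)
    ]
    chunk_scores = [sum(token_scores[i] for i in chunk) for chunk in chunks]
    # Rank chunks by score, highest first; ties go to the earlier chunk.
    ranked = sorted(range(len(chunks)), key=lambda c: (-chunk_scores[c], c))
    kept = sorted(ranked[:keep_chunks])
    # Whole chunks are kept or dropped, so neighbouring tokens stay together.
    return [i for c in kept for i in chunks[c]]


def compress_layers(
    per_layer_scores: List[List[float]],
    chunk_size: int,
    keep_chunks: int,
    reuse_depth: int,
) -> List[List[int]]:
    """Pick kept indices per layer, recomputing only every reuse_depth layers.

    Layers in between reuse the most recently computed indices, which is one
    simple reading of "layer-wise index reuse".
    """
    if reuse_depth <= 0:
        raise ValueError("reuse_depth must be positive")
    kept_per_layer: List[List[int]] = []
    current: List[int] = []
    for layer, scores in enumerate(per_layer_scores):
        if layer % reuse_depth == 0:
            current = select_chunk_indices(scores, chunk_size, keep_chunks)
        kept_per_layer.append(current)
    return kept_per_layer


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.8, 0.0, 0.0, 0.1, 0.5, 0.6]
    # Chunks of 2 tokens, keep the 2 best chunks.
    print(select_chunk_indices(scores, chunk_size=2, keep_chunks=2))
```

With these toy scores, selecting the four highest-scoring tokens individually would keep tokens 1, 2, 6, and 7, splitting two chunks. The chunk-level version keeps tokens 0, 1, 6, and 7, so each retained token arrives with its neighbour. In `compress_layers`, a larger `reuse_depth` means fewer selection passes, at the cost of using indices that were computed for an earlier layer.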
