KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs

Abstract

Large language models (LLMs) rely heavily on key-value (KV) caching to minimize inference latency. However, standard KV caches are context-dependent: reusing a cached document in a new context requires recomputing its KV states to account for shifts in the attention distribution. Existing solutions such as CacheBlend, EPIC, and SAM-KV mitigate this issue by selectively recomputing a subset of tokens, but they still incur non-negligible computational overhead (FLOPs) and increased time-to-first-token (TTFT) latency. In this paper, we propose KV Packet, a recomputation-free cache reuse framework that treats cached documents as immutable "packets" wrapped in lightweight trainable soft-token adapters. The adapters are trained via self-supervised distillation to bridge context discontinuities. Experiments on Llama-3.1 and Qwen2.5 demonstrate that KV Packet achieves near-zero FLOPs and lower TTFT than recomputation-based baselines while retaining F1 scores comparable to those of the full recomputation baseline.
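To make the context-dependence claim concrete, the following toy sketch may help. It is not the KV Packet method and not code from the paper. It is a minimal two-layer, single-head causal attention model in pure Python, with made-up random weights and no positional encoding (both are assumptions for illustration only). It compares the keys computed for the same document tokens with and without a preceding prefix.

```python
import math
import random

D = 4  # toy hidden size (illustrative assumption)

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matvec(x, W):
    # Row vector x (len = rows of W) times matrix W -> vector of len cols.
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def causal_attention(H, Wq, Wk, Wv):
    """Single-head causal self-attention with a residual connection.
    Returns (new hidden states, keys, values)."""
    Q = [matvec(h, Wq) for h in H]
    K = [matvec(h, Wk) for h in H]
    V = [matvec(h, Wv) for h in H]
    out = []
    for t in range(len(H)):
        # Token t may only attend to positions 0..t.
        scores = [sum(q * k for q, k in zip(Q[t], K[j])) / math.sqrt(D)
                  for j in range(t + 1)]
        top = max(scores)
        weights = [math.exp(sc - top) for sc in scores]
        total = sum(weights)
        ctx = [sum(weights[j] / total * V[j][c] for j in range(t + 1))
               for c in range(D)]
        out.append([h + c for h, c in zip(H[t], ctx)])
    return out, K, V

def keys_per_layer(tokens, params):
    (Wq1, Wk1, Wv1), (Wq2, Wk2, Wv2) = params
    H1, K1, _ = causal_attention(tokens, Wq1, Wk1, Wv1)
    _, K2, _ = causal_attention(H1, Wq2, Wk2, Wv2)
    return K1, K2

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

rng = random.Random(0)
params = [tuple(rand_matrix(rng, D, D) for _ in range(3)) for _ in range(2)]
prefix = rand_matrix(rng, 3, D)    # stand-in embeddings for a new context
document = rand_matrix(rng, 5, D)  # stand-in embeddings for a cached document

# Keys for the document computed in isolation (what a cache would store).
K1_alone, K2_alone = keys_per_layer(document, params)

# Keys for the same document tokens when a prefix precedes them.
K1_ctx, K2_ctx = keys_per_layer(prefix + document, params)
K1_ctx, K2_ctx = K1_ctx[len(prefix):], K2_ctx[len(prefix):]

layer1_gap = max_abs_diff(K1_alone, K1_ctx)
layer2_gap = max_abs_diff(K2_alone, K2_ctx)
print(f"layer-1 key gap: {layer1_gap:.3e}")
print(f"layer-2 key gap: {layer2_gap:.3e}")
```

In this toy, the first-layer keys depend only on the token embeddings, so they match exactly. The second-layer keys are computed from hidden states that have already attended to the prefix, so they differ from the cached ones. Real models typically also inject positional information (for example, rotary embeddings), which this sketch omits and which adds a further source of mismatch.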
