ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

Abstract

With the widespread deployment of long-context large language models (LLMs), there has been a growing demand for efficient support of high-throughput inference. However, as the key-value (KV) cache expands with the sequence length, the growing memory footprint and the need to access the cache for every generated token both lower throughput when serving long-context LLMs. While various dynamic sparse attention methods have been proposed to speed up inference while maintaining generation quality, they either fail to sufficiently reduce GPU memory consumption or introduce significant decoding latency by offloading the KV cache to the CPU. We present ShadowKV, a high-throughput long-context LLM inference system that stores the low-rank key cache and offloads the value cache to reduce the memory footprint for larger batch sizes and longer sequences. To minimize decoding latency, ShadowKV employs an accurate KV selection strategy that reconstructs minimal sparse KV pairs on the fly. By evaluating ShadowKV on a broad range of benchmarks, including RULER, LongBench, and Needle In A Haystack, and on models such as Llama-3.1-8B, Llama-3-8B-1M, GLM-4-9B-1M, Yi-9B-200K, Phi-3-Mini-128K, and Qwen2-7B-128K, we demonstrate that it can support up to 6× larger batch sizes and boost throughput by up to 3.04× on an A100 GPU without sacrificing accuracy, even surpassing the performance achievable with infinite batch size under the assumption of infinite GPU memory. The code is available at https://github.com/bytedance/ShadowKV.
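To make the three ideas in the abstract concrete (a low-rank key cache, an offloaded value cache, and on-the-fly reconstruction of a small set of KV pairs), here is a minimal NumPy sketch for a single attention head. It is an illustration under stated assumptions, not the ShadowKV implementation: the function names `compress_keys` and `sparse_attention_step` are hypothetical, the top-k selection by approximate score is a stand-in for the paper's KV selection strategy (which the abstract does not detail), and plain array indexing stands in for a CPU-to-GPU transfer of the selected values.

```python
import numpy as np


def compress_keys(K, rank):
    """Factor a key cache K (n, d) into A (n, rank) @ B (rank, d) via truncated SVD."""
    U, S, Vt = np.linalg.svd(K, full_matrices=False)
    A = U[:, :rank] * S[:rank]  # scale the left singular vectors by the singular values
    B = Vt[:rank]
    return A, B


def sparse_attention_step(q, A, B, V_offloaded, k):
    """One decoding step that attends to only k selected positions.

    q:            query vector, shape (d,)
    A, B:         low-rank key factors from compress_keys
    V_offloaded:  value cache, shape (n, d_v); assumed to live in CPU memory
    k:            number of KV pairs to reconstruct and attend to
    """
    # Score every position cheaply: (A @ B) @ q == A @ (B @ q).
    approx_scores = A @ (B @ q)

    # ASSUMPTION: keep the k highest-scoring positions (illustrative selection rule).
    idx = np.argpartition(approx_scores, -k)[-k:]

    # Reconstruct only the selected keys and fetch only the selected values.
    K_sel = A[idx] @ B
    V_sel = V_offloaded[idx]

    # Standard scaled dot-product attention over the selected pairs.
    scores = K_sel @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_sel
```

In this sketch, only the factors `A` and `B` need to stay in fast memory, and each step touches just `k` rows of the value cache, which is the memory and bandwidth trade-off the abstract describes. Choosing `k` equal to the sequence length with an exact factorization recovers ordinary dense attention.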
