SnapKV: LLM Knows What You are Looking for Before Generation

Abstract

Large Language Models (LLMs) have made remarkable progress in processing extensive contexts, with the Key-Value (KV) cache playing a vital role in enhancing their performance. However, the growth of the KV cache in response to increasing input length poses challenges to memory and time efficiency. To address this problem, this paper introduces SnapKV, an innovative and fine-tuning-free approach that efficiently minimizes KV cache size while still delivering comparable performance in real-world applications. We discover that each attention head in the model consistently focuses on specific prompt attention features during generation. Meanwhile, this robust pattern can be obtained from an "observation" window located at the end of the prompt. Drawing on this insight, SnapKV automatically compresses KV caches by selecting clustered important KV positions for each attention head. Our approach significantly reduces the growing computational overhead and memory footprint when processing long input sequences. Specifically, SnapKV achieves a consistent decoding speed with a 3.6x increase in generation speed and an 8.2x enhancement in memory efficiency compared to the baseline when processing inputs of 16K tokens. At the same time, it maintains comparable performance to baseline models across 16 long sequence datasets. Moreover, SnapKV can process up to 380K context tokens on a single A100-80GB GPU using the HuggingFace implementation with minor changes, exhibiting only a negligible accuracy drop in the Needle-in-a-Haystack test. Further comprehensive studies suggest SnapKV's potential for practical applications.
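The selection step described in the abstract can be illustrated with a minimal sketch for a single attention head. It is not the authors' implementation: the function names, the parameters `window`, `budget`, and `kernel`, and the use of 1D max pooling as the clustering mechanism are all assumptions made here for illustration. The sketch sums the attention that the observation-window queries place on each earlier prompt position, smooths those votes so that neighbours of a strongly attended position are also favoured, and keeps the top-scoring positions together with the observation window itself.

```python
from typing import List


def max_pool_1d(scores: List[float], kernel: int) -> List[float]:
    """Same-length 1D max pooling (stride 1, window clipped at the edges)."""
    half = kernel // 2
    n = len(scores)
    return [max(scores[max(0, i - half): min(n, i + half + 1)]) for i in range(n)]


def select_kv_positions(
    attn: List[List[float]],  # attn[q][k]: weight of window query q on prompt position k
    window: int,              # observation window length (hypothetical parameter)
    budget: int,              # total KV positions to keep (hypothetical parameter)
    kernel: int = 3,          # pooling width used for clustering (assumption)
) -> List[int]:
    """Return the sorted prompt positions whose KV entries are kept for one head."""
    prompt_len = len(attn[0])
    prefix_len = prompt_len - window

    # Vote: total attention each prefix position receives from the window queries.
    votes = [sum(row[k] for row in attn) for k in range(prefix_len)]

    # Cluster: pooling lifts the scores of neighbours of strongly attended positions.
    pooled = max_pool_1d(votes, kernel)

    # Keep the highest-scoring prefix positions; ties go to the earlier position.
    keep_n = max(0, min(budget - window, prefix_len))
    top = sorted(range(prefix_len), key=lambda k: pooled[k], reverse=True)[:keep_n]

    # The observation window itself is always retained.
    return sorted(top) + list(range(prefix_len, prompt_len))


# Toy example: 8 prompt tokens, the last 2 form the observation window,
# and both window queries attend mostly to position 3.
attn = [[0.01, 0.01, 0.01, 0.9, 0.01, 0.01, 0.03, 0.02]] * 2
print(select_kv_positions(attn, window=2, budget=5))  # [2, 3, 4, 6, 7]
```

In a real model this selection would run independently for every head and layer, and the kept indices would be used to gather the corresponding key and value tensors before decoding begins.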
