Finch: Prompt-guided Key-Value Cache Compression

Abstract

Recent large language model applications, such as Retrieval-Augmented Generation and chatbots, have increased the need to process longer input contexts. However, this requirement is hampered by inherent limitations. Architecturally, models are constrained by a context window defined during training. Additionally, processing extensive texts requires substantial GPU memory. We propose Finch, a novel approach that compresses the input context by leveraging the pre-trained weights of the self-attention mechanism. Given a prompt and a long text, Finch iteratively identifies the most relevant Key (K) and Value (V) pairs over chunks of the text, conditioned on the prompt. Only these pairs are stored in the KV cache, which, within the space constrained by the context window, ultimately contains a compressed version of the long text. Our proposal enables models to consume large inputs even at high compression ratios (up to 93x) while preserving semantic integrity, without the need for fine-tuning.
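
The chunk-wise selection described in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the function names, the single attention head, the relevance score (attention mass summed over prompt tokens), the fixed per-chunk budget, and the use of precomputed keys and values are all assumptions made for illustration. In a real model the keys and values would come from the transformer's own layers.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def select_relevant_kv(prompt_q, chunk_k, chunk_v, budget):
    """Keep the `budget` KV pairs of one chunk that the prompt attends to most."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    budget = min(budget, len(chunk_k))
    d = chunk_k.shape[-1]
    # Attention of every prompt query over the chunk's keys: (n_prompt, n_chunk).
    attn = softmax(prompt_q @ chunk_k.T / np.sqrt(d), axis=-1)
    # Assumed relevance score: attention mass summed over prompt tokens.
    relevance = attn.sum(axis=0)
    # Top-`budget` positions, re-sorted so the kept pairs stay in text order.
    keep = np.sort(np.argsort(relevance)[-budget:])
    return chunk_k[keep], chunk_v[keep]


def compress_kv(prompt_q, doc_k, doc_v, chunk_size, budget_per_chunk):
    """Walk the text chunk by chunk and cache only the selected KV pairs."""
    kept_k, kept_v = [], []
    for start in range(0, len(doc_k), chunk_size):
        k, v = select_relevant_kv(
            prompt_q,
            doc_k[start:start + chunk_size],
            doc_v[start:start + chunk_size],
            budget_per_chunk,
        )
        kept_k.append(k)
        kept_v.append(v)
    return np.concatenate(kept_k), np.concatenate(kept_v)


# Hypothetical toy data: 4 prompt tokens, 64 text tokens, head dimension 8.
rng = np.random.default_rng(0)
prompt_q = rng.normal(size=(4, 8))
doc_k = rng.normal(size=(64, 8))
doc_v = rng.normal(size=(64, 8))
cache_k, cache_v = compress_kv(prompt_q, doc_k, doc_v, chunk_size=16, budget_per_chunk=4)
print(cache_k.shape, len(doc_k) / len(cache_k))  # (16, 8) 4.0
```

In this toy setting, keeping 4 of every 16 pairs gives a 4x compression ratio. The ratio is controlled entirely by the assumed `chunk_size` and `budget_per_chunk` parameters, so a smaller budget relative to the chunk size yields a higher ratio.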
