Finch: Prompt-guided Key-Value Cache Compression
July 31, 2024
Authors: Giulio Corallo, Paolo Papotti
cs.AI
Abstract
Recent large language model applications, such as Retrieval-Augmented
Generation and chatbots, have led to an increased need to process longer input
contexts. However, this requirement is hampered by inherent limitations.
Architecturally, models are constrained by a context window defined during
training. Additionally, processing extensive texts requires substantial GPU
memory. We propose a novel approach, Finch, to compress the input context by
leveraging the pre-trained model weights of the self-attention. Given a prompt
and a long text, Finch iteratively identifies the most relevant Key (K) and
Value (V) pairs over chunks of the text conditioned on the prompt. Only such
pairs are stored in the KV cache, which, within the space constrained by the
context window, ultimately contains a compressed version of the long text. Our
proposal enables models to consume large inputs even with high compression (up
to 93x) while preserving semantic integrity without the need for fine-tuning.
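To make the idea concrete, below is a minimal sketch (not the authors' implementation) of prompt-conditioned KV selection over chunks: for each chunk of the long text, attention scores from the prompt's queries are used to rank the chunk's key-value pairs, and only the top-scoring pairs are kept in the cache. The function name `select_kv_for_chunk`, the `budget` parameter, and the random tensors standing in for a model's queries, keys, and values are all hypothetical placeholders for illustration.

```python
import torch

def select_kv_for_chunk(prompt_q, chunk_k, chunk_v, budget):
    """Keep only the `budget` KV pairs of a chunk that receive the most
    attention from the prompt queries (averaged over heads and prompt tokens)."""
    # prompt_q: (heads, prompt_len, d); chunk_k, chunk_v: (heads, chunk_len, d)
    d = prompt_q.shape[-1]
    # Attention scores of prompt tokens over chunk positions.
    scores = torch.einsum("hqd,hkd->hqk", prompt_q, chunk_k) / d**0.5
    attn = scores.softmax(dim=-1)
    # Importance of each chunk position: mean attention it receives.
    importance = attn.mean(dim=(0, 1))  # (chunk_len,)
    keep = importance.topk(min(budget, importance.numel())).indices.sort().values
    return chunk_k[:, keep], chunk_v[:, keep]

# Toy usage: compress a "long text" of 4 chunks of 256 positions each
# down to 8 cached KV pairs per chunk.
heads, d = 4, 64
prompt_q = torch.randn(heads, 16, d)
cache_k, cache_v = [], []
for _ in range(4):
    ck, cv = torch.randn(heads, 256, d), torch.randn(heads, 256, d)
    k_sel, v_sel = select_kv_for_chunk(prompt_q, ck, cv, budget=8)
    cache_k.append(k_sel)
    cache_v.append(v_sel)
compressed_k = torch.cat(cache_k, dim=1)
compressed_v = torch.cat(cache_v, dim=1)
print(compressed_k.shape)  # torch.Size([4, 32, 64]): 32 cached positions instead of 1024
```

In this toy setting the cache shrinks from 1024 to 32 positions (32x); the paper reports that the full method reaches up to 93x compression while preserving semantic integrity, without fine-tuning.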