Finch: Prompt-guided Key-Value Cache Compression
July 31, 2024
Authors: Giulio Corallo, Paolo Papotti
cs.AI
Abstract
Recent large language model applications, such as Retrieval-Augmented
Generation and chatbots, have led to an increased need to process longer input
contexts. However, this requirement is hampered by inherent limitations.
Architecturally, models are constrained by a context window defined during
training. Additionally, processing extensive texts requires substantial GPU
memory. We propose a novel approach, Finch, to compress the input context by
leveraging the pre-trained model weights of the self-attention. Given a prompt
and a long text, Finch iteratively identifies the most relevant Key (K) and
Value (V) pairs over chunks of the text conditioned on the prompt. Only such
pairs are stored in the KV cache, which, within the space constrained by the
context window, ultimately contains a compressed version of the long text. Our
proposal enables models to consume large inputs even with high compression (up
to 93x) while preserving semantic integrity without the need for fine-tuning.
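To make the idea concrete, below is a minimal sketch (not the authors' implementation) of prompt-conditioned KV selection over chunks: for each chunk of the long text, attention scores from the prompt's queries are used to rank the chunk's key-value pairs, and only the top-scoring pairs are kept in the cache. The function name `select_kv_for_chunk`, the `budget` parameter, and the random tensors standing in for a model's queries, keys, and values are all hypothetical placeholders for illustration.

```python
import torch

def select_kv_for_chunk(prompt_q, chunk_k, chunk_v, budget):
    """Keep only the `budget` KV pairs of a chunk that receive the most
    attention from the prompt queries (averaged over heads and prompt tokens)."""
    # prompt_q: (heads, prompt_len, d); chunk_k, chunk_v: (heads, chunk_len, d)
    d = prompt_q.shape[-1]
    # Attention scores of prompt tokens over chunk positions.
    scores = torch.einsum("hqd,hkd->hqk", prompt_q, chunk_k) / d**0.5
    attn = scores.softmax(dim=-1)
    # Importance of each chunk position: mean attention it receives.
    importance = attn.mean(dim=(0, 1))  # (chunk_len,)
    keep = importance.topk(min(budget, importance.numel())).indices.sort().values
    return chunk_k[:, keep], chunk_v[:, keep]

# Toy usage: compress a "long text" of 4 chunks of 256 positions each
# down to 8 cached KV pairs per chunk.
heads, d = 4, 64
prompt_q = torch.randn(heads, 16, d)
cache_k, cache_v = [], []
for _ in range(4):
    ck, cv = torch.randn(heads, 256, d), torch.randn(heads, 256, d)
    k_sel, v_sel = select_kv_for_chunk(prompt_q, ck, cv, budget=8)
    cache_k.append(k_sel)
    cache_v.append(v_sel)
compressed_k = torch.cat(cache_k, dim=1)
compressed_v = torch.cat(cache_v, dim=1)
print(compressed_k.shape)  # torch.Size([4, 32, 64]): 32 cached positions instead of 1024
```

In this toy setting the cache shrinks from 1024 to 32 positions (32x); the paper reports that the full method reaches up to 93x compression while preserving semantic integrity, without fine-tuning.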