
SnapKV: LLM Knows What You are Looking for Before Generation

April 22, 2024
作者: Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen
cs.AI

Abstract

Large Language Models (LLMs) have made remarkable progress in processing extensive contexts, with the Key-Value (KV) cache playing a vital role in enhancing their performance. However, the growth of the KV cache in response to increasing input length poses challenges to memory and time efficiency. To address this problem, this paper introduces SnapKV, an innovative and fine-tuning-free approach that efficiently minimizes KV cache size while still delivering comparable performance in real-world applications. We discover that each attention head in the model consistently focuses on specific prompt attention features during generation. Meanwhile, this robust pattern can be obtained from an "observation" window located at the end of the prompts. Drawing on this insight, SnapKV automatically compresses KV caches by selecting clustered important KV positions for each attention head. Our approach significantly reduces the growing computational overhead and memory footprint when processing long input sequences. Specifically, SnapKV achieves a consistent decoding speed with a 3.6x increase in generation speed and an 8.2x enhancement in memory efficiency compared to the baseline when processing inputs of 16K tokens. At the same time, it maintains comparable performance to baseline models across 16 long-sequence datasets. Moreover, SnapKV can process up to 380K context tokens on a single A100-80GB GPU using the HuggingFace implementation with minor changes, exhibiting only a negligible accuracy drop in the Needle-in-a-Haystack test. Further comprehensive studies suggest SnapKV's potential for practical applications.
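
The abstract outlines the mechanism: attention weights from an "observation" window at the end of the prompt score the importance of earlier KV positions, neighboring positions are clustered by pooling, and each attention head keeps only its top-scoring entries plus the window itself. The sketch below illustrates that idea in PyTorch. It is a minimal reconstruction from the abstract's description, not the authors' reference code; the function name `snapkv_compress`, the parameter names, and the default values are all illustrative assumptions, and it assumes the prefill attention probabilities for the observation window are available.

```python
import torch
import torch.nn.functional as F

def snapkv_compress(keys, values, attn_weights,
                    window_size=32, kernel_size=7, max_capacity=1024):
    """Illustrative SnapKV-style KV cache compression (not the reference code).

    keys, values:  [batch, heads, seq_len, head_dim] prefill KV cache.
    attn_weights:  [batch, heads, window_size, seq_len] attention probabilities
                   of the last `window_size` prompt tokens (the "observation"
                   window) over the whole prompt.
    Returns compressed keys/values of length at most `max_capacity` per head.
    """
    batch, heads, seq_len, head_dim = keys.shape
    if seq_len <= max_capacity:
        return keys, values  # short prompt: nothing to compress

    prefix_len = seq_len - window_size

    # Per head, how much total attention each prefix position receives
    # from the observation window: these are the importance "votes".
    votes = attn_weights[..., :prefix_len].sum(dim=2)  # [B, H, prefix_len]

    # Pooling smears votes over neighbors, so selected positions form
    # clusters rather than isolated tokens (the "clustered" selection).
    votes = F.avg_pool1d(votes.flatten(0, 1).unsqueeze(1),
                         kernel_size, stride=1, padding=kernel_size // 2)
    votes = votes.squeeze(1).view(batch, heads, prefix_len)

    # Keep the top-scoring prefix positions per head, in original order.
    k = max_capacity - window_size
    topk = votes.topk(k, dim=-1).indices.sort(dim=-1).values   # [B, H, k]
    idx = topk.unsqueeze(-1).expand(-1, -1, -1, head_dim)       # [B, H, k, D]

    # Selected prefix entries plus the observation window are retained.
    k_sel = torch.cat([keys.gather(2, idx), keys[:, :, prefix_len:]], dim=2)
    v_sel = torch.cat([values.gather(2, idx), values[:, :, prefix_len:]], dim=2)
    return k_sel, v_sel
```

Because the selection runs once per head after prefill and is fine-tuning-free, the decode loop is unchanged; it simply attends over the smaller cache, which is what yields the reported memory and generation-speed gains on long inputs.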
