ThinK: Thinner Key Cache by Query-Driven Pruning
July 30, 2024
Authors: Yuhui Xu, Zhanming Jie, Hanze Dong, Lei Wang, Xudong Lu, Aojun Zhou, Amrita Saha, Caiming Xiong, Doyen Sahoo
cs.AI
Abstract
Large Language Models (LLMs) have revolutionized the field of natural
language processing, achieving unprecedented performance across a variety of
applications by leveraging increased model sizes and sequence lengths. However,
the associated rise in computational and memory costs poses significant
challenges, particularly in managing long sequences due to the quadratic
complexity of the transformer attention mechanism. This paper focuses on the
long-context scenario, addressing the inefficiencies in KV cache memory
consumption during inference. Unlike existing approaches that optimize the
memory based on the sequence lengths, we uncover that the channel dimension of
the KV cache exhibits significant redundancy, characterized by unbalanced
magnitude distribution and low-rank structure in attention weights. Based on
these observations, we propose ThinK, a novel query-dependent KV cache pruning
method designed to minimize attention weight loss while selectively pruning the
least significant channels. Our approach not only maintains or enhances model
accuracy but also achieves a reduction in memory costs by over 20% compared
with vanilla KV cache eviction methods. Extensive evaluations on the LLaMA3 and
Mistral models across various long-sequence datasets confirm the efficacy of
ThinK, setting a new precedent for efficient LLM deployment without
compromising performance. We also outline the potential of extending our method
to value cache pruning, demonstrating ThinK's versatility and broad
applicability in reducing both memory and computational overheads.
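To make the idea of query-dependent channel pruning concrete, here is a minimal sketch of pruning the key cache along its channel (head-dimension) axis. It is not the paper's implementation: the function name, the keep ratio, and the scoring rule (ranking channels by the product of query and key magnitudes as a stand-in for "minimizing attention-weight loss") are illustrative assumptions.

```python
# Minimal, hypothetical sketch of query-driven key-cache channel pruning.
# Not the official ThinK code; the scoring criterion and keep_ratio are assumptions.
import torch


def prune_key_channels(keys: torch.Tensor,
                       queries: torch.Tensor,
                       keep_ratio: float = 0.6):
    """Prune the least significant key-cache channels for one attention head.

    keys:    (seq_len, head_dim)     cached key vectors
    queries: (num_queries, head_dim) recent query vectors used for scoring
    Returns the pruned keys, shape (seq_len, kept_dim), plus the kept channel
    indices so the matching query channels can be selected at attention time.
    """
    head_dim = keys.shape[-1]
    num_keep = max(1, int(head_dim * keep_ratio))

    # Score each channel by its contribution to query-key dot products:
    # mean query magnitude times key norm along that channel (an assumed
    # greedy proxy for attention-weight loss).
    channel_scores = queries.abs().mean(dim=0) * keys.norm(dim=0)

    # Keep the highest-scoring channels; drop the rest from the cache.
    kept = torch.topk(channel_scores, num_keep).indices.sort().values
    return keys[:, kept], kept
```

At attention time, the query is indexed with the same kept channels before the dot product (e.g. `q[..., kept] @ pruned_keys.T`), so the key cache shrinks by roughly `1 - keep_ratio` while the remaining channels preserve most of the attention weights.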