ThinK: Thinner Key Cache by Query-Driven Pruning
July 30, 2024
Authors: Yuhui Xu, Zhanming Jie, Hanze Dong, Lei Wang, Xudong Lu, Aojun Zhou, Amrita Saha, Caiming Xiong, Doyen Sahoo
cs.AI
Abstract
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications by leveraging increased model sizes and sequence lengths. However, the associated rise in computational and memory costs poses significant challenges, particularly in managing long sequences, due to the quadratic complexity of the transformer attention mechanism. This paper focuses on the long-context scenario, addressing the inefficiencies of KV cache memory consumption during inference. Unlike existing approaches that optimize memory usage based on sequence length, we uncover that the channel dimension of the KV cache exhibits significant redundancy, characterized by an unbalanced magnitude distribution and a low-rank structure in the attention weights. Based on these observations, we propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels. Our approach not only maintains or enhances model accuracy but also reduces memory costs by over 20% compared with vanilla KV cache eviction methods. Extensive evaluations of the LLaMA3 and Mistral models across various long-sequence datasets confirm the efficacy of ThinK, setting a new precedent for efficient LLM deployment without compromising performance. We also outline the potential of extending our method to value cache pruning, demonstrating ThinK's versatility and broad applicability in reducing both memory and computational overheads.
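The abstract describes pruning the key cache along its channel dimension, guided by the current queries, so that the channels contributing least to the attention logits are dropped. The PyTorch snippet below is a minimal illustrative sketch of that idea, not the authors' implementation: the function name prune_key_channels, the magnitude-based scoring rule, the keep_ratio value, and all tensor shapes are assumptions made for demonstration.

import torch

def prune_key_channels(keys, queries, keep_ratio=0.6):
    # keys:    (batch, heads, seq_len, head_dim)  cached keys
    # queries: (batch, heads, q_len, head_dim)    recent queries used for scoring
    head_dim = keys.shape[-1]
    kept_dim = max(1, int(head_dim * keep_ratio))
    # Score each channel by the magnitude of its query-key interaction:
    # channels whose per-channel products contribute little to Q K^T change
    # the attention logits least when removed.  (Simplified scoring rule.)
    scores = torch.einsum("bhqd,bhkd->bhd", queries.abs(), keys.abs())
    # Keep the highest-scoring channels per head; drop the rest.
    kept_idx = scores.topk(kept_dim, dim=-1).indices
    gather_idx = kept_idx.unsqueeze(2).expand(-1, -1, keys.shape[2], -1)
    pruned_keys = keys.gather(-1, gather_idx)
    return pruned_keys, kept_idx

# Toy usage: attention logits are computed over the kept channels only,
# so the matching query channels are gathered with the same indices.
B, H, S, D = 1, 8, 1024, 128
k, q = torch.randn(B, H, S, D), torch.randn(B, H, 4, D)
k_small, idx = prune_key_channels(k, q, keep_ratio=0.6)
q_small = q.gather(-1, idx.unsqueeze(2).expand(-1, -1, q.shape[2], -1))
logits = torch.einsum("bhqd,bhkd->bhqk", q_small, k_small)

In this sketch the key-cache memory shrinks in proportion to keep_ratio; the paper's reported savings and its exact channel-importance criterion may differ from this simplified version.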