A Simple and Effective L_2 Norm-Based Strategy for KV Cache Compression
June 17, 2024
Authors: Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini
cs.AI
Abstract
The deployment of large language models (LLMs) is often hindered by the
extensive memory requirements of the Key-Value (KV) cache, especially as
context lengths increase. Existing approaches to reduce the KV cache size
involve either fine-tuning the model to learn a compression strategy or
leveraging attention scores to reduce the sequence length. We analyse the
attention distributions in decoder-only Transformer-based models and observe
that attention allocation patterns stay consistent across most layers.
Surprisingly, we find a clear correlation between the L_2 norm of the key
embeddings and the attention scores over cached KV pairs, where a low L_2 norm
of a key embedding usually leads
to a high attention score during decoding. This finding indicates that the
influence of a KV pair is potentially determined by the key embedding itself
before being queried. Based on this observation, we compress the KV cache based
on the L_2 norm of key embeddings. Our experimental results show that this simple
strategy can reduce the KV cache size by 50% on language modelling and
needle-in-a-haystack tasks and 90% on passkey retrieval tasks without losing
accuracy.
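The abstract describes pruning the KV cache by keeping only the cached pairs whose key embeddings have the lowest L_2 norm, since those tend to receive the highest attention. Below is a minimal PyTorch sketch of that idea, assuming a standard (batch, heads, seq_len, head_dim) cache layout; the function name compress_kv_cache, the keep_ratio parameter, and the per-layer usage are illustrative assumptions, not the authors' implementation.

```python
import torch


def compress_kv_cache(keys: torch.Tensor,
                      values: torch.Tensor,
                      keep_ratio: float = 0.5):
    """Keep only the KV pairs whose key embeddings have the lowest L2 norm.

    keys, values: (batch, num_heads, seq_len, head_dim)
    keep_ratio:   fraction of sequence positions to retain (assumed parameter).
    """
    batch, num_heads, seq_len, head_dim = keys.shape
    n_keep = max(1, int(seq_len * keep_ratio))

    # L2 norm of each cached key embedding: (batch, num_heads, seq_len)
    key_norms = keys.norm(p=2, dim=-1)

    # Indices of the n_keep keys with the *lowest* norm
    # (low key norm correlates with high attention per the abstract).
    _, keep_idx = torch.topk(key_norms, k=n_keep, dim=-1, largest=False)
    keep_idx, _ = keep_idx.sort(dim=-1)  # keep retained tokens in original order

    # Gather the retained keys and values along the sequence dimension.
    idx = keep_idx.unsqueeze(-1).expand(-1, -1, -1, head_dim)
    compressed_keys = torch.gather(keys, dim=2, index=idx)
    compressed_values = torch.gather(values, dim=2, index=idx)
    return compressed_keys, compressed_values, keep_idx


if __name__ == "__main__":
    # Toy usage: halve the cache for a single layer.
    k = torch.randn(1, 8, 1024, 64)
    v = torch.randn(1, 8, 1024, 64)
    ck, cv, kept = compress_kv_cache(k, v, keep_ratio=0.5)
    print(ck.shape, cv.shape)  # torch.Size([1, 8, 512, 64]) for both
```

In an actual decoder this kind of pruning would presumably be applied per layer (and possibly per head) before the cache is reused for subsequent decoding steps; sorting the selected indices keeps the surviving tokens in their original sequence order.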