

Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache

June 13, 2025
Authors: Xiaoran Liu, Siyang He, Qiqi Wang, Ruixiao Li, Yuerong Song, Zhigeng Liu, Linlin Li, Qun Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu
cs.AI

Abstract

Large Language Models struggle with memory demands from the growing Key-Value (KV) cache as context lengths increase. Existing compression methods homogenize head dimensions or rely on attention-guided token pruning, often sacrificing accuracy or introducing computational overhead. We propose FourierAttention, a training-free framework that exploits the heterogeneous roles of transformer head dimensions: lower dimensions prioritize local context, while upper ones capture long-range dependencies. By projecting the long-context-insensitive dimensions onto orthogonal Fourier bases, FourierAttention approximates their temporal evolution with fixed-length spectral coefficients. Evaluations on LLaMA models show that FourierAttention achieves the best long-context accuracy on LongBench and Needle-In-A-Haystack (NIAH). In addition, we design a custom Triton kernel, FlashFourierAttention, that optimizes memory via streamlined read-write operations, enabling efficient deployment without compromising performance.
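
To make the core idea concrete, below is a minimal sketch (not the paper's released code) of approximating the temporal evolution of one KV-cache dimension with a fixed number of Fourier coefficients: the trajectory of a single head dimension across cached positions is projected onto a truncated orthogonal Fourier basis and later reconstructed. The helper names `compress_kv_dim` and `reconstruct_kv_dim`, and the coefficient count `n_coeff`, are illustrative assumptions; the paper's actual criterion for selecting which dimensions to compress is not reproduced here.

```python
# Illustrative sketch of fixed-length Fourier approximation of a KV-cache
# dimension's trajectory over cached positions. Assumes PyTorch; names and
# parameters are hypothetical, not the authors' implementation.
import torch

def compress_kv_dim(series: torch.Tensor, n_coeff: int) -> torch.Tensor:
    """Project a length-T time series of one head dimension onto a truncated
    Fourier basis, keeping only the first n_coeff spectral coefficients."""
    spectrum = torch.fft.rfft(series)   # (T // 2 + 1,) complex coefficients
    return spectrum[:n_coeff]           # fixed-length representation

def reconstruct_kv_dim(coeffs: torch.Tensor, seq_len: int) -> torch.Tensor:
    """Invert the truncated spectrum back to an approximate time series."""
    full = torch.zeros(seq_len // 2 + 1, dtype=coeffs.dtype)
    full[: coeffs.shape[0]] = coeffs
    return torch.fft.irfft(full, n=seq_len)

# Toy usage: compress one "long-context-insensitive" dimension of a cache.
T, n_coeff = 1024, 32
kv_series = torch.cumsum(torch.randn(T), dim=0) * 0.01  # smooth-ish signal
approx = reconstruct_kv_dim(compress_kv_dim(kv_series, n_coeff), T)
print("relative error:", ((approx - kv_series).norm() / kv_series.norm()).item())
```

In this toy setting, T cached values per dimension are replaced by n_coeff complex coefficients, so memory shrinks whenever 2 * n_coeff is well below T; the approximation is accurate only for dimensions whose trajectories are spectrally concentrated, which is the heterogeneity the paper exploits.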