

Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache

June 13, 2025
Authors: Xiaoran Liu, Siyang He, Qiqi Wang, Ruixiao Li, Yuerong Song, Zhigeng Liu, Linlin Li, Qun Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu
cs.AI

Abstract

Large Language Models struggle with memory demands from the growing Key-Value (KV) cache as context lengths increase. Existing compression methods homogenize head dimensions or rely on attention-guided token pruning, often sacrificing accuracy or introducing computational overhead. We propose FourierAttention, a training-free framework that exploits the heterogeneous roles of transformer head dimensions: lower dimensions prioritize local context, while upper ones capture long-range dependencies. By projecting the long-context-insensitive dimensions onto orthogonal Fourier bases, FourierAttention approximates their temporal evolution with fixed-length spectral coefficients. Evaluations on LLaMA models show that FourierAttention achieves the best long-context accuracy on LongBench and Needle-In-A-Haystack (NIAH). In addition, we design a custom Triton kernel, FlashFourierAttention, that optimizes memory via streamlined read-write operations, enabling efficient deployment without compromising performance.
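
To make the core idea concrete, below is a minimal sketch (not the paper's released code) of approximating the temporal evolution of one KV-cache dimension with a fixed number of Fourier coefficients: the trajectory of a single head dimension across cached positions is projected onto a truncated orthogonal Fourier basis and later reconstructed. The helper names `compress_kv_dim` and `reconstruct_kv_dim`, and the coefficient count `n_coeff`, are illustrative assumptions; the paper's actual criterion for selecting which dimensions to compress is not reproduced here.

```python
# Illustrative sketch of fixed-length Fourier approximation of a KV-cache
# dimension's trajectory over cached positions. Assumes PyTorch; names and
# parameters are hypothetical, not the authors' implementation.
import torch

def compress_kv_dim(series: torch.Tensor, n_coeff: int) -> torch.Tensor:
    """Project a length-T time series of one head dimension onto a truncated
    Fourier basis, keeping only the first n_coeff spectral coefficients."""
    spectrum = torch.fft.rfft(series)   # (T // 2 + 1,) complex coefficients
    return spectrum[:n_coeff]           # fixed-length representation

def reconstruct_kv_dim(coeffs: torch.Tensor, seq_len: int) -> torch.Tensor:
    """Invert the truncated spectrum back to an approximate time series."""
    full = torch.zeros(seq_len // 2 + 1, dtype=coeffs.dtype)
    full[: coeffs.shape[0]] = coeffs
    return torch.fft.irfft(full, n=seq_len)

# Toy usage: compress one "long-context-insensitive" dimension of a cache.
T, n_coeff = 1024, 32
kv_series = torch.cumsum(torch.randn(T), dim=0) * 0.01  # smooth-ish signal
approx = reconstruct_kv_dim(compress_kv_dim(kv_series, n_coeff), T)
print("relative error:", ((approx - kv_series).norm() / kv_series.norm()).item())
```

In this toy setting, T cached values per dimension are replaced by n_coeff complex coefficients, so memory shrinks whenever 2 * n_coeff is well below T; the approximation is accurate only for dimensions whose trajectories are spectrally concentrated, which is the heterogeneity the paper exploits.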