Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache
June 13, 2025
Authors: Xiaoran Liu, Siyang He, Qiqi Wang, Ruixiao Li, Yuerong Song, Zhigeng Liu, Linlin Li, Qun Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu
cs.AI
Abstract
Large Language Models struggle with memory demands from the growing Key-Value
(KV) cache as context lengths increase. Existing compression methods homogenize
head dimensions or rely on attention-guided token pruning, often sacrificing
accuracy or introducing computational overhead. We propose FourierAttention, a
training-free framework that exploits the heterogeneous roles of transformer
head dimensions: lower dimensions prioritize local context, while upper ones
capture long-range dependencies. By projecting the long-context-insensitive
dimensions onto orthogonal Fourier bases, FourierAttention approximates their
temporal evolution with fixed-length spectral coefficients. Evaluations on
LLaMA models show that FourierAttention achieves the best long-context accuracy
on LongBench and Needle-In-A-Haystack (NIAH). In addition, we design a custom
Triton kernel, FlashFourierAttention, which optimizes memory via streamlined
read-write operations, enabling efficient deployment without compromising
performance.
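
The core mechanism, approximating the temporal evolution of selected KV-cache dimensions with a fixed number of spectral coefficients, can be illustrated with a minimal sketch. The code below is our own toy NumPy illustration, not the released FourierAttention implementation: the cosine basis, the function names, and the coefficient count k are assumptions chosen for exposition. It projects a (seq_len, dim) slice of the cache onto the first k columns of an orthonormal cosine (Fourier-type) basis, keeps only the k coefficients per dimension, and reconstructs an approximation on demand.

# Illustrative sketch only (assumed setup, not the paper's code): compress the
# per-dimension time series of a KV-cache slice by projecting it onto the first
# k vectors of an orthonormal cosine basis, keeping k spectral coefficients.
import numpy as np

def fourier_basis(seq_len: int, k: int) -> np.ndarray:
    """Return a (seq_len, k) matrix whose columns are orthonormal cosine basis vectors."""
    t = np.arange(seq_len)
    basis = np.empty((seq_len, k))
    for j in range(k):
        col = np.cos(np.pi * (t + 0.5) * j / seq_len)   # DCT-II-style basis vector
        basis[:, j] = col / np.linalg.norm(col)          # normalize -> orthonormal columns
    return basis

def compress_kv(kv: np.ndarray, k: int) -> np.ndarray:
    """kv: (seq_len, dim) slice for the long-context-insensitive dimensions.
    Returns (k, dim) spectral coefficients, fixed-length with respect to seq_len."""
    B = fourier_basis(kv.shape[0], k)
    return B.T @ kv                                      # project each dimension onto the basis

def reconstruct_kv(coeffs: np.ndarray, seq_len: int) -> np.ndarray:
    """Approximate the original (seq_len, dim) slice from its k coefficients."""
    B = fourier_basis(seq_len, coeffs.shape[0])
    return B @ coeffs

# Toy usage: a smooth 4096-token trace per dimension compressed to 64 coefficients.
rng = np.random.default_rng(0)
kv = np.cumsum(rng.standard_normal((4096, 8)), axis=0) / 50.0
coeffs = compress_kv(kv, k=64)
approx = reconstruct_kv(coeffs, seq_len=4096)
print("relative reconstruction error:", np.linalg.norm(approx - kv) / np.linalg.norm(kv))

In a setup like this, the coefficients would presumably be computed once at prefill and the basis regenerated or cached at decode time, so the stored state per compressed dimension is k values regardless of context length; the dimensions that carry long-range dependencies would be kept in the cache uncompressed.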