Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache
June 13, 2025
Authors: Xiaoran Liu, Siyang He, Qiqi Wang, Ruixiao Li, Yuerong Song, Zhigeng Liu, Linlin Li, Qun Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu
cs.AI
Abstract
Large Language Models struggle with memory demands from the growing Key-Value
(KV) cache as context lengths increase. Existing compression methods homogenize
head dimensions or rely on attention-guided token pruning, often sacrificing
accuracy or introducing computational overhead. We propose FourierAttention, a
training-free framework that exploits the heterogeneous roles of transformer
head dimensions: lower dimensions prioritize local context, while upper ones
capture long-range dependencies. By projecting the long-context-insensitive
dimensions onto orthogonal Fourier bases, FourierAttention approximates their
temporal evolution with fixed-length spectral coefficients. Evaluations on
LLaMA models show that FourierAttention achieves the best long-context accuracy
on LongBench and Needle-In-A-Haystack (NIAH). In addition, we design a custom
Triton kernel, FlashFourierAttention, which optimizes memory via streamlined
read-write operations, enabling efficient deployment without compromising
performance.
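
The core mechanism, approximating the temporal evolution of selected KV-cache dimensions with a fixed number of spectral coefficients, can be illustrated with a minimal sketch. The code below is our own toy NumPy illustration, not the released FourierAttention implementation: the cosine basis, the function names, and the coefficient count k are assumptions chosen for exposition. It projects a (seq_len, dim) slice of the cache onto the first k columns of an orthonormal cosine (Fourier-type) basis, keeps only the k coefficients per dimension, and reconstructs an approximation on demand.

# Illustrative sketch only (assumed setup, not the paper's code): compress the
# per-dimension time series of a KV-cache slice by projecting it onto the first
# k vectors of an orthonormal cosine basis, keeping k spectral coefficients.
import numpy as np

def fourier_basis(seq_len: int, k: int) -> np.ndarray:
    """Return a (seq_len, k) matrix whose columns are orthonormal cosine basis vectors."""
    t = np.arange(seq_len)
    basis = np.empty((seq_len, k))
    for j in range(k):
        col = np.cos(np.pi * (t + 0.5) * j / seq_len)   # DCT-II-style basis vector
        basis[:, j] = col / np.linalg.norm(col)          # normalize -> orthonormal columns
    return basis

def compress_kv(kv: np.ndarray, k: int) -> np.ndarray:
    """kv: (seq_len, dim) slice for the long-context-insensitive dimensions.
    Returns (k, dim) spectral coefficients, fixed-length with respect to seq_len."""
    B = fourier_basis(kv.shape[0], k)
    return B.T @ kv                                      # project each dimension onto the basis

def reconstruct_kv(coeffs: np.ndarray, seq_len: int) -> np.ndarray:
    """Approximate the original (seq_len, dim) slice from its k coefficients."""
    B = fourier_basis(seq_len, coeffs.shape[0])
    return B @ coeffs

# Toy usage: a smooth 4096-token trace per dimension compressed to 64 coefficients.
rng = np.random.default_rng(0)
kv = np.cumsum(rng.standard_normal((4096, 8)), axis=0) / 50.0
coeffs = compress_kv(kv, k=64)
approx = reconstruct_kv(coeffs, seq_len=4096)
print("relative reconstruction error:", np.linalg.norm(approx - kv) / np.linalg.norm(kv))

In a setup like this, the coefficients would presumably be computed once at prefill and the basis regenerated or cached at decode time, so the stored state per compressed dimension is k values regardless of context length; the dimensions that carry long-range dependencies would be kept in the cache uncompressed.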