Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache

Abstract

Large language models (LLMs) struggle with the memory demands of the Key-Value (KV) cache, which grows as context length increases. Existing compression methods either homogenize the head dimensions or rely on attention-guided token pruning, and they often sacrifice accuracy or introduce computational overhead. We propose FourierAttention, a training-free framework that exploits the heterogeneous roles of transformer head dimensions: lower dimensions prioritize local context, while upper dimensions capture long-range dependencies. By projecting the long-context-insensitive dimensions onto orthogonal Fourier bases, FourierAttention approximates their temporal evolution with fixed-length spectral coefficients. Evaluations on LLaMA models show that FourierAttention achieves the best long-context accuracy on LongBench and Needle-In-A-Haystack (NIAH). In addition, we design a custom Triton kernel, FlashFourierAttention, that optimizes memory through streamlined read-write operations, enabling efficient deployment without compromising performance.
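
The core idea of replacing a dimension's history along the sequence axis with a fixed number of spectral coefficients can be illustrated with a minimal NumPy sketch. This is not the FourierAttention implementation: the real-valued FFT basis, the choice of the lower half of the head dimensions as the compressed set, the synthetic cache, and the names `compress_dims` and `reconstruct_dims` are all assumptions made purely for illustration.

```python
import numpy as np

def compress_dims(kv, dims, num_coeffs):
    """Project the selected dimensions onto a Fourier basis along the time axis
    and keep only the `num_coeffs` lowest-frequency coefficients.
    (Hypothetical helper; the paper's actual basis and selection may differ.)"""
    spectrum = np.fft.rfft(kv[:, dims], axis=0)  # shape: (T // 2 + 1, len(dims))
    return spectrum[:num_coeffs]                 # fixed length, independent of T

def reconstruct_dims(coeffs, seq_len):
    """Approximate the original time series from the truncated coefficients."""
    full = np.zeros((seq_len // 2 + 1, coeffs.shape[1]), dtype=complex)
    full[: coeffs.shape[0]] = coeffs             # discarded frequencies stay zero
    return np.fft.irfft(full, n=seq_len, axis=0)

# Synthetic stand-in for one head's cache: smooth signal plus small noise.
rng = np.random.default_rng(0)
seq_len, head_dim, num_coeffs = 1024, 8, 16
t = np.arange(seq_len)[:, None]
kv = np.cos(2 * np.pi * 3 * t / seq_len + np.arange(head_dim))
kv = kv + 0.01 * rng.standard_normal((seq_len, head_dim))

# Assumption: treat the lower half of the dimensions as the compressible ones.
dims = np.arange(head_dim // 2)
coeffs = compress_dims(kv, dims, num_coeffs)
approx = reconstruct_dims(coeffs, seq_len)

rel_err = np.linalg.norm(approx - kv[:, dims]) / np.linalg.norm(kv[:, dims])
print(coeffs.shape, approx.shape, rel_err)
```

In this toy setting the compressed dimensions are stored as 16 complex coefficients each instead of 1024 real values, and the storage cost does not grow with sequence length. How well such a truncation works on real caches depends on how smooth the selected dimensions are over time, which is the property the abstract attributes to the long-context-insensitive dimensions.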
