FASA: Frequency-aware Sparse Attention
February 3, 2026
Authors: Yifei Wang, Yueqi Wang, Zhenrui Yue, Huimin Zeng, Yong Wang, Ismini Lourentzou, Zhengzhong Tu, Xiangxiang Chu, Julian McAuley
cs.AI
Abstract
The deployment of Large Language Models (LLMs) faces a critical bottleneck when handling lengthy inputs: the prohibitive memory footprint of the Key-Value (KV) cache. To address this bottleneck, the token-pruning paradigm leverages attention sparsity to selectively retain a small, critical subset of tokens. However, existing approaches fall short: static methods risk irreversible information loss, while dynamic strategies rely on heuristics that insufficiently capture the query-dependent nature of token importance. We propose FASA, a novel framework that achieves query-aware token eviction by dynamically predicting token importance. FASA stems from a novel insight into RoPE (Rotary Position Embedding): the discovery of functional sparsity at the frequency-chunk (FC) level. Our key finding is that a small, identifiable subset of "dominant" FCs consistently exhibits high contextual agreement with the full attention head, providing a robust and computationally free proxy for identifying salient tokens. Building on this insight, FASA first identifies a critical set of tokens using the dominant FCs, and then performs focused attention computation solely on this pruned subset. Because it accesses only a small fraction of the KV cache, FASA drastically lowers memory-bandwidth requirements and computational cost. Across a spectrum of long-context tasks, from sequence modeling to complex CoT reasoning, FASA consistently outperforms all token-eviction baselines and achieves near-oracle accuracy, demonstrating remarkable robustness even under strict budget constraints. Notably, on LongBench-V1, FASA reaches nearly 100% of full-KV performance while retaining only 256 tokens, and achieves a 2.56× speedup using just 18.9% of the cache on AIME24.
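To make the two-stage idea in the abstract concrete, below is a minimal NumPy sketch of dominant-FC scoring followed by focused attention. It is an illustration under stated assumptions, not the paper's implementation: the chunk size, the choice of dominant chunks, and names such as `fasa_attention_sketch` and `dominant_chunks` are hypothetical. The sketch treats each frequency chunk as a contiguous slice of the head dimension of RoPE-encoded queries and keys, scores tokens using only the dominant slices, and then runs exact softmax attention over the top-budget tokens.

```python
# Hypothetical sketch of FC-based token selection + focused attention.
# Assumptions: RoPE is already applied to q and K; one "frequency chunk"
# = a contiguous slice of `chunk_size` head dimensions; which chunks are
# dominant is given by the caller (the paper identifies these empirically).
import numpy as np

def fasa_attention_sketch(q, K, V, dominant_chunks, chunk_size=8, budget=256):
    """q: (d,) query; K, V: (n, d) cached keys/values.
    dominant_chunks: indices of frequency chunks used for cheap scoring."""
    n, d = K.shape
    # Stage 1: approximate token importance using only the dominant chunks,
    # so scoring touches only a small fraction of the KV cache.
    dims = np.concatenate([np.arange(c * chunk_size, (c + 1) * chunk_size)
                           for c in dominant_chunks])
    approx_scores = K[:, dims] @ q[dims]                 # (n,)
    keep = np.argsort(approx_scores)[-min(budget, n):]   # top-`budget` tokens
    # Stage 2: exact softmax attention over the pruned subset only.
    scores = K[keep] @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V[keep]

# Toy usage: 4096 cached tokens, head dim 64 (8 chunks of 8 dims);
# pretend chunks 0 and 1 are the dominant ones for this head.
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K = rng.standard_normal((4096, 64))
V = rng.standard_normal((4096, 64))
out = fasa_attention_sketch(q, K, V, dominant_chunks=[0, 1])
print(out.shape)  # (64,)
```

The design point the sketch conveys is that Stage 1 is a cheap dot product over a few dimensions per cached key, while the expensive full-dimension attention in Stage 2 runs over at most `budget` tokens, which is where the memory-bandwidth and compute savings come from.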