FASA: Frequency-aware Sparse Attention

February 3, 2026
Authors: Yifei Wang, Yueqi Wang, Zhenrui Yue, Huimin Zeng, Yong Wang, Ismini Lourentzou, Zhengzhong Tu, Xiangxiang Chu, Julian McAuley
cs.AI

Abstract

The deployment of Large Language Models (LLMs) faces a critical bottleneck when handling lengthy inputs: the prohibitive memory footprint of the Key-Value (KV) cache. To address this bottleneck, the token-pruning paradigm leverages attention sparsity to selectively retain a small, critical subset of tokens. However, existing approaches fall short: static methods risk irreversible information loss, and dynamic strategies employ heuristics that insufficiently capture the query-dependent nature of token importance. We propose FASA, a novel framework that achieves query-aware token eviction by dynamically predicting token importance. FASA stems from a novel insight into RoPE: the discovery of functional sparsity at the frequency-chunk (FC) level. Our key finding is that a small, identifiable subset of "dominant" FCs consistently exhibits high contextual agreement with the full attention head, providing a robust and computationally free proxy for identifying salient tokens. Building on this insight, FASA first identifies a critical set of tokens using the dominant FCs and then performs focused attention computation solely on this pruned subset. Because it accesses only a small fraction of the KV cache, FASA drastically lowers memory-bandwidth requirements and computational cost. Across a spectrum of long-context tasks, from sequence modeling to complex chain-of-thought (CoT) reasoning, FASA consistently outperforms all token-eviction baselines and achieves near-oracle accuracy, demonstrating remarkable robustness even under constrained budgets. Notably, on LongBench-V1, FASA reaches nearly 100% of full-KV performance while keeping only 256 tokens, and achieves a 2.56× speedup using just 18.9% of the cache on AIME24.
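
To make the two-stage mechanism described above concrete, here is a minimal single-head, single-query sketch in PyTorch: score cached tokens using only the dominant-FC dimensions, keep the top-budget tokens, and run attention over that pruned subset. The function name `fasa_sketch`, the precomputed `dominant_dims` index set, and the tensor shapes are illustrative assumptions, not the paper's implementation.

```python
import torch

def fasa_sketch(q, k, v, dominant_dims, budget):
    """Illustrative sketch (not the official FASA code) of query-aware
    token eviction via dominant frequency chunks.

    q: (d,) query for the current decode step
    k, v: (n, d) cached keys and values
    dominant_dims: indices of head dimensions in the dominant FCs
                   (assumed identified offline)
    budget: number of tokens to retain
    """
    # Stage 1: cheap proxy scores using only the dominant-FC dimensions,
    # standing in for the full attention logits.
    proxy_scores = k[:, dominant_dims] @ q[dominant_dims]          # (n,)

    # Keep the `budget` highest-scoring tokens.
    keep = torch.topk(proxy_scores, k=min(budget, k.size(0))).indices

    # Stage 2: focused attention over the pruned subset only, so the
    # full-width KV cache is read for just `budget` tokens.
    logits = k[keep] @ q / q.size(0) ** 0.5                        # (budget,)
    weights = torch.softmax(logits, dim=-1)
    return weights @ v[keep]                                       # (d,)
```

In this sketch, Stage 1 touches only a narrow slice of the key cache, which is where the memory-bandwidth savings claimed in the abstract would come from; Stage 2 is ordinary scaled dot-product attention restricted to the retained tokens.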