Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning
August 9, 2025
作者: Lijie Yang, Zhihao Zhang, Arti Jain, Shijie Cao, Baihong Yuan, Yiwei Chen, Zhihao Jia, Ravi Netravali
cs.AI
Abstract
Large reasoning models achieve strong performance through test-time scaling
but incur substantial computational overhead, particularly from excessive token
generation when processing short input prompts. While sparse attention
mechanisms can reduce latency and memory usage, existing approaches suffer from
significant accuracy degradation due to accumulated errors during
long-generation reasoning. These methods generally require either high token
retention rates or expensive retraining. We introduce LessIsMore, a
training-free sparse attention mechanism for reasoning tasks, which leverages
global attention patterns rather than relying on traditional head-specific
local optimizations. LessIsMore aggregates token selections from local
attention heads with recent contextual information, enabling unified cross-head
token ranking for future decoding layers. This unified selection improves
generalization and efficiency by avoiding the need to maintain separate token
subsets per head. Evaluation across diverse reasoning tasks and benchmarks
shows that LessIsMore preserves -- and in some cases improves -- accuracy while
achieving a 1.1× average decoding speed-up compared to full attention.
Moreover, LessIsMore attends to 2× fewer tokens without accuracy loss,
achieving a 1.13× end-to-end speed-up compared to existing sparse
attention methods.
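
The abstract describes aggregating per-head token selections into a single cross-head ranking that is combined with recent context and then shared by all heads in later decoding. The sketch below is a minimal, hypothetical illustration of that idea, not the paper's implementation; the function name `unified_token_selection`, the tensor shapes, and the `budget` and `recent_window` parameters are assumptions made for the example.

```python
# Minimal sketch of unified cross-head sparse token selection,
# assuming per-head attention scores for the current query are available.
import torch


def unified_token_selection(attn_scores: torch.Tensor,
                            budget: int,
                            recent_window: int) -> torch.Tensor:
    """
    attn_scores: [num_heads, seq_len] attention weights of the current
                 query against all cached tokens, one row per head.
    Returns indices of the single token subset shared by every head.
    """
    num_heads, seq_len = attn_scores.shape

    # Aggregate head-specific scores into one global ranking
    # (here, a simple sum across heads).
    global_scores = attn_scores.sum(dim=0)

    # Always keep the most recent tokens for local context.
    recent = torch.arange(max(seq_len - recent_window, 0), seq_len)

    # Fill the remaining budget with the globally top-ranked tokens,
    # excluding the recent tokens already selected.
    global_scores[recent] = float("-inf")
    remaining = max(budget - recent.numel(), 0)
    k = min(remaining, seq_len - recent.numel())
    top = torch.topk(global_scores, k).indices

    # One shared token subset for all heads, in ascending position order.
    return torch.sort(torch.cat([recent, top])).values
```

In this reading, the single returned index set replaces per-head token subsets during subsequent decoding, which is what the abstract credits for the improved generalization and efficiency; the aggregation rule (a sum over heads) is only one plausible choice.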