Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning
August 9, 2025
作者: Lijie Yang, Zhihao Zhang, Arti Jain, Shijie Cao, Baihong Yuan, Yiwei Chen, Zhihao Jia, Ravi Netravali
cs.AI
Abstract
Large reasoning models achieve strong performance through test-time scaling
but incur substantial computational overhead, particularly from excessive token
generation when processing short input prompts. While sparse attention
mechanisms can reduce latency and memory usage, existing approaches suffer from
significant accuracy degradation due to accumulated errors during
long-generation reasoning. These methods generally require either high token
retention rates or expensive retraining. We introduce LessIsMore, a
training-free sparse attention mechanism for reasoning tasks, which leverages
global attention patterns rather than relying on traditional head-specific
local optimizations. LessIsMore aggregates token selections from local
attention heads with recent contextual information, enabling unified cross-head
token ranking for future decoding layers. This unified selection improves
generalization and efficiency by avoiding the need to maintain separate token
subsets per head. Evaluation across diverse reasoning tasks and benchmarks
shows that LessIsMore preserves -- and in some cases improves -- accuracy while
achieving a 1.1× average decoding speed-up compared to full attention.
Moreover, LessIsMore attends to 2× fewer tokens without accuracy loss,
achieving a 1.13× end-to-end speed-up compared to existing sparse
attention methods.
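
The abstract describes aggregating per-head token selections into a single cross-head ranking that is combined with recent context and then shared by all heads in later decoding. The sketch below is a minimal, hypothetical illustration of that idea, not the paper's implementation; the function name `unified_token_selection`, the tensor shapes, and the `budget` and `recent_window` parameters are assumptions made for the example.

```python
# Minimal sketch of unified cross-head sparse token selection,
# assuming per-head attention scores for the current query are available.
import torch


def unified_token_selection(attn_scores: torch.Tensor,
                            budget: int,
                            recent_window: int) -> torch.Tensor:
    """
    attn_scores: [num_heads, seq_len] attention weights of the current
                 query against all cached tokens, one row per head.
    Returns indices of the single token subset shared by every head.
    """
    num_heads, seq_len = attn_scores.shape

    # Aggregate head-specific scores into one global ranking
    # (here, a simple sum across heads).
    global_scores = attn_scores.sum(dim=0)

    # Always keep the most recent tokens for local context.
    recent = torch.arange(max(seq_len - recent_window, 0), seq_len)

    # Fill the remaining budget with the globally top-ranked tokens,
    # excluding the recent tokens already selected.
    global_scores[recent] = float("-inf")
    remaining = max(budget - recent.numel(), 0)
    k = min(remaining, seq_len - recent.numel())
    top = torch.topk(global_scores, k).indices

    # One shared token subset for all heads, in ascending position order.
    return torch.sort(torch.cat([recent, top])).values
```

In this reading, the single returned index set replaces per-head token subsets during subsequent decoding, which is what the abstract credits for the improved generalization and efficiency; the aggregation rule (a sum over heads) is only one plausible choice.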