Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning
August 9, 2025
作者: Lijie Yang, Zhihao Zhang, Arti Jain, Shijie Cao, Baihong Yuan, Yiwei Chen, Zhihao Jia, Ravi Netravali
cs.AI
Abstract
Large reasoning models achieve strong performance through test-time scaling
but incur substantial computational overhead, particularly from excessive token
generation when processing short input prompts. While sparse attention
mechanisms can reduce latency and memory usage, existing approaches suffer from
significant accuracy degradation due to accumulated errors during
long-generation reasoning. These methods generally require either high token
retention rates or expensive retraining. We introduce LessIsMore, a
training-free sparse attention mechanism for reasoning tasks, which leverages
global attention patterns rather than relying on traditional head-specific
local optimizations. LessIsMore aggregates token selections from local
attention heads with recent contextual information, enabling unified cross-head
token ranking for future decoding layers. This unified selection improves
generalization and efficiency by avoiding the need to maintain separate token
subsets per head. Evaluation across diverse reasoning tasks and benchmarks
shows that LessIsMore preserves -- and in some cases improves -- accuracy while
achieving a 1.1× average decoding speed-up compared to full attention.
Moreover, LessIsMore attends to 2× fewer tokens without accuracy loss,
achieving a 1.13× end-to-end speed-up compared to existing sparse
attention methods.
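
The abstract describes a unified, cross-head token ranking that combines aggregated head-level attention with a window of recent tokens. The snippet below is a minimal, hypothetical sketch of such a selection step in PyTorch; the function name, its arguments, and the score-summing aggregation rule are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def lessismore_token_selection(attn_scores, recent_window, budget):
    """
    Hypothetical sketch of unified cross-head sparse token selection.

    attn_scores: [num_heads, seq_len] attention weights from the current
                 decoding step (softmax-normalized per head).
    recent_window: number of most recent tokens that are always kept.
    budget: total number of tokens retained for future decoding steps.
    """
    num_heads, seq_len = attn_scores.shape

    # Aggregate per-head scores into one global ranking instead of
    # maintaining a separate token subset per head (assumed rule: sum).
    global_scores = attn_scores.sum(dim=0)  # [seq_len]

    # Always keep the most recent tokens (local context).
    recent_idx = torch.arange(max(seq_len - recent_window, 0), seq_len)

    # Exclude recent tokens from the ranked pool, then fill the rest
    # of the budget with the highest-scoring remaining tokens.
    global_scores[recent_idx] = float("-inf")
    k = max(budget - recent_idx.numel(), 0)
    top_idx = torch.topk(global_scores, k).indices

    # One sorted token set shared by all heads in later decoding steps.
    return torch.cat([top_idx, recent_idx]).sort().values


# Toy usage: 4 heads, 32-token context, keep 8 tokens (4 recent + 4 ranked).
scores = torch.softmax(torch.randn(4, 32), dim=-1)
print(lessismore_token_selection(scores, recent_window=4, budget=8))
```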