Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning

Abstract

Large reasoning models achieve strong performance through test-time scaling but incur substantial computational overhead, particularly from excessive token generation when processing short input prompts. Sparse attention mechanisms can reduce latency and memory usage, but existing approaches suffer significant accuracy degradation because errors accumulate during long-generation reasoning. These methods generally require either high token retention rates or expensive retraining. We introduce LessIsMore, a training-free sparse attention mechanism for reasoning tasks that leverages global attention patterns rather than relying on traditional head-specific local optimizations. LessIsMore aggregates token selections from local attention heads with recent contextual information, enabling unified cross-head token ranking for future decoding layers. This unified selection improves generalization and efficiency by avoiding the need to maintain separate token subsets per head. Evaluation across diverse reasoning tasks and benchmarks shows that LessIsMore preserves, and in some cases improves, accuracy while achieving a 1.1× average decoding speed-up compared to full attention. Moreover, LessIsMore attends to 2× fewer tokens without accuracy loss, achieving a 1.13× end-to-end speed-up compared to existing sparse attention methods.
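The selection step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how per-head selections are merged, so the merging rule used here (rank each token by its best position in any head's top list) is an assumption, as are the function name `unified_token_selection` and the parameters `budget` and `recent_window`. The sketch only shows the general idea of replacing per-head token subsets with one shared subset that always includes the most recent tokens.

```python
def unified_token_selection(head_scores, budget, recent_window):
    """Pick one shared set of token indices for all heads (illustrative only).

    head_scores: list of per-head score lists, each of length seq_len.
    budget: total number of tokens to keep.
    recent_window: number of most recent tokens that are always kept.
    """
    seq_len = len(head_scores[0])
    budget = min(budget, seq_len)
    recent_window = min(recent_window, budget)

    # Always keep the most recent tokens (the "recent contextual information").
    selected = set(range(seq_len - recent_window, seq_len))
    remaining = budget - recent_window

    # Each head nominates its own top-scoring older tokens.
    older = range(seq_len - recent_window)
    best_rank = {}
    for scores in head_scores:
        top = sorted(older, key=lambda i: (-scores[i], i))[:remaining]
        for rank, idx in enumerate(top):
            best_rank[idx] = min(rank, best_rank.get(idx, rank))

    # Assumed merging rule: order nominated tokens by their best rank in
    # any head (ties broken by index) to form one cross-head ranking.
    ranking = sorted(best_rank, key=lambda i: (best_rank[i], i))
    selected.update(ranking[:remaining])
    return sorted(selected)


# Two heads, six cached tokens, keep four tokens, two of them recent.
example_scores = [
    [0.9, 0.1, 0.05, 0.2, 0.3, 0.4],
    [0.1, 0.8, 0.05, 0.1, 0.2, 0.3],
]
print(unified_token_selection(example_scores, budget=4, recent_window=2))
# prints [0, 1, 4, 5]
```

Every head then attends to the same index list, so only one token subset needs to be kept rather than one per head, which is the efficiency argument the abstract makes.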
