Rectified Sparse Attention

Abstract

Efficient long-sequence generation is a critical challenge for Large Language Models. Recent sparse decoding methods improve efficiency, but they suffer from KV cache misalignment: approximation errors accumulate in the cache and degrade generation quality. In this work, we propose Rectified Sparse Attention (ReSA), a simple yet effective method that combines block-sparse attention with periodic dense rectification. By refreshing the KV cache at fixed intervals with a dense forward pass, ReSA bounds error accumulation and preserves alignment with the pretraining distribution. Experiments across math reasoning, language modeling, and retrieval tasks demonstrate that ReSA achieves near-lossless generation quality with significantly improved efficiency. Notably, ReSA delivers up to a 2.42× end-to-end speedup when decoding at a 256K sequence length, making it a practical solution for scalable long-context inference. Code is available at https://aka.ms/ReSA-LM.
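
To illustrate why refreshing the KV cache at fixed intervals bounds error accumulation, here is a toy Python simulation. It is only a sketch under stated assumptions, not the ReSA implementation: the function name `simulate_rectified_decoding`, the `drift` parameter, and the linear error model are all hypothetical, and no real attention is computed. Each sparse step is assumed to write a cache entry whose error grows with the number of not-yet-rectified entries, and each dense pass is assumed to recompute those entries exactly.

```python
from typing import List, Tuple


def simulate_rectified_decoding(
    num_tokens: int, interval: int, drift: float = 0.01
) -> Tuple[List[float], int]:
    """Toy model of sparse decoding with periodic dense rectification.

    Returns the per-entry cache error (hypothetical units) and the largest
    number of unrectified cache entries that existed at any one time.
    """
    if interval < 1:
        raise ValueError("interval must be >= 1")

    cache_error: List[float] = []
    pending = 0       # entries written since the last dense pass
    max_pending = 0

    for _ in range(num_tokens):
        # Sparse step (assumed error model): the new entry inherits an
        # error proportional to how many entries are still unrectified.
        pending += 1
        cache_error.append(drift * pending)
        max_pending = max(max_pending, pending)

        if pending == interval:
            # Dense rectification (assumption): a dense forward pass
            # recomputes the pending entries exactly, zeroing their error.
            for i in range(len(cache_error) - pending, len(cache_error)):
                cache_error[i] = 0.0
            pending = 0

    return cache_error, max_pending


errors, max_pending = simulate_rectified_decoding(num_tokens=10, interval=4)
print(max_pending)                 # 4
print(sum(e > 0 for e in errors))  # 2
```

In this toy model, the number of approximate cache entries never exceeds the rectification interval, no matter how long generation runs. Setting `interval` larger than `num_tokens` disables rectification, and the count of approximate entries then grows with sequence length.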