Rectified Sparse Attention
June 4, 2025
Authors: Yutao Sun, Tianzhu Ye, Li Dong, Yuqing Xia, Jian Chen, Yizhao Gao, Shijie Cao, Jianyong Wang, Furu Wei
cs.AI
Abstract
Efficient long-sequence generation is a critical challenge for Large Language
Models. While recent sparse decoding methods improve efficiency, they suffer
from KV cache misalignment, where approximation errors accumulate and degrade
generation quality. In this work, we propose Rectified Sparse Attention (ReSA),
a simple yet effective method that combines block-sparse attention with
periodic dense rectification. By refreshing the KV cache at fixed intervals
using a dense forward pass, ReSA bounds error accumulation and preserves
alignment with the pretraining distribution. Experiments across math reasoning,
language modeling, and retrieval tasks demonstrate that ReSA achieves
near-lossless generation quality with significantly improved efficiency.
Notably, ReSA delivers up to 2.42× end-to-end speedup when decoding at
256K sequence length, making it a practical solution for scalable long-context
inference. Code is available at https://aka.ms/ReSA-LM.
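
To make the mechanism described in the abstract concrete, below is a minimal NumPy sketch of the decode loop: generate with block-sparse attention, and every fixed number of steps run a dense forward pass that re-encodes the already-generated tokens and overwrites their KV-cache entries with exact values. The toy two-layer model, the mean-pooled-key block-selection heuristic, and all names and constants (`sparse_attend`, `RECTIFY_EVERY`, block size, interval) are illustrative assumptions for this sketch, not the paper's actual architecture, kernels, or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D, LAYERS = 64, 16, 2   # toy vocabulary, hidden size, number of layers
BLOCK = 8                      # KV block size for block-sparse selection
TOPK_BLOCKS = 2                # KV blocks kept per sparse decode step
RECTIFY_EVERY = 16             # dense rectification interval (decode steps)

# Toy multi-layer "model": random projections standing in for a transformer.
E = rng.standard_normal((VOCAB, D)) / np.sqrt(D)
W = [{n: rng.standard_normal((D, D)) / np.sqrt(D) for n in "qkvo"}
     for _ in range(LAYERS)]

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def attend(q, K, V):
    """Dense scaled dot-product attention for a single query vector."""
    return softmax(K @ q / np.sqrt(D)) @ V

def sparse_attend(q, K, V):
    """Block-sparse attention: score blocks by mean-pooled keys, keep top blocks."""
    n_blocks = (len(K) + BLOCK - 1) // BLOCK
    if n_blocks <= TOPK_BLOCKS:
        return attend(q, K, V)
    means = np.stack([K[b*BLOCK:(b+1)*BLOCK].mean(0) for b in range(n_blocks)])
    keep = np.sort(np.argsort(means @ q)[-TOPK_BLOCKS:])
    idx = np.concatenate([np.arange(b*BLOCK, min((b+1)*BLOCK, len(K)))
                          for b in keep])
    return attend(q, K[idx], V[idx])

def forward(token, cache, attn_fn):
    """Run one token through all layers, appending its k/v to each layer's cache."""
    h = E[token]
    for layer, (K, V) in zip(W, cache):
        q, k, v = layer["q"] @ h, layer["k"] @ h, layer["v"] @ h
        K.append(k); V.append(v)
        h = h + np.tanh(layer["o"] @ attn_fn(q, np.stack(K), np.stack(V)))
    return E @ h                                  # logits over the toy vocabulary

# Prefill a random prompt with dense attention.
prompt = list(rng.integers(0, VOCAB, size=32))
cache = [([], []) for _ in range(LAYERS)]          # (keys, values) per layer
for tok in prompt:
    logits = forward(tok, cache, attend)

# Sparse decoding with periodic dense rectification of the generated suffix.
tokens = prompt.copy()
for t in range(1, 65):
    tokens.append(int(np.argmax(logits)))          # greedy next token
    logits = forward(tokens[-1], cache, sparse_attend)
    if t % RECTIFY_EVERY == 0:
        # Dense forward pass over the generated tokens: re-encode the *fixed*
        # token ids with full attention, overwriting their approximate KV
        # entries so sparse-attention errors cannot keep accumulating.
        for K, V in cache:
            del K[len(prompt):]
            del V[len(prompt):]
        for tok in tokens[len(prompt):]:
            logits = forward(tok, cache, attend)

print("generated suffix:", tokens[len(prompt):])
```

In this sketch only the generated suffix is re-encoded at each rectification, since the prompt's KV entries already come from a dense prefill; with more than one layer, the upper layers' KV entries depend on lower-layer attention outputs, so they are exactly the entries that drift under sparse decoding and get refreshed by the dense pass.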