Rectified Sparse Attention
June 4, 2025
Authors: Yutao Sun, Tianzhu Ye, Li Dong, Yuqing Xia, Jian Chen, Yizhao Gao, Shijie Cao, Jianyong Wang, Furu Wei
cs.AI
Abstract
Efficient long-sequence generation is a critical challenge for Large Language
Models. While recent sparse decoding methods improve efficiency, they suffer
from KV cache misalignment, where approximation errors accumulate and degrade
generation quality. In this work, we propose Rectified Sparse Attention (ReSA),
a simple yet effective method that combines block-sparse attention with
periodic dense rectification. By refreshing the KV cache at fixed intervals
using a dense forward pass, ReSA bounds error accumulation and preserves
alignment with the pretraining distribution. Experiments across math reasoning,
language modeling, and retrieval tasks demonstrate that ReSA achieves
near-lossless generation quality with significantly improved efficiency.
Notably, ReSA delivers up to 2.42× end-to-end speedup under decoding at
256K sequence length, making it a practical solution for scalable long-context
inference. Code is available at https://aka.ms/ReSA-LM.
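The abstract describes the decoding procedure only at a high level: block-sparse decoding steps interleaved with a dense KV-cache refresh at a fixed interval. The Python sketch below illustrates one plausible reading of that control flow; the model interface (prefill, dense_refresh, sparse_decode_step) and the interval value of 32 are hypothetical placeholders, not the authors' implementation.

```python
from typing import List

def resa_generate(model, prompt_ids: List[int], max_new_tokens: int,
                  rectify_interval: int = 32) -> List[int]:
    """Sketch of ReSA-style decoding: sparse steps + periodic dense rectification.

    `model.prefill`, `model.dense_refresh`, and `model.sparse_decode_step`
    are hypothetical methods used only to show the control flow.
    """
    # Dense prefill builds an exact KV cache for the prompt.
    kv_cache, next_id = model.prefill(prompt_ids)
    generated = [next_id]

    for step in range(1, max_new_tokens):
        if step % rectify_interval == 0:
            # Periodic dense rectification: recompute the KV entries of the
            # generated tokens with a dense forward pass, so approximation
            # error from sparse decoding cannot accumulate without bound.
            kv_cache = model.dense_refresh(prompt_ids + generated)
        # Block-sparse decoding: the new token attends only to a selected
        # subset of KV blocks in the cache.
        next_id, kv_cache = model.sparse_decode_step(generated[-1], kv_cache)
        generated.append(next_id)

    return generated
```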