ChatPaper.ai

Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction

May 16, 2025
作者: Jeffrey Willette, Heejun Lee, Sung Ju Hwang
cs.AI

Abstract

The attention mechanism of a transformer has quadratic complexity, leading to high inference costs and latency for long sequences. However, attention matrices are mostly sparse, which implies that many entries may be omitted from computation for efficient inference. Sparse attention inference methods aim to reduce this computational burden; however, they also come with a troublesome performance degradation. We discover that one reason for this degradation is that the sparse calculation induces a distributional shift in the attention outputs. The distributional shift causes decoding-time queries to fail to align well with the appropriate keys from the prefill stage, leading to a drop in performance. We propose a simple, novel, and effective procedure for correcting this distributional shift, bringing the distribution of sparse attention outputs closer to that of quadratic attention. Our method can be applied on top of any sparse attention method, and results in an average 36-percentage-point performance increase, recovering 88% of quadratic attention accuracy on the 131K RULER benchmark when applied on top of sliding window attention with sink tokens, while adding only a small overhead. Our method can maintain approximately 98.5% sparsity over full quadratic attention, making our model 32 times faster than Flash Attention 2 when processing 1M token prefills.
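As a rough illustration of the sparsity pattern the abstract refers to (sliding window attention with sink tokens), the sketch below builds such a causal mask in NumPy and measures its sparsity. The window and sink sizes here are hypothetical choices for illustration, not the paper's settings:

```python
import numpy as np

def sink_window_mask(seq_len: int, window: int, num_sinks: int) -> np.ndarray:
    """Boolean causal mask keeping only the first `num_sinks` (sink) tokens
    and the `window` most recent keys for each query. Parameter values are
    illustrative, not taken from the paper."""
    q = np.arange(seq_len)[:, None]   # query positions (rows)
    k = np.arange(seq_len)[None, :]   # key positions (columns)
    causal = k <= q                   # no attending to the future
    in_window = (q - k) < window      # recent keys within the sliding window
    is_sink = k < num_sinks           # leading tokens that are always attended
    return causal & (in_window | is_sink)

def sparsity(mask: np.ndarray) -> float:
    """Fraction of the full causal attention matrix that is skipped."""
    n = mask.shape[0]
    return 1.0 - mask.sum() / (n * (n + 1) / 2)

mask = sink_window_mask(seq_len=4096, window=64, num_sinks=4)
print(f"sparsity vs. full causal attention: {sparsity(mask):.3f}")
```

Sparsity grows with sequence length under this pattern, since the kept entries per query stay roughly constant, which is how figures like 98.5% sparsity become possible at million-token prefills.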


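The abstract does not spell out the correction procedure itself. Purely to illustrate the underlying idea of shifting sparse attention outputs toward the full-attention distribution, one hypothetical sketch is to compute exact attention for a small probe set of queries, measure the mean output gap there, and apply that delta to all sparse outputs. Every name and step below is an assumption for illustration, not the paper's actual method:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    """Single-head scaled dot-product attention; True entries in `mask` are kept."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v

def delta_corrected(q, k, v, sparse_mask, probe_idx):
    """Hypothetical delta correction: measure the gap between full and sparse
    outputs on a probe subset of queries, then shift every sparse output by
    the mean gap. Illustrates distribution matching only; not the paper's
    exact procedure."""
    sparse_out = attention(q, k, v, sparse_mask)
    full_probe = attention(q[probe_idx], k, v)            # exact outputs on probes
    delta = (full_probe - sparse_out[probe_idx]).mean(axis=0)
    return sparse_out + delta
```

By construction, the corrected outputs' mean over the probe set matches full attention's mean there, which is the kind of first-moment alignment a distribution-shift correction targets.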