
XAttention: Block Sparse Attention with Antidiagonal Scoring

March 20, 2025
Authors: Ruyi Xu, Guangxuan Xiao, Haofeng Huang, Junxian Guo, Song Han
cs.AI

Abstract

Long-Context Transformer Models (LCTMs) are vital for real-world applications but suffer high computational costs due to attention's quadratic complexity. Block-sparse attention mitigates this by focusing computation on critical regions, yet existing methods struggle to balance accuracy and efficiency due to costly block importance measurements. In this paper, we introduce XAttention, a plug-and-play framework that dramatically accelerates long-context inference in Transformer models using sparse attention. XAttention's key innovation is the insight that the sum of antidiagonal values (i.e., from the lower-left to upper-right) in the attention matrix provides a powerful proxy for block importance. This allows for precise identification and pruning of non-essential blocks, resulting in high sparsity and dramatically accelerated inference. Across comprehensive evaluations on demanding long-context benchmarks, including RULER and LongBench for language, VideoMME for video understanding, and VBench for video generation, XAttention achieves accuracy comparable to full attention while delivering substantial computational gains. We demonstrate up to 13.5x acceleration in attention computation. These results underscore XAttention's ability to unlock the practical potential of block-sparse attention, paving the way for scalable and efficient deployment of LCTMs in real-world applications. Code is available at https://github.com/mit-han-lab/x-attention.
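The antidiagonal-scoring idea can be illustrated with a short sketch. The snippet below is a minimal illustration, not the official XAttention implementation: it scores each block of a precomputed attention-score matrix by summing its antidiagonal entries and then keeps the highest-scoring key blocks for each query-block row. The block size, the keep_ratio parameter, the function names, and the top-k selection are simplifying assumptions for exposition; the paper's stride-based scoring and threshold-based block selection are in the linked repository.

```python
# Minimal sketch of antidiagonal block scoring (illustrative only; `block`,
# `keep_ratio`, and the top-k selection are assumptions, not the paper's exact
# procedure).
import torch

def antidiagonal_block_scores(attn: torch.Tensor, block: int) -> torch.Tensor:
    """Score each (block x block) tile of an attention matrix by summing its
    antidiagonal (lower-left to upper-right) entries."""
    n_q, n_k = attn.shape
    assert n_q % block == 0 and n_k % block == 0, "toy example assumes divisible sizes"
    # Split into a grid of tiles with shape (q_blocks, k_blocks, block, block).
    tiles = attn.reshape(n_q // block, block, n_k // block, block).permute(0, 2, 1, 3)
    # Antidiagonal positions satisfy row + col == block - 1.
    idx = torch.arange(block)
    antidiag = (idx[:, None] + idx[None, :]) == (block - 1)
    return (tiles * antidiag).sum(dim=(-2, -1))  # shape: (q_blocks, k_blocks)

def select_blocks(scores: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """Keep the highest-scoring key blocks for each query-block row; the remaining
    blocks would be skipped by a block-sparse attention kernel."""
    k = max(1, int(round(scores.shape[1] * keep_ratio)))
    top = scores.topk(k, dim=1).indices
    keep = torch.zeros_like(scores, dtype=torch.bool)
    rows = torch.arange(scores.shape[0]).unsqueeze(1)
    keep[rows, top] = True
    return keep

# Toy usage: a 256x256 attention matrix scored with 64x64 blocks.
attn = torch.softmax(torch.randn(256, 256) / 16.0, dim=-1)
mask = select_blocks(antidiagonal_block_scores(attn, block=64))
print(mask)
```

The appeal of the antidiagonal sum as a proxy is that each antidiagonal touches every row and every column of a block exactly once, so a cheap per-block scalar still reflects activity across the whole tile.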
