XAttention: Block Sparse Attention with Antidiagonal Scoring
March 20, 2025
Authors: Ruyi Xu, Guangxuan Xiao, Haofeng Huang, Junxian Guo, Song Han
cs.AI
Abstract
Long-Context Transformer Models (LCTMs) are vital for real-world applications
but suffer high computational costs due to attention's quadratic complexity.
Block-sparse attention mitigates this by focusing computation on critical
regions, yet existing methods struggle with balancing accuracy and efficiency
due to costly block importance measurements. In this paper, we introduce
XAttention, a plug-and-play framework that dramatically accelerates
long-context inference in Transformer models using sparse attention.
XAttention's key innovation is the insight that the sum of antidiagonal values
(i.e., from the lower-left to upper-right) in the attention matrix provides a
powerful proxy for block importance. This allows for precise identification and
pruning of non-essential blocks, resulting in high sparsity and dramatically
accelerated inference. Across comprehensive evaluations on demanding
long-context benchmarks, including RULER and LongBench for language, VideoMME
for video understanding, and VBench for video generation, XAttention achieves
accuracy comparable to full attention while delivering substantial
computational gains. We demonstrate up to 13.5x acceleration in attention
computation. These results underscore XAttention's ability to unlock the
practical potential of block sparse attention, paving the way for scalable and
efficient deployment of LCTMs in real-world applications. Code is available at
https://github.com/mit-han-lab/x-attention.
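To make the antidiagonal-scoring idea concrete, the sketch below scores each (query-block, key-block) pair by the sum of its antidiagonal attention values and keeps only the top-scoring key blocks per query block. This is a minimal illustration under stated assumptions, not the authors' implementation: the block size, the top-k selection rule, and the fact that it materializes the full attention matrix (which an efficient kernel would avoid) are simplifications for clarity.

```python
# Minimal, illustrative sketch of antidiagonal block scoring (not the official
# XAttention implementation). Block size, the top-k selection rule, and the
# dense attention matrix computed here are simplifying assumptions; a real
# block-sparse kernel would score blocks without materializing full attention.
import torch

def antidiagonal_block_scores(q, k, block_size=64):
    """Score each (query-block, key-block) pair by summing the antidiagonal
    (lower-left to upper-right) of its attention sub-matrix."""
    attn = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
    n_q, n_k = q.shape[-2] // block_size, k.shape[-2] // block_size
    scores = attn.new_zeros(attn.shape[:-2] + (n_q, n_k))
    for i in range(n_q):
        for j in range(n_k):
            blk = attn[..., i * block_size:(i + 1) * block_size,
                            j * block_size:(j + 1) * block_size]
            # Flip left-right so the antidiagonal becomes the main diagonal.
            scores[..., i, j] = blk.flip(-1).diagonal(dim1=-2, dim2=-1).sum(-1)
    return scores

def block_mask(scores, keep_ratio=0.1):
    """Keep the top-scoring fraction of key blocks for each query block
    (a simple selection rule assumed here for illustration)."""
    k = max(1, int(keep_ratio * scores.shape[-1]))
    idx = scores.topk(k, dim=-1).indices
    mask = torch.zeros_like(scores, dtype=torch.bool)
    return mask.scatter_(-1, idx, True)  # True = keep block, False = prune

# Example: one head, 1024 queries/keys of head dimension 64.
q, k = torch.randn(1, 1024, 64), torch.randn(1, 1024, 64)
mask = block_mask(antidiagonal_block_scores(q, k))  # shape (1, 16, 16)
```

In practice the resulting boolean mask would gate a block-sparse attention kernel so that pruned blocks are never computed; the sketch only shows how the antidiagonal sum serves as the block-importance proxy described in the abstract.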