
SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration

October 3, 2024
Authors: Jintao Zhang, Jia Wei, Pengle Zhang, Jun Zhu, Jianfei Chen
cs.AI

Abstract

The transformer architecture predominates across various models. As the heart of the transformer, attention has a computational complexity of O(N^2), compared to O(N) for linear transformations. When handling large sequence lengths, attention becomes the primary time-consuming component. Although quantization has proven to be an effective method for accelerating model inference, existing quantization methods primarily focus on optimizing the linear layers. In response, we first analyze the feasibility of quantization in attention in detail. Following that, we propose SageAttention, a highly efficient and accurate quantization method for attention. The OPS (operations per second) of our approach outperforms FlashAttention2 and xformers by about 2.1 times and 2.7 times, respectively. SageAttention also achieves higher accuracy than FlashAttention3. Comprehensive experiments confirm that our approach incurs almost no end-to-end metric loss across diverse models, including those for large language processing, image generation, and video generation.
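For intuition, the sketch below shows what 8-bit quantization of the QK^T stage of attention can look like in general: Q and K are mapped to INT8 with symmetric per-tensor scales, the score matrix is computed with integer arithmetic, and the scores are dequantized before the softmax while the P·V product stays in floating point. This is a minimal illustration of the technique only, not the SageAttention kernel; the function names, the per-tensor scaling granularity, and the CPU simulation of the integer matmul are assumptions made for clarity.

```python
# Illustrative sketch of 8-bit attention quantization (NOT the SageAttention kernel):
# symmetric per-tensor INT8 quantization of Q and K, integer QK^T, dequantize
# before the softmax, keep the softmax and P*V product in floating point.
import torch

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization: returns the int8 tensor and its scale."""
    scale = x.abs().amax() / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def int8_attention(q, k, v):
    """q, k, v: (seq_len, head_dim) float tensors for a single attention head."""
    d = q.shape[-1]
    q_int8, q_scale = quantize_int8(q)
    k_int8, k_scale = quantize_int8(k)
    # Integer matmul (simulated in int32 on CPU), then dequantized with the product of scales.
    scores = (q_int8.to(torch.int32) @ k_int8.to(torch.int32).T).float()
    scores = scores * (q_scale * k_scale) / d ** 0.5
    # Softmax and the P*V product remain in floating point.
    p = torch.softmax(scores, dim=-1)
    return p @ v

# Example: 1024 tokens, head dimension 64, compared against full-precision attention.
q, k, v = torch.randn(1024, 64), torch.randn(1024, 64), torch.randn(1024, 64)
out = int8_attention(q, k, v)
ref = torch.softmax((q @ k.T) / 64 ** 0.5, dim=-1) @ v
print("max abs error vs. full precision:", (out - ref).abs().max().item())
```

The error printed at the end gives a rough sense of how much accuracy simple per-tensor INT8 scaling of Q and K costs for a single head; a production kernel would fuse these steps and choose the quantization granularity more carefully.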
