SageBwd: A Trainable Low-bit Attention
March 2, 2026
Authors: Jintao Zhang, Marco Chen, Haoxu Wang, Kai Jiang, Ion Stoica, Joseph E. Gonzalez, Jianfei Chen, Jun Zhu
cs.AI
Abstract
Low-bit attention, such as SageAttention, has emerged as an effective approach for accelerating model inference, but its applicability to training remains poorly understood. In prior work, we introduced SageBwd, a trainable INT8 attention that quantizes six of the seven attention matrix multiplications while preserving fine-tuning performance. However, SageBwd exhibited a persistent performance gap relative to full-precision attention (FPA) during pre-training. In this work, we investigate why this gap occurs and demonstrate that SageBwd can match full-precision attention during pre-training. Through experiments and theoretical analysis, we reach the following conclusions: (i) QK-norm is necessary for stable training at large tokens-per-step; (ii) quantization errors primarily arise from the backward-pass score gradient dS; (iii) reducing tokens-per-step enables SageBwd to match FPA performance in pre-training; and (iv) K-smoothing remains essential for training stability, while Q-smoothing provides limited benefit during pre-training.
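To make the K-smoothing idea concrete, the following is a minimal NumPy sketch, not the paper's kernel: K-smoothing subtracts the per-channel mean of K (where outlier channels typically concentrate) before symmetric INT8 quantization, which shrinks the quantization scale and hence the rounding error. The function names and the per-tensor scale are illustrative assumptions; the actual GPU implementation quantizes per block.

```python
import numpy as np

def smooth_and_quantize_int8(K: np.ndarray):
    """Sketch of K-smoothing + symmetric INT8 quantization (illustrative).

    Subtracting the per-channel mean removes the shared outlier component
    of K, so the remaining values span a much smaller range. The mean is
    returned so it can be folded back exactly at dequantization time.
    """
    mean = K.mean(axis=0, keepdims=True)          # per-channel mean over tokens
    K_s = K - mean                                # smoothed K, zero-mean channels
    scale = np.abs(K_s).max() / 127.0 + 1e-8      # symmetric INT8 scale
    K_int8 = np.clip(np.round(K_s / scale), -127, 127).astype(np.int8)
    return K_int8, scale, mean

def dequantize(K_int8: np.ndarray, scale: float, mean: np.ndarray):
    """Reconstruct an approximation of K; mean is added back exactly."""
    return K_int8.astype(np.float32) * scale + mean
```

Because Q @ K^T = Q @ (K - mean)^T + Q @ mean^T, the subtracted mean can be restored as an exact rank-1 correction to the attention scores, so smoothing itself introduces no error; only the INT8 rounding of the smoothed K does, and its magnitude is bounded by half a quantization step.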