SageBwd: A Trainable Low-bit Attention
March 2, 2026
Authors: Jintao Zhang, Marco Chen, Haoxu Wang, Kai Jiang, Ion Stoica, Joseph E. Gonzalez, Jianfei Chen, Jun Zhu
cs.AI
Abstract
Low-bit attention, such as SageAttention, has emerged as an effective approach for accelerating model inference, but its applicability to training remains poorly understood. In prior work, we introduced SageBwd, a trainable INT8 attention that quantizes six of the seven attention matrix multiplications while preserving fine-tuning performance. However, SageBwd exhibited a persistent performance gap relative to full-precision attention (FPA) during pre-training. In this work, we investigate why this gap occurs and demonstrate that SageBwd can match full-precision attention during pre-training. Through experiments and theoretical analysis, we reach several key conclusions: (i) QK-norm is necessary for stable training at large tokens per step; (ii) quantization errors primarily arise from the backward-pass score gradient dS; (iii) reducing tokens per step enables SageBwd to match FPA performance in pre-training; and (iv) K-smoothing remains essential for training stability, while Q-smoothing provides limited benefit during pre-training.
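To make the K-smoothing idea mentioned in point (iv) concrete, below is a minimal NumPy sketch of symmetric INT8 quantization with mean-subtraction smoothing of K. The function names (`smooth_k`, `quantize_int8`) and the per-tensor scaling granularity are illustrative assumptions, not the paper's actual kernel: the point is that subtracting the per-channel mean of K shifts every attention score in a row by the same constant, so softmax output is unchanged while K's dynamic range, and hence quantization error, shrinks.

```python
import numpy as np

def smooth_k(K):
    # K-smoothing (illustrative): subtract the per-channel mean across tokens.
    # Q @ (K - mean)^T shifts each query's scores by the same constant
    # Q @ mean^T, and softmax is invariant to a per-row constant shift,
    # so attention output is preserved while K's range shrinks.
    mean = K.mean(axis=0, keepdims=True)
    return K - mean, mean

def quantize_int8(X):
    # Symmetric per-tensor INT8 quantization: map max |value| to 127.
    scale = np.abs(X).max() / 127.0
    if scale == 0.0:
        scale = 1.0
    Xq = np.clip(np.round(X / scale), -127, 127).astype(np.int8)
    return Xq, scale

# Toy example: keys with a large shared offset, which dominates the
# quantization scale unless it is smoothed away first.
rng = np.random.default_rng(0)
K = rng.normal(loc=5.0, scale=0.1, size=(16, 8))

K_smooth, mean = smooth_k(K)
Kq, s = quantize_int8(K_smooth)
K_hat = Kq.astype(np.float32) * s + mean  # dequantize, restore the mean

err_smoothed = np.abs(K_hat - K).max()

# Without smoothing, the offset ~5.0 inflates the scale and the error.
Kq_raw, s_raw = quantize_int8(K)
err_raw = np.abs(Kq_raw.astype(np.float32) * s_raw - K).max()
```

Here `err_smoothed` is far below `err_raw`, because after smoothing the quantization scale is set by the ~0.1 spread of the keys rather than their ~5.0 offset.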