SageBwd: A Trainable Low-bit Attention

Abstract

Low-bit attention, such as SageAttention, has emerged as an effective approach for accelerating model inference, but its applicability to training remains poorly understood. In prior work, we introduced SageBwd, a trainable INT8 attention that quantizes six of the seven attention matrix multiplications while preserving fine-tuning performance. However, SageBwd showed a persistent performance gap to full-precision attention (FPA) during pre-training. In this work, we investigate why this gap occurs and demonstrate that SageBwd can match FPA during pre-training. Through experiments and theoretical analysis, we reach four main findings: (i) QK-norm is necessary for stable training when the number of tokens per step is large, (ii) quantization errors arise primarily from the backward-pass score gradient dS, (iii) reducing the number of tokens per step enables SageBwd to match FPA performance in pre-training, and (iv) K-smoothing remains essential for training stability, while Q-smoothing provides limited benefit during pre-training.
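As a rough illustration of finding (iv), the sketch below quantizes the Q K^T product to INT8 with and without K-smoothing. It is not the SageBwd kernel: it assumes that K-smoothing means subtracting the per-channel mean of K over tokens, and it uses symmetric per-tensor quantization for brevity. The helper names (quantize_int8, int8_attention_probs) are hypothetical and NumPy stands in for a GPU implementation.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization (an assumed, simplified scheme)."""
    scale = max(float(np.abs(x).max()), 1e-8) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def softmax(s):
    """Numerically stable row-wise softmax."""
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def int8_attention_probs(Q, K, smooth_k=True):
    """Attention probabilities computed from an INT8 Q K^T product."""
    if smooth_k:
        # Assumed K-smoothing: subtract the per-channel mean over tokens.
        # Every score in a row shifts by the same constant, so the
        # full-precision softmax output is unchanged.
        K = K - K.mean(axis=0, keepdims=True)
    q_i8, q_scale = quantize_int8(Q)
    k_i8, k_scale = quantize_int8(K)
    # Accumulate in int32, then dequantize with the product of the scales.
    scores = (q_i8.astype(np.int32) @ k_i8.astype(np.int32).T) * (q_scale * k_scale)
    return softmax(scores / np.sqrt(Q.shape[-1]))

# Usage: K carries a large per-channel offset, a typical outlier pattern.
rng = np.random.default_rng(0)
n, d = 32, 64
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d)) + 10.0 * rng.standard_normal((1, d))
reference = softmax(Q @ K.T / np.sqrt(d))
err_raw = np.abs(int8_attention_probs(Q, K, smooth_k=False) - reference).max()
err_smooth = np.abs(int8_attention_probs(Q, K, smooth_k=True) - reference).max()
print(f"max error without K-smoothing: {err_raw:.4f}")
print(f"max error with K-smoothing:    {err_smooth:.4f}")
```

Removing the shared offset shrinks the dynamic range of K, so the same 8 bits resolve the token-to-token variation more finely. Under these assumptions, that is one way to see why smoothing K helps.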
