SageBwd: A Trainable Low-bit Attention

Abstract

Low-bit attention, such as SageAttention, has emerged as an effective approach for accelerating model inference, but its applicability to training remains poorly understood. In prior work, we introduced SageBwd, a trainable INT8 attention that quantizes six of the seven attention matrix multiplications while preserving fine-tuning performance. However, SageBwd exhibited a persistent performance gap relative to full-precision attention (FPA) during pre-training. In this work, we investigate why this gap occurs and demonstrate that SageBwd can match FPA during pre-training. Through experiments and theoretical analysis, we reach four main conclusions: (i) QK-norm is necessary for stable training with a large number of tokens per step, (ii) quantization errors arise primarily from the score gradient dS in the backward pass, (iii) reducing the number of tokens per step enables SageBwd to match FPA performance in pre-training, and (iv) K-smoothing remains essential for training stability, while Q-smoothing provides limited benefit during pre-training.
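
To make the role of K-smoothing concrete, the following is a minimal NumPy sketch, not the SageBwd kernel. It assumes that K-smoothing means subtracting the mean of K over the token dimension before quantization, as described in the SageAttention line of work, and it uses a simple symmetric per-tensor INT8 quantizer. The helper names (`quantize_int8`, `int8_scores`), the tensor sizes, and the synthetic offset added to K are all illustrative assumptions. The sketch checks that the smoothing leaves the full-precision softmax unchanged, since it shifts every score in a row by the same constant, and compares the INT8 error with and without it.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization (illustrative, not SageBwd's scheme)."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def softmax(x):
    """Row-wise softmax with the usual max subtraction for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def int8_scores(Q, K, smooth_k):
    """Attention logits computed from INT8-quantized Q and K.

    If smooth_k is True, the token-wise mean of K is removed first
    (our assumed definition of K-smoothing).
    """
    if smooth_k:
        K = K - K.mean(axis=0, keepdims=True)
    q_int, q_scale = quantize_int8(Q)
    k_int, k_scale = quantize_int8(K)
    # Integer matmul accumulated in int32, then dequantized with both scales.
    S = (q_int.astype(np.int32) @ k_int.astype(np.int32).T) * (q_scale * k_scale)
    return S / np.sqrt(Q.shape[-1])

rng = np.random.default_rng(0)
n_tokens, head_dim = 64, 32
Q = rng.standard_normal((n_tokens, head_dim))
# Synthetic K with a large offset shared by all tokens (an assumed outlier pattern).
K = rng.standard_normal((n_tokens, head_dim)) + 30.0

# Full-precision reference, with and without smoothing.
P_ref = softmax(Q @ K.T / np.sqrt(head_dim))
K_smoothed = K - K.mean(axis=0, keepdims=True)
P_ref_smoothed = softmax(Q @ K_smoothed.T / np.sqrt(head_dim))

# Worst-case error of the INT8 attention probabilities against the reference.
err_plain = np.abs(softmax(int8_scores(Q, K, smooth_k=False)) - P_ref).max()
err_smooth = np.abs(softmax(int8_scores(Q, K, smooth_k=True)) - P_ref).max()

print("softmax unchanged by smoothing:", np.allclose(P_ref, P_ref_smoothed))
print("max error without K-smoothing:", err_plain)
print("max error with K-smoothing:   ", err_smooth)
```

In this toy setting the shared offset dominates the quantization range of K, so removing it leaves more of the INT8 range for the token-specific variation. This says nothing about the backward pass or the dS gradient, which the abstract identifies as the main source of quantization error.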
