SageBwd: A Trainable Low-bit Attention

Abstract

Low-bit attention, such as SageAttention, has emerged as an effective approach for accelerating model inference, but its applicability to training remains poorly understood. In prior work, we introduced SageBwd, a trainable INT8 attention that quantizes six of the seven attention matrix multiplications while preserving fine-tuning performance. However, SageBwd exhibited a persistent performance gap relative to full-precision attention (FPA) during pre-training. In this work, we investigate why this gap occurs and demonstrate that SageBwd can match FPA during pre-training. Through experiments and theoretical analysis, we reach four main conclusions: (i) QK-norm is necessary for stable training when the number of tokens per step is large, (ii) quantization errors primarily arise from the backward-pass score gradient dS, (iii) reducing the number of tokens per step enables SageBwd to match FPA performance in pre-training, and (iv) K-smoothing remains essential for training stability, while Q-smoothing provides limited benefit during pre-training.
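To make the role of K-smoothing concrete, the following is a minimal sketch in plain Python, not the SageBwd kernel. It assumes that K-smoothing means subtracting the per-channel mean of K over tokens before quantization, and it uses symmetric per-tensor INT8 quantization of K only. Both choices are simplifying assumptions for illustration, and all function names and toy values are hypothetical. Because the subtracted term shifts every score in a row by the same constant, the softmax output is unchanged in exact arithmetic, while the smaller dynamic range of the smoothed K reduces the INT8 rounding error.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def quantize_int8(matrix):
    # Symmetric per-tensor INT8 quantization (an illustrative assumption;
    # a real kernel would typically use finer-grained scales).
    max_abs = max(abs(x) for row in matrix for x in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [[max(-127, min(127, round(x / scale))) for x in row] for row in matrix]
    return q, scale

def dequantize(q, scale):
    return [[v * scale for v in row] for row in q]

def smooth_k(K):
    # Assumed form of K-smoothing: subtract the per-channel mean over tokens.
    n, d = len(K), len(K[0])
    mean = [sum(K[t][c] for t in range(n)) / n for c in range(d)]
    return [[K[t][c] - mean[c] for c in range(d)] for t in range(n)]

def attention_probs(Q, K):
    # softmax(Q K^T), row by row (the 1/sqrt(d) factor is omitted for brevity).
    scores = [[sum(a * b for a, b in zip(q, k)) for k in K] for q in Q]
    return [softmax(row) for row in scores]

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

# Toy inputs: K has a large offset shared by all tokens in each channel.
Q = [[0.5, -0.25], [0.1, 0.3]]
K = [[10.1, -8.2], [9.9, -7.9], [10.3, -8.1], [9.7, -7.8]]

reference = attention_probs(Q, K)

# Smoothing alone leaves the attention probabilities unchanged.
smoothed_exact = attention_probs(Q, smooth_k(K))

# INT8 round trip of K, without and with smoothing.
k_plain = dequantize(*quantize_int8(K))
k_smooth = dequantize(*quantize_int8(smooth_k(K)))

err_plain = max_abs_diff(reference, attention_probs(Q, k_plain))
err_smooth = max_abs_diff(reference, attention_probs(Q, k_smooth))
```

On these toy values, `err_smooth` is smaller than `err_plain` because the shared offset no longer consumes the INT8 range. The sketch covers only the forward score computation and says nothing about the backward-pass gradient dS discussed above.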
