Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention

Abstract

The pursuit of computational efficiency has driven the adoption of low-precision formats for training transformer models. However, this progress is often hindered by notorious training instabilities. This paper provides the first mechanistic explanation for a long-standing, unresolved failure case in which training with flash attention in low-precision settings leads to catastrophic loss explosions. Our in-depth analysis reveals that the failure is not a random artifact but is caused by two intertwined phenomena: the emergence of similar low-rank representations within the attention mechanism, and the compounding effect of biased rounding errors inherent in low-precision arithmetic. We demonstrate how these factors create a vicious cycle of error accumulation that corrupts weight updates and ultimately derails the training dynamics. To validate our findings, we introduce a minimal modification to flash attention that mitigates the bias in rounding errors. This simple change stabilizes the training process, confirming our analysis and offering a practical solution to this persistent problem.
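As a rough illustration of what "biased rounding errors" can mean in practice, the sketch below emulates a bfloat16 accumulator in pure Python and adds the same small positive increment many times. It is a generic toy example, not the paper's analysis or its flash attention modification; the helper names `to_bf16` and `accumulate`, the increment 0.001, and the iteration count are all assumptions chosen for illustration.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest-even).

    Sketch only: assumes x is finite and well inside the float32 range.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    bits += 0x7FFF + ((bits >> 16) & 1)  # round half to even on the dropped 16 bits
    bits &= 0xFFFF0000  # keep sign, exponent, and the top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def accumulate(step: float, n: int, low_precision: bool) -> float:
    """Add `step` to a running total n times, optionally rounding after each add."""
    total = 0.0
    for _ in range(n):
        total += step
        if low_precision:
            total = to_bf16(total)  # emulate a bfloat16 accumulator
    return total


step = to_bf16(0.001)  # make the increment itself exactly representable
reference = accumulate(step, 10_000, low_precision=False)
low = accumulate(step, 10_000, low_precision=True)

print(f"reference sum:        {reference:.4f}")
print(f"bfloat16 accumulator: {low:.4f}")
print(f"signed error:         {low - reference:.4f}")  # negative: errors do not cancel
```

Once the running total is large enough that the increment falls below half of its spacing between representable values, every addition rounds back to the same total, so the errors all point in one direction instead of averaging out. Same-sign, data-dependent errors of this kind are what distinguish biased rounding from zero-mean noise.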
