Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention

Abstract

The pursuit of computational efficiency has driven the adoption of low-precision formats for training transformer models. However, this progress is often hindered by notorious training instabilities. This paper provides the first mechanistic explanation for a long-standing, unresolved failure case in which training with flash attention in low-precision settings leads to catastrophic loss explosions. Our in-depth analysis reveals that the failure is not a random artifact but is caused by two intertwined phenomena: the emergence of similar low-rank representations within the attention mechanism, and the compounding effect of biased rounding errors inherent in low-precision arithmetic. We demonstrate how these factors create a vicious cycle of error accumulation that corrupts weight updates and ultimately derails the training dynamics. To validate our findings, we introduce a minimal modification to flash attention that mitigates the bias in rounding errors. This simple change stabilizes the training process, confirming our analysis and offering a practical solution to this persistent problem.
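As a generic illustration of what "biased rounding error" can mean in low-precision arithmetic, the sketch below emulates bfloat16 rounding in pure Python and accumulates same-sign values. This is not the paper's analysis or its proposed fix. The choice of bfloat16, the helper names `to_bf16` and `bf16_running_sum`, and the toy input are all assumptions made for this example, and the emulation is only intended for finite values of moderate magnitude.

```python
import struct

def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even), returned as a float."""
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the top 16 bits; the added term rounds to nearest, ties to even.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def bf16_running_sum(values):
    """Accumulate values, rounding the running total to bfloat16 after every add."""
    total = 0.0
    for v in values:
        total = to_bf16(total + to_bf16(v))
    return total

# Hypothetical toy input: 1000 identical positive addends.
ones = [1.0] * 1000
exact = sum(ones)                 # 1000.0 in double precision
approx = bf16_running_sum(ones)   # running total rounded to bfloat16 each step
signed_error = approx - exact     # negative: the roundings all go the same way
```

Because bfloat16 carries 8 significant bits, the running total stalls at 256: from there, 256 + 1 is a tie that rounds back down to 256 on every step. The error is therefore one-sided rather than zero-mean, so it grows with the number of additions instead of averaging out.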
