
低精度Transformer训练失败之因:基于Flash Attention的分析

Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention

October 5, 2025
作者: Haiquan Qiu, Quanming Yao
cs.AI

Abstract

The pursuit of computational efficiency has driven the adoption of low-precision formats for training transformer models. However, this progress is often hindered by notorious training instabilities. This paper provides the first mechanistic explanation for a long-standing and unresolved failure case in which training with flash attention in low-precision settings leads to catastrophic loss explosions. Our in-depth analysis reveals that the failure is not a random artifact but is caused by two intertwined phenomena: the emergence of similar low-rank representations within the attention mechanism and the compounding effect of biased rounding errors inherent in low-precision arithmetic. We demonstrate how these factors create a vicious cycle of error accumulation that corrupts weight updates, ultimately derailing the training dynamics. To validate our findings, we introduce a minimal modification to flash attention that mitigates the bias in rounding errors. This simple change stabilizes the training process, confirming our analysis and offering a practical solution to this persistent problem.
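
To make the rounding-bias mechanism concrete, below is a minimal sketch (an assumed illustration, not the paper's code or its proposed fix). It accumulates many small, same-signed contributions in bfloat16, loosely mimicking block-wise sums over similar value vectors inside attention, and compares the result against float32 accumulation. Once each increment falls below half an ulp of the running sum, every rounding error has the same sign, so the error compounds instead of cancelling.

```python
# Minimal sketch (assumed illustration, not the paper's code): biased rounding
# error in a low-precision accumulator versus a higher-precision one.
import torch

# Many small, same-signed increments, loosely mimicking contributions from
# similar (low-rank) value vectors weighted by attention scores.
increments = torch.full((10_000,), 1e-4, dtype=torch.float32)

# Reference: accumulate everything in float32.
ref = 1.0 + increments.sum().item()

# Low precision: keep the running sum in bfloat16 throughout.
acc_bf16 = torch.tensor(1.0, dtype=torch.bfloat16)
for x in increments.to(torch.bfloat16):
    acc_bf16 = acc_bf16 + x  # each addition rounds to the nearest bfloat16

# Contrast: same bfloat16 inputs, but the accumulator is kept in float32.
acc_fp32 = torch.tensor(1.0, dtype=torch.float32)
for x in increments.to(torch.bfloat16):
    acc_fp32 = acc_fp32 + x.float()

print(f"float32 reference:    {ref:.4f}")             # close to 2.0
print(f"bfloat16 accumulator: {acc_bf16.item():.4f}")  # stuck at 1.0
print(f"float32 accumulator:  {acc_fp32.item():.4f}")  # close to 2.0
# Near 1.0 a bfloat16 ulp is 2^-7 ≈ 0.0078, so adding 1e-4 rounds straight back
# to the current accumulator value: every rounding error has the same sign and
# the bias compounds instead of cancelling out.
```

Keeping the accumulator in higher precision is used here purely for contrast as one generic way to remove such bias; the paper's actual modification is a minimal change inside flash attention itself and is not reproduced here.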